Re: memory testing

Wednesday, 15 July 2020

On Wed, 2020-07-15 at 11:11 -0600, Chris Murphy wrote:
...
 Hi,

 While bad RAM is uncommon, it comes up with some regularity to cause
 folks a lot of grief. I'm wondering if there's a way to make it easier
 to get bad news :-\ In particular there are cases where RAM defects
 just don't show up with a few hours of memtest86+, it can take days of
 contiguous testing, which is so inconvenient the test itself seems
 worse.

An interesting feature many people don't know about is EDAC for ECC
RAM. When a memory error occurs, the kernel will log a message like:

EDAC MC0: CE page 0x6ba7a, offset 0x800, grain 128, syndrome 0xf8, row 0, channel 0, label "": i3000 CE

and keep a running count (since boot) under
/sys/devices/system/edac/mc. You can track down errors to a specific
memory stick (if you have a secret decoder ring for your motherboard).
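
As a rough sketch of reading those counters, assuming the usual layout of
one mcN directory per memory controller with ce_count (corrected errors)
and ue_count (uncorrected errors) files inside it. The function name and
the base parameter are made up for illustration and are not part of
edac-utils:

```python
from pathlib import Path

# Directory named in the text above; overridable so the function can be
# pointed at a copy of the tree for testing.
EDAC_MC = Path("/sys/devices/system/edac/mc")


def read_edac_counts(base=EDAC_MC):
    """Return {controller: {"ce": int, "ue": int}} for each mcN directory.

    Assumption: every controller exposes ce_count and ue_count. A file
    that is missing or unreadable is simply left out of the result.
    """
    counts = {}
    for mc in sorted(Path(base).glob("mc[0-9]*")):
        entry = {}
        for key, filename in (("ce", "ce_count"), ("ue", "ue_count")):
            try:
                entry[key] = int((mc / filename).read_text().strip())
            except (OSError, ValueError):
                continue
        counts[mc.name] = entry
    return counts
```

Calling read_edac_counts() on a machine without an EDAC driver loaded
should just return an empty dict, since there are no mcN directories to
walk.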

At a previous employer, we wrote a custom nagios plugin to monitor those
counts and alert us to errors on our servers.
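
This is not that plugin, only a minimal sketch of what such a check
might look like. It assumes the mcN/ce_count and mcN/ue_count files under
the EDAC sysfs directory and uses the standard nagios exit codes (0 OK,
1 WARNING, 2 CRITICAL, 3 UNKNOWN). The policy of warning on any corrected
error and going critical on any uncorrected one is my assumption; adjust
to taste.

```python
import sys
from pathlib import Path

# Standard nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def sum_counts(base, filename):
    """Sum one counter across all mcN directories; None if none readable."""
    total, found = 0, False
    for path in Path(base).glob("mc[0-9]*/" + filename):
        try:
            total += int(path.read_text().strip())
            found = True
        except (OSError, ValueError):
            pass
    return total if found else None


def check_edac(base="/sys/devices/system/edac/mc"):
    """Return (exit_code, message) in the usual nagios style."""
    ce = sum_counts(base, "ce_count")
    ue = sum_counts(base, "ue_count")
    if ce is None and ue is None:
        return UNKNOWN, "EDAC UNKNOWN - no memory controller counters found"
    ce, ue = ce or 0, ue or 0
    if ue > 0:  # assumed policy: any uncorrected error is critical
        return CRITICAL, f"EDAC CRITICAL - {ue} uncorrected, {ce} corrected"
    if ce > 0:  # assumed policy: any corrected error is a warning
        return WARNING, f"EDAC WARNING - {ce} corrected errors since boot"
    return OK, "EDAC OK - no memory errors since boot"


def main():
    code, message = check_edac()
    print(message)
    sys.exit(code)
```

A real plugin would call main() as its entry point. Since the counts
reset at boot, a check like this keeps alerting until the machine is
rebooted or the counters are cleared.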

For more info, see edac-util and edac-ctl from the edac-utils package
and:

https://buttersideup.com/mediawiki/index.php/Main_Page

https://www.kernel.org/doc/html/latest/driver-api/edac.html

Of course you need ECC RAM, but if you care about memory errors, you
should be using it anyway.

...
 Here's what I've got so far:

 1. Fedora includes /boot/memtest86+-5.01 on every installation. But
 this is a legacy/BIOS program. The idea of recommending folks enable
 CSM/legacy BIOS just to test their RAM is questionable because it
 means disabling UEFI Secure Boot to do it. Lie in wait malware is
 perhaps rare but plausible.  UEFI native memtest86+ is not free so it
 can't be included. I kinda wonder if including this should be
 deprecated?

 2. The kernel has a built-in memory tester. Therefore it can run on
 anything. But how good is it? Is it worth enabling? Should it be
 enabled for all kernels or just debug kernels? The code is pretty
 simple, so will it catch only the worst cases of bad RAM?
 # CONFIG_MEMTEST is not set
 https://elixir.bootlin.com/linux/v5.8-rc4/source/mm/memtest.c

 3. "memory interface test" used at Google, Apache 2.0 license
 https://github.com/stressapptest/stressapptest

 4. "multiple concurrent kernel compiles" and "GCC seems to have memory
 usage patterns that reliably trigger memory errors that aren't caught
 by memtest"
 https://lore.kernel.org/linux-btrfs/799cf552-4612-56c5-b44d-59458119e2b0@...

 Example of btrfs catching a bit flip:
 https://lore.kernel.org/linux-btrfs/f42fc0d6-5dc9-dd15-9d61-53efb04fad33@...
 And also, this is not a good example of a memory tester. Some of the
 time the corruption happens before the csum is computed so, it's not
 going to catch everything.

 Any other ideas how to make this better?

 Thanks,
 -- 
 Chris Murphy

-- 
Ken Gaillot <kgaillot(a)redhat.com>
