[zfs-discuss] CHKSUM errors

Durval Menezes durval.menezes at gmail.com
Sat Feb 6 10:21:38 EST 2016


Hi Gordan,

On Sat, Feb 6, 2016 at 1:05 PM, Gordan Bobic via zfs-discuss <
zfs-discuss at list.zfsonlinux.org> wrote:

> On 06/02/16 11:47, Durval Menezes via zfs-discuss wrote:
>
>> Hello,
>>
>> On Feb 6, 2016 8:04 AM, "Gordan Bobic via zfs-discuss"
>> <zfs-discuss at list.zfsonlinux.org> wrote:
>>  >
>>  >
>>  >
>>  >> On Fri, Feb 5, 2016 at 11:39 PM, Ruben Kelevra via zfs-discuss
>> <zfs-discuss at list.zfsonlinux.org> wrote:
>>  >>
>>  >> Hopefully you have a backup; if not, now is the time to make one.
>>  >>
>>  >> The next step would be a RAM test: boot from a memtest86+ drive (a
>> USB stick will do) and run it for 24 hours. If you've enabled memory
>> scrambling (it may not be available on your mainboard), disable it so
>> you can identify the broken chip.
>>  >
>>  >
>>  > It really surprises me that people still recommend memtest for
>> identifying anything other than thoroughly, permanently broken memory
>> (which will manifest very obviously with crashes and errors all over the
>> place). memtest is notoriously useless at provoking errors in merely
>> marginal components, be it memory, MCH, or CPU. There are far better
>> tools for stability testing, and the best memtest can do is give a false
>> sense of security.
>>
>> Agreed. In my experience, memtest only detects the most obvious,
>> egregious errors.
>>
>>  > There is a script somewhere in the archives of this list that
>> automates testing with my preferred method when no better tools are
>> available. In a nutshell:
>>  >
>>  > 1) Boot into single user mode, make sure swap is disabled
>>  > 2) Create as big a tmpfs as possible (size $s), but make sure it
>> isn't so big that it causes OOM errors during testing
>>  > 3) Create $n files from /dev/urandom where $n is the number of
>> hardware threads on your CPU. Make sure that each file is just under
>> $s/$n in size.
>>  > 4) Write a script that:
>>  > 4.1) Computes checksum of each file. For memory/MCH stability testing
>> use md5sum. For CPU stability testing use sha512sum. Store the output of
>> this in a variable.
>>  > 4.2) In an infinite loop, keep doing this, with $n threads, one thread
>> per file created earlier, and compare the checksum computed in each
>> iteration with the one originally computed. If a discrepancy from the
>> originally computed checksum is encountered, report an error.
>>  >
>>  > I have seen this approach yield error reports within minutes on
>> machines where 48 hours of memtest86 had found no problems at all.
>>
>> The script I use and recommend is this one:
>>
>> http://people.redhat.com/dledford/memtest.shtml
>>
>> Internally (and automatically) it does something very similar to the
>> above, only it uses a Linux kernel source tree in place of /dev/urandom,
>> diff instead of md5sum, and it doesn't limit itself to the number of
>> processors: it parallelizes just enough threads to overflow the buffer
>> cache and therefore force disk I/O, which stresses the memory subsystem
>> even more due to concurrent access from the system peripheral bus, and
>> also loads the power supply, which has to feed the disks.
>>
>> Dledford's memtest is a classic in my labs, and I highly recommend it
>> along with mprime running in "stress testing" mode to exercise every
>> last bit of the machine's CPU, power supply, memory, etc.
>>
>> In my experience, if a machine goes through 24 hours of that combo with
>> no errors reported, then it's good not only for immediate production use,
>> but will stay that way for at least the next 2-3 years.
>>
>
> If it is forcing disk I/O, it implicitly means it is under-stressing the
> memory I/O because the latter is orders of magnitude faster than the
> former.


Agreed, but don't forget that we are doing the above in a
multi-tasking/threading way (i.e., there's more than one thread of
execution stressing memory); a few of those threads blocked waiting for
I/O won't diminish the hammering the others keep applying to the memory.
That's also why I recommend running mprime in its "stress-testing" mode:
its multiple threads will pick up any slack left by the
dledford/memtest.sh processes.

Now that I think I've demonstrated there will be no let-up on the memory,
I should note that there's a good reason for doing I/O during memory
testing: it stresses multiple *types* of memory access. To be more
precise, most (all?) disk controllers in serious use today employ some
form of DMA for their transfers, and I've seen situations where memory
errors only show up during DMA.
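
For the archives, here's roughly what the checksum-loop approach Gordan
described boils down to in shell. This is an untested sketch of mine, not
his actual script (nor dledford's); the mount point, sizes and file names
are all made up, so adjust them for your box:

#!/bin/bash
# Sketch of the checksum-loop stress test described earlier in this
# thread. Assumes bash, GNU coreutils and root (for the tmpfs mount).
# Run from single-user mode with swap off (swapoff -a), as Gordan said.

SIZE_MB=4096                     # total tmpfs size ($s) -- tune to just
                                 # under free RAM to avoid OOM
N=$(nproc)                       # one worker per hardware thread ($n)
PER_FILE=$(( SIZE_MB / N - 1 ))  # keep each file just under $s/$n

mkdir -p /mnt/ramtest
mount -t tmpfs -o size=${SIZE_MB}m tmpfs /mnt/ramtest

# Fill the tmpfs with random data, one file per worker, and record a
# baseline checksum for each (swap in sha512sum to lean harder on the CPU)
for i in $(seq 1 "$N"); do
    dd if=/dev/urandom of=/mnt/ramtest/f$i bs=1M count="$PER_FILE" \
        2>/dev/null
    md5sum /mnt/ramtest/f$i > /mnt/ramtest/f$i.md5
done

# Hammer: each worker re-checksums its file forever, reporting the moment
# the checksum no longer matches the baseline
for i in $(seq 1 "$N"); do
    (
        while md5sum --check --quiet /mnt/ramtest/f$i.md5; do :; done
        echo "ERROR: checksum mismatch on f$i at $(date)"
    ) &
done
wait

Running a plain dd from a disk to /dev/null alongside it is a cheap way
to add the DMA traffic I mentioned above.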


> Something that pushes the memory I/O paths as hard as possible will shake
> loose marginal memory much more reliably.


The idea is to keep hammering them as hard as possible, not only
volume-wise, but also type-wise.
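
To be concrete, by "that combo" I mean something along these lines (an
example only: check dledford's page above for memtest.sh's actual usage;
-t is the torture-test switch in the mprime builds I've used):

swapoff -a      # don't let swap mask marginal RAM
./memtest.sh &  # dledford's script, from the URL earlier in the thread
./mprime -t &   # mprime/Prime95 torture test on all cores

Let both run for 24 hours and watch for any error reports.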

Cheers,
-- 
   Durval.



>
>
> Gordan
>