[zfs-discuss] zpool scrub repeatedly detects checksum errors
philip.papadopoulos at gmail.com
Fri Sep 13 10:40:02 EDT 2013
You have bad hardware, plain and simple. If you insist on using
hardware RAID (instead of letting ZFS see the raw disks
directly), then you need to make sure that your RAID BIOS is the
latest. If it is, and ZFS is still reporting
checksum errors, I would point to bad RAM within the RAID controller itself.
It's NOT a software problem. You should actually write the inventors
of ZFS and thank them: ZFS is telling you that your hardware is lying
to you instead of merrily feeding you erroneous and corrupted data.
I've been in your shoes. The raid controller was an Areca controller
just presenting disks as JBODs. Firmware updates helped quite a bit
in the early days. Finally kicked a really flaky one to the curb in
favor of an Adaptec HBA. Routine scrubs. Zero Checksum Errors. What
was changed? The adapter.
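For reference, the check-and-retest cycle discussed in this thread is just a few commands. This is a minimal sketch, not a prescription; the pool name "zpool" and the device names are taken from the status output quoted below, and the zpool invocations are shown as comments since they require a live pool:

```shell
# Illustrative cycle (assumes a pool named "zpool", as in the status
# output quoted in this thread):
#   zpool clear zpool        # reset the READ/WRITE/CKSUM counters
#   zpool scrub zpool        # run a fresh scrub
#   zpool status -v zpool    # see which vdevs accumulated CKSUM errors

# Pulling the devices with nonzero CKSUM counts out of a saved
# `zpool status` listing (column 5 is CKSUM):
awk '$1 ~ /^xvd/ && $5+0 > 0 { print $1, $5 }' <<'EOF'
NAME     STATE     READ WRITE CKSUM
zpool    ONLINE       0     0     0
  xvde   ONLINE       0     0   446
  xvdf   ONLINE       0     0   459
  xvda3  ONLINE       0     0     0
  xvdd1  ONLINE       0     0     0
EOF
```

Against the listing above this prints `xvde 446` and `xvdf 459`: if a second clear-then-scrub cycle still shows nonzero counts on the same devices, the corruption is ongoing, not residue from a past event.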
On Fri, Sep 13, 2013 at 6:21 AM, Achim Gottinger <achim at ag-web.biz> wrote:
> On 13.09.2013 14:18, Gregor Kopka wrote:
>> On 13.09.2013 12:21, Achim Gottinger wrote:
>>> This is the output with results from the second scrub, after I had checked
>>> the involved raid1's and discs for errors.
>>> Discs and raids are OK,
>> I tend to argue against that.
>>> the system ran flawlessly since the previous scrub
>> Then why do you have errors in the data?
>>> and I had cleaned the errors before I ran scrub again.
>>> I wonder where these checksum errors come from, because I had expected
>>> they had been fixed during the first scrub and by me removing the affected two
>>> I run zfs/spl version 0.6.2, btw.
>>> pool: zpool
>>> state: ONLINE
>>> status: One or more devices has experienced an unrecoverable error. An
>>> attempt was made to correct the error. Applications are unaffected.
>>> action: Determine if the device needs to be replaced, and clear the errors
>>> using 'zpool clear' or replace the device with 'zpool replace'.
>>> see: http://zfsonlinux.org/msg/ZFS-8000-9P
>>> scan: scrub repaired 1,07M in 5h51m with 0 errors on Fri Sep 13
>>> 01:15:56 2013
>>> NAME STATE READ WRITE CKSUM
>>> zpool ONLINE 0 0 0
>>> xvde ONLINE 0 0 446
>>> xvdf ONLINE 0 0 459
>>> xvda3 ONLINE 0 0 0
>>> xvdd1 ONLINE 0 0 0
>>> errors: No known data errors
>> This is _bad_.
>> Given your limitation, in case you have a backup of the pool (or can make one):
>> 1) find out why your hardware feeds you defective data and fix the issue!
>> => Have you checked the memory?
> The system has 32 GB of registered ECC memory, but the host's kernel has no
> EDAC support built in, so I can't look for errors with edac-utils.
> As a first test I disabled read and write caching for the raid1's on the
> Adaptec controller, and I'll remove the log and cache devices from the pool.
> I cleared the errors and started zpool scrub, and it is still finding errors.
> Guess I'll have to scrub twice to make sure there are no more new errors.
> I have a backup on an additional 4TB HD; the last time I sent snapshots over
> to that disc was after the first scrub test. I did a scrub of that disc
> afterwards and it has no checksum errors. The disc is connected to an onboard
> SATA port, and that onboard SATA controller is passed through to the VM. I may
> get another 4TB drive and use that controller for a simple mirror if I cannot
> figure out the problem in my current setup.
>> 2) hand the 4 drives to the VM JBOD style, partition each of them with a small
>> space for the system to boot from and the rest for the data pool (2 mirrors),
>> which you then restore from backup.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to zfs-discuss+unsubscribe at zfsonlinux.org.
Philip Papadopoulos, PhD
University of California, San Diego