[zfs-discuss] Scrub shows CKSUM error, disks don't record any...
gordan.bobic at gmail.com
Thu May 23 04:29:18 EDT 2013
On 05/23/2013 08:59 AM, Swâmi Petaramesh wrote:
> I have an Ubuntu 13.04 running on top of a ZFS pool made from a mirror of 2x 1
> TB SATA disks.
> The machine is working fine and has always been, perfectly stable, so I assume
> that the hardware, CPU and RAM are OK.
> The ZFS pool itself is a bit more than 2 years old, AFAIR.
> SMART shows that all the HDs are healthy and have recorded no bad or relocated
> sectors ever.
Disks lie about this. I have observed some WD and Samsung models in
particular doing it: pending sectors appear, you overwrite them, the
pending count goes back to 0, and the reallocated count stays at 0.
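
A minimal sketch of how one could watch for that pattern, in Python. It
assumes you have saved the attribute table printed by `smartctl -A`
before and after overwriting the suspect sectors. The sample tables, the
function names and the choice of watched attribute IDs (5, 197, 198) are
illustrative assumptions, not data from the disks discussed in this
thread.

```python
# Attribute IDs as named by smartmontools:
#   5   Reallocated_Sector_Ct
#   197 Current_Pending_Sector
#   198 Offline_Uncorrectable
WATCHED = {5, 197, 198}

# Made-up sample tables in the column layout of `smartctl -A`.
BEFORE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
"""

AFTER = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
"""


def parse_smart_raw(text):
    """Return {attribute_id: raw_value} for the watched attributes."""
    raw = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows start with a numeric ID and have at least 10 columns;
        # the header row starts with "ID#" and is skipped.
        if len(fields) >= 10 and fields[0].isdigit():
            attr_id = int(fields[0])
            if attr_id in WATCHED and fields[9].isdigit():
                raw[attr_id] = int(fields[9])
    return raw


def pending_vanished_unlogged(before, after):
    """True if pending sectors dropped to 0 while the reallocated count did not move."""
    had_pending = before.get(197, 0) > 0
    pending_cleared = after.get(197, 0) == 0
    realloc_unchanged = after.get(5, 0) == before.get(5, 0)
    return had_pending and pending_cleared and realloc_unchanged


if __name__ == "__main__":
    print(pending_vanished_unlogged(parse_smart_raw(BEFORE), parse_smart_raw(AFTER)))
```

This only flags the pattern between two snapshots; what to conclude
from a hit is up to you.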
> As the disks are getting a bit old however, I've started to scrub the pool.
You should not be assuming reliability based on age. Disks follow a
typical bathtub failure curve, so very new disks are among the more
likely to fail.
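
To illustrate what a bathtub curve looks like, here is a rough Python
sketch that models the failure rate as the sum of a falling
early-failure term, a constant term and a rising wear-out term. The
Weibull form and every parameter value below are assumptions picked
only to produce the shape; they are not fitted to any real disk data.

```python
def weibull_hazard(t, shape, scale):
    """Weibull hazard rate at age t > 0 (t and scale in years)."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def bathtub_hazard(t):
    """Hypothetical bathtub-shaped failure rate at age t (years)."""
    infant = weibull_hazard(t, shape=0.5, scale=2.0)   # shape < 1: falls with age
    background = 0.02                                  # constant random failures
    wearout = weibull_hazard(t, shape=5.0, scale=6.0)  # shape > 1: rises with age
    return infant + background + wearout


if __name__ == "__main__":
    for age in (0.1, 1, 3, 5, 8):
        print(f"{age:>4} years: {bathtub_hazard(age):.3f}")
```

With these made-up parameters the rate is high for a brand new disk,
lowest in the middle years, and climbs again as wear-out sets in.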
> To my surprise, scrub has found ~70 CKSUM errors on one disk the 1st time -
> and could "fix" it successfully.
> - Then it ran once without finding any error.
> - Then it ran once and reported around ~50 CKSUM errors on each disk, still
> telling that it had "fixed" this OK.
You could be having a PSU going duff - power getting noisy is known to
wreak havoc with disks.
> Still, the disks SMART don't show any recorded error whatsoever, no sector got
> either reallocated or pending reallocation, ever.
As I said - DO NOT TRUST the SMART reallocated counts, unless your disk
is showing some reallocated sectors already to prove it logs them correctly.
> Disks pass all the SMART offline tests perfectly well.
> Given that SATA disks include error detection and correction code, and that
> the SATA link is end-to-end CRC'd, how comes that ZFS scrub could detect and
> "fix" CKSUM errors, where the disk itself and SATA subsystem never detected any
One of the two reasons I mentioned above. Very common. I see it
literally every day on various servers.
> I'm at the point where I'm VERY puzzled. Should I suspect my disks to be
> slowly dying without telling ? Should I suspect bad RAM ? Should I suspect a
> weird bug in the ZFS code ?
Bad RAM is a possibility if you don't have ECC RAM. Is the machine
IMO a ZFS bug is the least likely explanation in this case.