[zfs-discuss] Scrub shows CKSUM error, disks don't record any...

Gordan Bobic gordan.bobic at gmail.com
Thu May 23 04:29:18 EDT 2013


On 05/23/2013 08:59 AM, Swâmi Petaramesh wrote:
> Hi,
>
> I have an Ubuntu 13.04 running on top of a ZFS pool made from a mirror of 2x 1
> TB SATA disks.
>
> The machine is working fine and has always been, perfectly stable, so I assume
> that the hardware, CPU and RAM are OK.
>
> The ZFS pool itself is a bit more than 2 years old, AFAIR.
>
> SMART shows that all the HDs are healthy and have recorded no bad or relocated
> sectors ever.

Disks lie about this. I have observed some WD and Samsung models in 
particular doing it: pending sectors appear, you overwrite them, the 
pending count goes to 0, and the reallocated count stays at 0.

See:
http://www.altechnative.net/2011/03/21/the-appalling-design-of-hard-disks/
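One way to check whether a particular disk behaves like this is to 
record the two counters before and after overwriting a pending sector. 
Below is a minimal Python sketch, assuming the usual ten-column table 
printed by `smartctl -A`. The function names and the sample text are 
made up for illustration and are not output from any real disk.

```python
import re

# SMART attribute IDs: 5 = Reallocated_Sector_Ct, 197 = Current_Pending_Sector.
WATCHED = {5: "reallocated", 197: "pending"}

def parse_smart_attributes(text):
    """Extract raw values of the watched attributes from `smartctl -A` text."""
    counts = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the header row starts with "ID#" and is skipped by the isdigit() test.
        if len(fields) >= 10 and fields[0].isdigit():
            attr_id = int(fields[0])
            if attr_id in WATCHED:
                raw = re.match(r"\d+", fields[9])
                if raw:
                    counts[WATCHED[attr_id]] = int(raw.group())
    return counts

def pending_cleared_without_realloc(before, after):
    """True if pending dropped to 0 while the reallocated count did not grow."""
    return (before.get("pending", 0) > 0
            and after.get("pending", 0) == 0
            and after.get("reallocated", 0) <= before.get("reallocated", 0))

# Illustrative sample text only (hypothetical values).
SAMPLE_BEFORE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
"""

SAMPLE_AFTER = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
"""

if __name__ == "__main__":
    before = parse_smart_attributes(SAMPLE_BEFORE)
    after = parse_smart_attributes(SAMPLE_AFTER)
    print(before, after, pending_cleared_without_realloc(before, after))
```

In practice the two text snapshots would come from running 
`smartctl -A` on the disk before and after the overwrite. A True 
result only flags the pattern described above; it does not by itself 
prove the disk remapped anything.
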

> As the disks are getting a bit old however, I've started to scrub the pool.

You should not be assuming reliability based on age. Disks follow a 
typical bathtub failure curve, so very new disks are among the most 
likely to fail.

> To my surprise, scrub has found ~70 CKSUM errors on one disk the 1st time -
> and could "fix" it successfully.
>
> - Then it ran once without finding any error.
>
> - Then it ran once and reported around ~50 CKSUM errors on each disk, still
> telling that it had "fixed" this OK.

Your PSU could be going duff - noisy power is known to wreak havoc 
with disks.
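
To see whether the errors stay on one disk or show up on both (the 
latter hints at something shared, such as the power supply), it can 
help to log the per-device CKSUM column after each scrub. Below is a 
minimal Python sketch, assuming the usual NAME/STATE/READ/WRITE/CKSUM 
table printed by `zpool status`. The pool name `tank`, the device 
names, and the function names are hypothetical.

```python
def parse_cksum_counts(status_text):
    """Return {name: cksum_count} from the config table of `zpool status` text."""
    counts = {}
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:2] == ["NAME", "STATE"]:
            in_config = True          # header row of the device table
            continue
        if in_config:
            if not fields:
                break                 # a blank line ends the table
            # Rows look like: NAME STATE READ WRITE CKSUM.
            # Counts that are not plain integers are skipped by this sketch.
            if len(fields) >= 5 and fields[4].isdigit():
                counts[fields[0]] = int(fields[4])
    return counts

def devices_with_errors(history):
    """Names that showed a non-zero CKSUM count in any recorded scrub."""
    return {name for scrub in history for name, n in scrub.items() if n > 0}

# Illustrative sample text only (hypothetical pool and device names).
SAMPLE_STATUS = """\
  pool: tank
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sda     ONLINE       0     0    70
            sdb     ONLINE       0     0     0

errors: No known data errors
"""

if __name__ == "__main__":
    counts = parse_cksum_counts(SAMPLE_STATUS)
    print(counts, devices_with_errors([counts]))
```

Feeding it one `zpool status` snapshot per scrub builds up a history. 
If both halves of the mirror end up in the result, a shared cause is 
worth checking before blaming either disk.
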

> Still, the disks SMART don't show any recorded error whatsoever, no sector got
> either reallocated or pending reallocation, ever.

As I said - DO NOT TRUST the SMART reallocated counts, unless your disk 
is showing some reallocated sectors already to prove it logs them correctly.

> Disks pass all the SMART offline tests perfectly well.
>
> Given that SATA disks include error detection and correction code, and that
> the SATA link is end-to-end CRC'd, how comes that ZFS scrub could detect and
> "fix" CKSUM errors, where the disk itself and SATA subsystem never detected any
> ?

One of the two reasons I mentioned above: disks that do not log their 
reallocations honestly, or noisy power. Very common. I see it 
literally every day on various servers.

> I'm at the point where I'm VERY puzzled. Should I suspect my disks to be
> slowly dying without telling ? Should I suspect bad RAM ? Should I suspect a
> weird bug in the ZFS code ?

Bad RAM is a possibility if you don't have ECC RAM. Is the machine 
overclocked?

IMO a ZFS bug is the least likely explanation in this case.

Gordan
