[zfs-discuss] Permanent errors in older snapshots

Håkan Johansson h_t_johansson at fastmail.fm
Sun Dec 25 10:37:11 EST 2016

On Sun, Dec 25, 2016, at 12:11 AM, Gordan Bobic via zfs-discuss wrote:

> As for data rotting silently on disks, it happens a lot. Some disks,
> WDs in particular seem to lie about their errors sometimes. I have
> seen WDs go from pending to no reallocated many times, and seen
> failing sectors move around, with SMART showing the disk as OK but ZFS
> showing it to be severely failing all over the place.

> IIRC CERN did some research on incidence of latent disk errors (disks
> returning duff data silently), and it is high enough that it cannot be
> ignored with non-trivial amounts of data.

This does not add up for me.  The original report had a raidz2 with one
disk offline, so only one level of redundancy remained, i.e. a permanent
error requires simultaneous faults on two disks in the same block.  One
(permanent) error was reported by scrub, but *no* corrected errors.

Without the disks showing any single (correctable) errors in any other
blocks, the likelihood that they caused this error by bit rot is very
small.  (With a few such errors one could even calculate the probability,
and not just give an upper limit.)  Having seen much strangeness, I agree
that disks produce random faults, and am not arguing against that in any
way.  But seeing only this coincidence of two independent disks developing
errors in the same block, and no single errors at all, would be extremely
unlikely if the cause were random bit rot.
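
A rough sketch of that upper-limit estimate, in Python.  It assumes
independent errors with one small per-disk, per-block probability p, and
Poisson counting statistics.  The pool size (10^8 blocks) and the five
remaining disks are made-up numbers for illustration, not taken from the
original report.

```python
import math

def upper_limit_p(n_blocks, n_disks, confidence=0.95):
    """Upper limit on the per-disk, per-block error probability p,
    given that zero single (correctable) errors were observed."""
    # With Poisson mean mu = n_blocks * n_disks * p, P(zero events) = exp(-mu).
    # The limit is the mu at which seeing zero events has probability 1 - confidence.
    mu_max = -math.log(1.0 - confidence)
    return mu_max / (n_blocks * n_disks)

def prob_coincident_pair(n_blocks, n_disks, p):
    """Probability that at least one block has errors on two disks at once,
    assuming the disks fail independently (leading order in p)."""
    pairs = math.comb(n_disks, 2)          # disk pairs that could coincide
    mu_double = n_blocks * pairs * p * p   # expected number of double faults
    return -math.expm1(-mu_double)         # 1 - exp(-mu), stable for tiny mu

# Hypothetical pool: 1e8 blocks scrubbed, 5 disks still online.
n_blocks, n_disks = 10**8, 5
p_max = upper_limit_p(n_blocks, n_disks)
p_double = prob_coincident_pair(n_blocks, n_disks, p_max)
print(f"p < {p_max:.2e} per disk and block, P(double fault) < {p_double:.2e}")
```

With these made-up numbers the limit on a coincident double fault comes
out on the order of 1e-8.  Since the limit on p scales as 1/N, the bound
on the double-fault probability also falls as 1/N: the more blocks were
scrubbed without a single corrected error, the less plausible the
coincidence.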

If one wants to find the cause of this error, I think one has to look
elsewhere, i.e. at some more 'coordinated' issue: controller
firmware, block device subsystem, filesystem driver.  Virtually
undebuggable.  One thing that might give some hints in
incidents like this is to figure out what the blocks on disk
actually contain: whether they agree on the data being stored (but
mismatch the checksum), or whether the redundant copies contain data
that is not even consistent with itself.
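
To make the distinction concrete, here is a toy sketch in Python.  It is
not how ZFS lays out or checks raidz blocks: SHA-256 and a plain XOR
parity are stand-ins chosen for brevity, and all function names are
hypothetical.  It only shows which cases one would want to tell apart
once the raw columns have been read back.

```python
import hashlib
from functools import reduce

def xor_parity(columns):
    """XOR equal-length data columns together (toy single-parity scheme)."""
    return bytes(reduce(lambda a, b: a ^ b, group) for group in zip(*columns))

def classify(data_columns, stored_parity, stored_checksum):
    """Compare what is on disk against its own parity and the stored checksum."""
    # Do the data columns and the parity column agree with each other?
    self_consistent = xor_parity(data_columns) == stored_parity
    # Does the data match the checksum recorded by the filesystem?
    checksum_ok = hashlib.sha256(b"".join(data_columns)).digest() == stored_checksum
    if self_consistent and checksum_ok:
        return "healthy"
    if self_consistent:
        # All disks agree on the same wrong data: suggests a fault above the
        # disks (driver, controller), not independent bit rot.
        return "self-consistent, checksum mismatch"
    if checksum_ok:
        return "data matches checksum, parity differs"
    return "data and parity disagree"

# Toy block: three data columns plus parity and checksum as originally written.
good = [b"aaaa", b"bbbb", b"cccc"]
parity = xor_parity(good)
checksum = hashlib.sha256(b"".join(good)).digest()

# Case 1: wrong data written coherently to data and parity columns.
wrong = [b"aaaa", b"bbbb", b"cccX"]
print(classify(wrong, xor_parity(wrong), checksum))

# Case 2: one data column damaged on its own, parity untouched.
print(classify(wrong, parity, checksum))
```

The first case is the 'coordinated' signature; the second is what
independent damage on a single disk would look like.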

Best regards,


