[zfs-discuss] Power failure + disks with different caching strategies
Jens Christian Hillerup
dvn869 at alumni.ku.dk
Mon Feb 29 11:08:26 EST 2016
Recently I was hired as a systems administrator. All systems had been more-or-less unmaintained for quite a while, but most of them *were* actually still running. However, one server had a ZFS pool that wouldn't import:
status: The pool metadata is corrupted.
action: The pool cannot be imported due to damaged devices or data.
The damage had happened before my employment, but I gathered the machine had lost power at a very unfortunate point in time. Oh well. I joined #zfsonlinux and was told about rolling back to a certain TXG (transaction group). So I ran `zdb -lu /dev/sdXY` on each of the drives to figure out which TXG was the most recent. The outputs can be fetched from http://kodekode.dk/uberblocks.tar.gz
For each of the files I grepped for 'txg = ', cut out the TXG number, sorted numerically, and picked the highest one. It turns out that some disks are at transaction 12444874 while others are at 12444799.
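The extraction step above can be sketched as a small shell function. This is a hedged sketch: the dump filename in the usage comment is hypothetical, and it only assumes that `zdb -lu` prints one 'txg = N' line per uberblock, as in the tarball above.

```shell
# Minimal sketch: given one saved `zdb -lu` dump, print the highest
# uberblock TXG it contains. Each uberblock shows up as a 'txg = N' line.
highest_txg() {
    grep 'txg = ' "$1" | awk '{ print $3 }' | sort -n | tail -n 1
}
# Usage against one of the saved label dumps (filename hypothetical):
#   highest_txg sda.uberblocks.txt
```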
DeHackEd from #zfsonlinux figured that maybe this had something to do with the SAS controller (SMC2108) and caching. Went to check (http://termbin.com/wgzj) and yeah, it totally did. In terms of caching strategy, the disks that had the high TXG were set to WriteThrough (sda, sdi, sdj, sdk) and the others were set to WriteBack.
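For anyone wanting to reproduce the cache-policy check: a hedged sketch using the MegaRAID CLI that ships for LSI 2108-based controllers like the SMC2108. The binary name varies by install (MegaCli, MegaCli64, megacli), so it is made overridable here via $MEGACLI; that variable is my own convention, not part of the tool.

```shell
# Hedged sketch: list each logical drive's cache policy so the
# WriteBack vs. WriteThrough split is visible at a glance. The binary
# name varies (MegaCli / MegaCli64 / megacli); override with $MEGACLI.
show_cache_policies() {
    "${MEGACLI:-megacli}" -LDInfo -Lall -aALL | grep -i 'cache policy'
}
```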
Now, I've been trying to import the pool anyway using the undocumented -T flag, like so:
root@cicbackup:~# zpool import -F -T 12444799 -o readonly=on tank
cannot import 'tank': one or more devices is currently unavailable
... and I've tried to mix in the -X flag (though it should be implied by -T), which drastically increased the execution time but eventually yielded the same error.
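Since I may end up retrying this at several TXGs from the zdb dumps, a sketch of wrapping the attempt so each candidate is one call (TXG values and pool name are from my attempts above; `import_at_txg` is my own helper name):

```shell
# Sketch: wrap the read-only rewind import so each candidate TXG is one
# call. -T is undocumented; -F requests recovery mode. Pool name as above.
import_at_txg() {
    zpool import -F -T "$1" -o readonly=on tank
}
# e.g. try the TXG common to all disks first, then walk back through
# older uberblock TXGs listed in the zdb dumps:
#   for txg in 12444799 ...; do import_at_txg "$txg" && break; done
```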
My stance on this pool is that if I have to wipe, so be it, but it would be nice to have it imported and mounted, even with some data loss, to see what's there and to recover stuff before I start rebuilding things.