[zfs-discuss] ZoL on ARM - Semi-Success

Gordan Bobic gordan.bobic at gmail.com
Wed May 27 09:48:48 EDT 2015


Further update:

Resilvered my disks using zfs-fuse, without errors (*phew*).
Backed up the data to a spare local array for an easy/quick restore (a few
hundred GB).

Re-installed the kernel ZFS binaries.
The pools imported fine again and everything seems to be working (rsyncs,
serving files over HTTP). I also ran "find . -print0 | xargs -0 stat >
/dev/null" and saw no obvious problems.

So I kicked off a scrub.

Within seconds, I got three cksum failures across two of the disks in the
pool.

Last time, the result was two of the four disks (RAIDZ2) being kicked out
of the pool within 30 seconds or so, with zpool status claiming they were
"corrupted".

Before anyone asks, the machine has no ECC (it is a small ARM-based NAS
box with 1GB of RAM). But I am not convinced that memory errors would be
happening quite this heavily, not when the machine has been running
trouble-free for years with zfs-fuse.
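
To at least partially rule the RAM out I can run a userspace memory test,
something along these lines (memtester is just the first tool that comes
to mind, and the size is kept well under 1GB so the box stays usable):

    # allocate, lock and repeatedly test 512MB of RAM, two passes
    memtester 512M 2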

I am purely guessing here, but could there be a race condition, an
interrupt storm, an overload of some sort, or something along those lines
happening that could manifest in this way? I notice there are 115 options
to the zfs kernel module. Is there one I could tweak to limit the scrubbing
throughput (IOPS or MB/s or something like that) to see if that makes the
problem go away, to try to get closer to the root cause?
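
In case it helps anyone point me at the right knob, this is how I am
listing what is available; the two scrub-related parameters below are only
my guesses at likely candidates, and the names may well differ between
versions:

    # list all zfs module parameters and their current values
    grep -H . /sys/module/zfs/parameters/*

    # candidate scrub throttles (0.6.x-era names, if I am reading it right)
    echo 8 > /sys/module/zfs/parameters/zfs_scrub_delay
    echo 1 > /sys/module/zfs/parameters/zfs_vdev_scrub_max_active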

I just tried exporting and re-importing the pool, and one of the disks
immediately showed up as unavailable. After exporting and importing again,
3 of the 4 disks show as unavailable and the pool gets suspended.
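
For the record, the sequence was nothing more exotic than this (pool name
is a placeholder again):

    zpool export tank
    zpool import tank
    zpool status tank    # one disk UNAVAIL; repeat, and 3/4 are UNAVAIL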


There is nothing in dmesg or /var/log/messages to indicate any problems.


Is there a way to unsuspend the pool without rebooting the machine? The
disk device nodes are still responding just fine.
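
The only thing I can think of is "zpool clear", which as I understand it
should resume I/O on a suspended pool once the devices are reachable
again, but I have no idea whether it will help in this state:

    # attempt to clear the errors and resume the suspended pool
    zpool clear tank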

Seems I'll have to stick with zfs-fuse for the time being. :-(

Gordan


On Sat, May 23, 2015 at 3:45 PM, Gordan Bobic <gordan.bobic at gmail.com>
wrote:

> On Sat, May 23, 2015 at 2:52 PM, Gordan Bobic <gordan.bobic at gmail.com>
> wrote:
>
>> It seems I have this built and working against my binary kernel using the
>> vendor provided sources with a few bodges to facilitate building it, and
>> the difference in performance is quite striking. 1GB of RAM, armv5tel, no
>> special kernel module options.
>>
>> The most obvious difference is that I/O heavy workloads are no longer
>> saturating the CPU.
>>
>> Unfortunately it hasn't been entirely plain sailing. I decided to run a
>> scrub, and one of the disks fell off, with ZoL claiming it had corrupted
>> data. I stopped the scrub, exported and re-imported, and another disk fell
>> off. Since that was my entire margin for error gone, I chickened out,
>> removed zfs and reinstalled zfs-fuse to rebuild the array, since that has
>> worked fine for the past year on the same machine.
>>
>> Nothing in the logs to indicate a problem, so I am not entirely sure what
>> might have caused this, certainly no disk resets or anything of the sort.
>>
>
>
>
> Right, so the disks that fell out are resilvering using zfs-fuse (70
> hours each for 4TB drives - ouch, that's a long time to be without
> redundancy). Lesson learned: don't try anything quite this adventurous on
> a system that would take an inconveniently long time to restore should it
> all go wrong.
>
> The downside is that I don't have another ARM machine with a bunch of SATA
> ports, and getting another QNAP TS-421 just for testing seems like a rather
> expensive way to do this. Maybe I can make a similar setup using a
> DreamPlug and a PMP (if PMPs work on the DreamPlug's SATA controller).
>
> Gordan
>
>