[zfs-discuss] ZoL on ARM - Semi-Success

Gordan Bobic gordan.bobic at gmail.com
Wed May 27 10:12:33 EDT 2015


Is there a way to force the import of a pool that got suspended? I figure
that if the pool was suspended because ZFS thought too many disks were
missing, the pool must still have had at least the minimum viable number
of disks available in its pre-suspension state, so importing it from that
state ought to be possible.

# zpool import -f -F -o readonly=on -d /dev/disk/by-id zfs
cannot import 'zfs': one or more devices is currently unavailable

But the device symlinks are most definitely there in /dev/disk/by-id/,
they correspond to the correct nodes in /dev, and all the disks are
responsive.

How can I debug this further and get the pool to import? I could just
restore it from the backup, but I figure trying to get it imported and
online again might be more useful for understanding what is going wrong
in the first place.
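
For reference, the next things I was planning to try, assuming I have the
syntax right (the disk name below is just a placeholder):

Listing the importable pools and the per-device state as the import code
actually sees it:

# zpool import -d /dev/disk/by-id

Dumping the ZFS labels straight off one of the member devices (or its
first partition, depending on where the labels ended up) to check they
are still intact:

# zdb -l /dev/disk/by-id/<one-of-the-member-disks>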

Gordan


On Wed, May 27, 2015 at 2:48 PM, Gordan Bobic <gordan.bobic at gmail.com>
wrote:

> Further update:
>
> Resilvered my disks using zfs-fuse, without errors (*phew*).
> Backed up the data to a spare local array for an easy/quick restore (a few
> hundred GB).
>
> Re-installed kernel zfs binaries.
> Pools imported fine again. Everything seems to be working fine (rsyncs,
> serving out files over HTTP). I also ran "find . -print0 | xargs -0 stat >
> /dev/null" and saw no obvious problems.
>
> So I kicked off a scrub.
>
> Within seconds, I got three cksum failures split between two of the disks
> in the pool.
>
> Last time, the result was two of the four disks (RAIDZ2) being kicked out
> of the pool within 30 seconds or so, with zpool status claiming they were
> "corrupted".
>
> Before anyone asks, the machine has no ECC (it is a small ARM-based NAS
> box with 1GB of RAM). But I am not convinced that memory errors would be
> happening quite this heavily, not when the machine has been running
> trouble-free for years with zfs-fuse.
>
> I am purely guessing here, but could there be a race condition, an
> interrupt storm, an overload of some sort, or something along those lines
> happening that could manifest in this way? I notice there are 115 options
> to the zfs kernel module. Is there one I could tweak to limit the scrub
> throughput (IOPS or MB/s or something like that), to see if that makes the
> problem go away and gets me closer to the root cause?
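>
> For example (purely guessing at which knobs are relevant, and assuming
> these module parameters exist in this build), something along the lines
> of:
>
> # echo 1 > /sys/module/zfs/parameters/zfs_vdev_scrub_max_active
> # echo 8 > /sys/module/zfs/parameters/zfs_scrub_delay
>
> to throttle the scrub right down before kicking it off again.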
>
> I just tried exporting and re-importing the pool, and one of the disks
> immediately showed up as unavailable. Export and import again, and 3 of
> the 4 disks are showing as unavailable and the pool gets suspended.
>
>
> There is nothing in dmesg or /var/log/messages to indicate any problems.
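>
> The only other place I can think of to look is the ZFS event log,
> something like (if I have the syntax right):
>
> # zpool events -v
>
> though I have not yet checked whether that shows anything useful here.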
>
>
> Is there a way to unsuspend the pool without rebooting the machine? The
> disk device nodes are still responding just fine.
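>
> I assume the thing to try is zpool clear, which the man page describes as
> the way to resume I/O on a suspended pool, e.g.:
>
> # zpool clear zfs
>
> but I am not sure whether that can work while most of the devices are
> being reported as unavailable.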
>
> Seems I'll have to stick with zfs-fuse for the time being. :-(
>
> Gordan
>
>
> On Sat, May 23, 2015 at 3:45 PM, Gordan Bobic <gordan.bobic at gmail.com>
> wrote:
>
>> On Sat, May 23, 2015 at 2:52 PM, Gordan Bobic <gordan.bobic at gmail.com>
>> wrote:
>>
>>> It seems I have this built and working against my binary kernel, using
>>> the vendor-provided sources with a few bodges to facilitate building it,
>>> and the difference in performance is quite striking. This is with 1GB of
>>> RAM on armv5tel, with no special kernel module options.
>>>
>>> The most obvious difference is that I/O heavy workloads are no longer
>>> saturating the CPU.
>>>
>>> Unfortunately it hasn't been entirely plain sailing. I decided to run a
>>> scrub, and one of the disks fell off, with ZoL claiming it had corrupted
>>> data. I stopped the scrub, exported and re-imported, and another disk fell
>>> off. Since that's my entire margin for error gone, I chickened out, removed
>>> zfs and reinstalled zfs-fuse to rebuild the array, since that has worked
>>> fine for the past year on the same machine.
>>>
>>> There is nothing in the logs to indicate a problem, certainly no disk
>>> resets or anything of the sort, so I am not entirely sure what might have
>>> caused this.
>>>
>>
>>
>>
>> Right, so the disks that fell out are resilvering using zfs-fuse (70
>> hours each for 4TB drives - ouch, that's a long time to be without
>> redundancy). Lesson learned: don't try anything quite this adventurous on
>> a system that would take an inconveniently long time to restore should it
>> all go wrong.
>>
>> The downside is that I don't have another ARM machine with a bunch of
>> SATA ports, and getting another QNAP TS-421 just for testing seems like a
>> rather expensive way to do this. Maybe I can make a similar setup using a
>> DreamPlug and a PMP (SATA port multiplier), if PMPs work on the
>> DreamPlug's SATA controller.
>>
>> Gordan
>>
>>
>