[zfs-discuss] Kernel panic when operating on ZFS Datasets

Durval Menezes durval.menezes at gmail.com
Thu Dec 14 03:22:54 EST 2017


Hello Martin,

On Dec 14, 2017 02:45, "Martin Ritchie via zfs-discuss" <
zfs-discuss at list.zfsonlinux.org> wrote:

Thanks for the suggestions, guys. I did feel it was a bit risky going with a
non-LTS Ubuntu, but I had hoped for a newer version of ZFS as a result.


I stopped using Ubuntu here ages ago. Sometimes they managed to FUBAR
something critical even in their LTS releases... Alas, this follows a
general trend in the Linux world of preferring the flashy and new over the
reliable and stable (and some distros and devs do that more than others).
But I digress...

I was tempted to build a new machine with ECC RAM for the storage, but
ended up using my 'old' dev i7 desktop. I never had issues with it when it
was under heavy compilation or testing loads. But a good prime burn-in test
couldn't hurt.


Exactly. Even better, run (in parallel) mprime plus something else to
stress the disks (which will in turn stress the I/O controller, south
bridge, power supply, etc.). I generally use dledford-memtest for this:
http://people.redhat.com/dledford/memtest
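
For reference, a rough sketch of how I'd run the two side by side (the
exact invocations are from memory, so check each tool's README; the device
names below are examples, adjust for your box):

    # Terminal 1 -- CPU and RAM: mprime in torture-test mode
    ./mprime -t

    # Terminal 2 -- disks and the I/O path behind them: either the
    # dledford-memtest script above, or a crude read loop like this
    # (reads only, so safe on live disks):
    for d in /dev/sda /dev/sdb /dev/sdc /dev/sdd; do
        dd if="$d" of=/dev/null bs=1M &
    done
    wait

A box that survives a day of both at once is reasonably trustworthy.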


Could a bit flip really cause such issues?


A bit-flip can basically cause *anything*, and not just with ZFS. ZFS just
lets you know sooner, sometimes in a not-so-nice manner...

I did scrub the pool, as I hoped that might fix things, but the two
datasets still can't be read.


:-/ Sad to hear it. This is not common, but not unheard of (I've seen it
once or twice before).

I'd try other ZFS implementations (zfs-fuse, FreeBSD, illumos, etc.) before
giving up on those datasets.
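
For example, a minimal sketch (assuming the pool is named 'earth', as
elsewhere in this thread, and importing read-only so nothing can be made
worse):

    # on the Linux box, release the pool first
    zpool export earth

    # then, booted into e.g. a FreeBSD live image (or using zfs-fuse):
    zpool import -o readonly=on earth
    zfs mount -a
    ls -lR /earth    # see whether the affected datasets are readable now

If another implementation can read them, you can at least copy the data
out before rebuilding anything.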

Also, as the issue is reproducible, it might be worth searching for it, and
if nothing turns up, filing a bug on the zfsonlinux GitHub.
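
When filing, it helps to attach version info and the full panic trace;
something along these lines (paths may differ slightly per distro):

    cat /sys/module/zfs/version /sys/module/spl/version  # ZoL/SPL versions
    uname -a                                             # kernel version
    zpool status -v                                      # pool layout and errors
    dmesg | grep -iA30 'kernel BUG'                      # the panic itself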

The boot volume is fresh, so after the festivities I'll drop back to LTS
and upgrade ZFS to 0.7.x (I'll try 0.7.3 first).

Thanks for the tips. I've been running hardware RAID 6 for the last 9
years, and now that I need to upgrade the whole pool, I thought moving to
ZFS would free me from some of the hardware foibles.


If the hardware is unreliable, unfortunately nothing in software will save
you from Murphy's demon...

Guess I just swapped it for some new ones :)


Or new manifestations of the same problems...

Perhaps now is the time for that new Supermicro Denverton board
(A2SDi-4C-HLN4F),


Interesting little board and SoC; I hadn't seen it before. I guess it's the
new Avoton...

keep the i7 for compute, and isolate the storage. Guess I need to get a
backup of this data pronto.


Actually, you needed a backup of it the day before the issue :-\ but, as
they say, hindsight is always 20/20.
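
For the record, once the data is readable again, even a quick one-shot
replication to any spare box beats nothing. A minimal sketch (the hostname
'otherbox' and the target dataset 'backup/earth' are made up; adjust to
taste):

    zfs snapshot -r earth@pre-rebuild
    zfs send -R earth@pre-rebuild | ssh otherbox zfs receive -F backup/earth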

Good luck, and keep us posted,
-- 
   Durval.


Cheers
Martin


On 13 December 2017 at 12:06, Chris Siebenmann <cks at cs.toronto.edu> wrote:

> > Now the real problem is that I can't do any filesystem operations on
> > pools earth, remote, remote/TM; they all hang, processes blocked on IO.
> > A quick ps grep:
>
>  Since you've had a kernel panic, all bets are off. It's quite likely
> that the kernel panic has caused subsequent problems that are blocking
> IO to some or all pools.
>
> > The only sign of what might be going wrong is a kernel panic stack
> > trace in dmesg:
>
>  This kernel panic and stack trace is a big red flag, especially
> because of what it is. I'm going to quote selected pieces of it:
>
>         Dec 11 21:35:40 earth kernel: [  140.258196] kernel BUG at
> /build/linux-tt6jd0/linux-4.13.0/lib/string.c:985!
>         [...]
>         Dec 11 21:35:40 earth kernel: [  140.258923] RIP:
> 0010:fortify_panic+0x13/0x22
>
> This is where the kernel panicked. fortify_panic() in lib/string.c
> is an internal kernel routine that panics when the kernel detects
> that kernel code is using string functions in an unsafe way that
> would lead to buffer overflows or the like. This is obviously
> not supposed to happen; if it does, something bad has happened and
> the system is unstable from that point onward.
>
>         Dec 11 21:35:40 earth kernel: [  140.259291] Call Trace:
>         Dec 11 21:35:40 earth kernel: [  140.259363]
> zfs_acl_node_read.constprop.16+0x31a/0x320 [zfs]
>
>  This is probably the function where the error happened; the source code
> is in module/zfs/zfs_acl.c. This function does call one thing that could
> call fortify_panic() (bcopy(), for people following along); however,
> clearly this code should not be creating a buffer overflow.
>
>  This code has seen a number of changes since ZFS 0.6.5, although
> none of their commit messages say anything like 'fixed a potential
> buffer overflow'. If you can, I would try the most recent
> version of ZFS (either the latest development version or at least 0.7.4)
> to see if that makes a difference.
>
>  It's possible that this is because of a hardware issue, such as a
> flipped RAM bit. It's also possible that you've found a genuine issue in
> the code.  If this happens again after rebooting and updating to 0.7.4,
> I'd suggest reporting it as a bug in the issue tracker.
>
>  If you can, I would also scrub the pool. Unfortunately it's possible
> that you've wound up with on-disk data that now causes this ZFS panic
> when it gets touched; I'm not sure if this can be fixed. But the first
> step is to scrub the pool after a reboot, and then try reading all data
> with something like 'tar -cf /dev/null ....'.
>
> (Technically you don't have to read the data, but you do have to somehow
> trigger looking at the potential ACLs for every file. Reading everything
> is the easy way to do this.)
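>
> For concreteness, a rough sketch of that sequence (assuming the pool is
> 'earth' and is mounted at /earth; adjust to your layout):
>
>         zpool scrub earth
>         zpool status -v earth       # wait here until the scrub completes
>         tar -cf /dev/null /earth    # read everything, forcing ACL lookups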
>
>         - cks
>

