[zfs-discuss] Kernel panic when operating on ZFS Datasets
cks at cs.toronto.edu
Wed Dec 13 12:06:14 EST 2017
> Now the real problem is that I can't do any filesystem operations on
> pools earth, remote, remote/TM they all hang, processes blocked on IO
> a quick ps grep:
Since you've had a kernel panic, all bets are off. It's quite likely
that the kernel panic has caused subsequent problems that are blocking
IO to some or all pools.
> The only sign of what might be going wrong is a kernel panic stack
> trace in dmesg:
This kernel panic and its stack trace are a big red flag, especially
because of what the panic is. I'm going to quote selected pieces of it:
Dec 11 21:35:40 earth kernel: [ 140.258196] kernel BUG at
Dec 11 21:35:40 earth kernel: [ 140.258923] RIP: 0010:fortify_panic+0x13/0x22
This is what the kernel panicked in. fortify_panic() in lib/string.c
is an internal kernel routine that is used to panic if the kernel
detects that kernel code is using string functions in an unsafe way
that would lead to buffer overflows or the like. This is obviously
not supposed to happen; if it does, something bad has happened and
the system is unstable from that point onward.
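To make the idea concrete, here is a toy model in Python of what a
fortified copy does. This is only an illustration, not the kernel's
code: the real check is done in C against buffer sizes known to the
compiler, and the names FortifyPanic and fortified_copy are made up
for this sketch.

```python
class FortifyPanic(Exception):
    """Stand-in for the kernel's fortify_panic(); hypothetical name."""


def fortified_copy(dst: bytearray, src: bytes, n: int) -> None:
    """Copy n bytes from src to dst, refusing any out-of-bounds access.

    Models the idea only: if the requested length exceeds the known
    size of either buffer, stop immediately instead of overflowing.
    """
    if n > len(dst):
        raise FortifyPanic(f"write overflow: {n} bytes into {len(dst)}")
    if n > len(src):
        raise FortifyPanic(f"read overflow: {n} bytes from {len(src)}")
    dst[:n] = src[:n]
```

The point is that the check fires on the length passed to the copy, so
a bad length (for instance one derived from damaged data) is enough to
trigger it even when the surrounding code is otherwise correct.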
Dec 11 21:35:40 earth kernel: [ 140.259291] Call Trace:
Dec 11 21:35:40 earth kernel: [ 140.259363] zfs_acl_node_read.constprop.16+0x31a/0x320 [zfs]
This is probably the function where the error happened; the source code
is in module/zfs/zfs_acl.c. This function does call one thing that could
call fortify_panic() (bcopy(), for people following along); however,
clearly this code should not be creating a buffer overflow.
This code has experienced a number of changes since ZFS 0.6.5, although
none of them have change messages that are of the nature 'fixed
potential buffer overflow'. If you can, I would try the most recent
version of ZFS (either the latest development version or at least 0.7.4)
to see if that makes a difference.
It's possible that this is because of a hardware issue, such as a
flipped RAM bit. It's also possible that you've found a genuine issue in
the code. If this happens again after rebooting and updating to 0.7.4,
I'd suggest reporting it as a bug in the issue tracker.
If you can, I would also scrub the pool. Unfortunately it's possible
that you've wound up with on-disk data that now causes this ZFS panic
when it gets touched; I'm not sure if this can be fixed. But the first
step is to scrub the pool after a reboot, and then try reading all data
with something like 'tar -cf /dev/null ....'.
(Technically you don't have to read the data, but you do have to somehow
trigger looking at the potential ACLs for every file. Reading everything
is the easy way to do this.)
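If you'd rather know which file you were touching when things went
wrong, a small script can do the same job as the tar. The following is
a rough sketch, not a tested recovery tool; read_everything and its
parameters are names I've made up here. It prints each path before
touching it, so that if the machine panics the last path logged is the
likely culprit.

```python
import os
import stat
import sys


def read_everything(root, chunk_size=1 << 20, log=sys.stderr):
    """Stat and fully read every regular file under root.

    Returns (files_read, errors), where errors is a list of
    (path, message) pairs for anything that raised OSError.
    """
    files_read = 0
    errors = []

    def note_walk_error(exc):
        # os.walk reports unreadable directories through this callback.
        errors.append((exc.filename, str(exc)))

    for dirpath, _dirnames, filenames in os.walk(root, onerror=note_walk_error):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Log first and flush, so the path survives a crash.
            print(path, file=log, flush=True)
            try:
                st = os.lstat(path)
                if not stat.S_ISREG(st.st_mode):
                    continue  # skip symlinks, devices, FIFOs and so on
                with open(path, "rb") as f:
                    while f.read(chunk_size):
                        pass
                files_read += 1
            except OSError as exc:
                errors.append((path, str(exc)))
    return files_read, errors
```

You would call it on the mountpoint of each affected dataset, for
example read_everything("/some/mountpoint"), with the log sent
somewhere that is not on the pool being checked.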