[zfs-devel] Reproducible hang when unmounting a filesystem on latest git tip

Chris Siebenmann cks at cs.toronto.edu
Mon May 25 11:07:50 EDT 2015


 I have a reproducible hang situation when I try to unmount a specific
ZFS filesystem using the latest git tip, 65037d9b25c2b as I write this.
The filesystem is theoretically unused at the time (lsof says nothing is
active) but has been used in the past. Oddly, other filesystems in the
same pool unmount fine (and have also been used); however their usage
may have been different, as the filesystem that consistently fails to
unmount is my home directory filesystem.

(I suspect that this may have been introduced in commit f0da4d15082,
'Wait for all znodes to be released before tearing down the superblock'.
I may try to test this by building and testing a 7a3066f version.)

 When this hang happens, a SysRq 'Show backtrace of all active CPUs'
reports one CPU is doing:
	May 25 10:48:16 hawkwind kernel: Call Trace:
	May 25 10:48:16 hawkwind kernel: [<ffffffffa03106c6>] taskq_wait+0x36/0x40 [spl]
	May 25 10:48:16 hawkwind kernel: [<ffffffffa044b52e>] zfs_sb_teardown+0x5e/0x430 [zfs]
	May 25 10:48:16 hawkwind kernel: [<ffffffffa044b956>] zfs_umount+0x36/0x100 [zfs]
	May 25 10:48:16 hawkwind kernel: [<ffffffffa046b5ec>] zpl_put_super+0x2c/0x40 [zfs]
	May 25 10:48:16 hawkwind kernel: [<ffffffff8121c286>] generic_shutdown_super+0x76/0x100
	May 25 10:48:16 hawkwind kernel: [<ffffffff8121c5b6>] kill_anon_super+0x16/0x30
	May 25 10:48:16 hawkwind kernel: [<ffffffffa046b75e>] zpl_kill_sb+0x1e/0x30 [zfs]
	May 25 10:48:16 hawkwind kernel: [<ffffffff8121c9a9>] deactivate_locked_super+0x49/0x60
	May 25 10:48:16 hawkwind kernel: [<ffffffff8121cdfc>] deactivate_super+0x6c/0x80
	May 25 10:48:16 hawkwind kernel: [<ffffffff8123ac13>] cleanup_mnt+0x43/0xa0
	May 25 10:48:16 hawkwind kernel: [<ffffffff8123acc2>] __cleanup_mnt+0x12/0x20
	May 25 10:48:16 hawkwind kernel: [<ffffffff810b8b14>] task_work_run+0xd4/0xf0
	May 25 10:48:16 hawkwind kernel: [<ffffffff81014cd7>] do_notify_resume+0x97/0xb0
	May 25 10:48:16 hawkwind kernel: [<ffffffff81775ca7>] int_signal+0x12/0x17

As far as I can tell from the SysRq data that gets logged, no other task
is doing ZFS or SPL stuff at the time (i.e. I can't find any '[zfs]' or
'[spl]' mentioned in other kernel task stack traces).
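
For anyone who wants to repeat that check mechanically, here is a minimal
sketch of it, assuming the SysRq output has been saved as plain syslog-style
lines in the same format as the trace above. The function name
traces_with_modules and the regular expression are illustrative assumptions,
not part of any ZFS or SPL tooling, and the trace-splitting rule (a trace
starts at 'Call Trace:' and ends at the first line that is not a frame) is a
simplification.

```python
import re

# Matches one stack frame, e.g.
#   "[<ffffffffa03106c6>] taskq_wait+0x36/0x40 [spl]"
# Group 1 is the symbol name; group 2 is the module name, or None when the
# frame belongs to the core kernel.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(\S+?)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?"
)


def traces_with_modules(lines, modules=("zfs", "spl")):
    """Return the call traces that contain a frame from one of `modules`.

    Each trace is a list of (symbol, module) tuples in logged order.
    """
    traces = []
    current = None  # frames of the trace being read, or None if outside one
    for line in lines:
        if "Call Trace:" in line:
            current = []
            traces.append(current)
            continue
        match = FRAME_RE.search(line)
        if match is None:
            current = None  # a non-frame line ends the current trace
        elif current is not None:
            current.append((match.group(1), match.group(2)))
    return [t for t in traces if any(mod in modules for _, mod in t)]
```

Feeding it the lines of a saved log (for example via open() on whatever file
the kernel messages were captured to) returns only the traces that touch the
named modules; in my case that would be expected to be the single umount
trace shown above.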

(I also observed that a 'sync' command hung after the 'umount' hung.)

 Are there any specific diagnostics I could look at in SysRq (or otherwise)
that would help people here? Since I can produce this hang with just
'umount /homes', I can test while the system is live. However I may not
be able to do testing on a regular basis as I'm not near the affected
system very often at the moment.

	- cks

