a difficult-to-describe problem involving long write times and enormous load levels

Daniel Brooks dlb48x at gmail.com
Sat May 21 16:55:20 EDT 2011

I'm running a slightly crazy workload on my machine at the moment, and every
time I leave it alone for a few hours (to work, sleep, etc.) it kills the
machine. I'm trying to figure out why, and the only explanation I can come
up with is that ZFS slowly becomes less and less responsive, leading to a
huge backlog of writes that never finishes. The workload creates a large
number of mostly small files, perhaps 1-2 million files totaling 200-300 GB.
There's also a background service that snapshots some of my filesystems at
regular intervals. Since this has so far only happened while I was away from
the machine, my only clues come from after the fact: the load average was
180 this morning when I logged in via ssh. The box was still operating
normally for the root user, as long as I didn't touch a ZFS filesystem, and
it was even still acting as a gateway for the rest of my network. In fact,
there was essentially zero CPU usage, and although most of my 8 GB of RAM
was in use, that wasn't unexpected. There was essentially no swap usage at
all, which is normal. Is there any way I can narrow down the actual cause?
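Next time it wedges, here is a rough sketch of what I plan to capture (the
`/proc/spl` kstat paths are my assumption about where SPL exports its
statistics on this build, so treat those as a guess):

```shell
# Load average counts tasks in uninterruptible (D) sleep as well as
# runnable ones, so a load of 180 with zero CPU use suggests ~180 threads
# stuck in the kernel, presumably waiting on ZFS I/O.

uptime                                    # current load averages
ps -eo state,pid,comm | awk '$1 ~ /^D/'   # list the D-state (blocked) tasks

# ZFS-specific follow-ups (commented out; they assume a live pool and
# that SPL exposes kstats under /proc/spl on this build):
#   echo w > /proc/sysrq-trigger          # dump blocked-task stacks to dmesg
#   cat /proc/spl/kstat/zfs/arcstats
#   zpool iostat -v tank 5
```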

I'm at revision 372c2572336468cbf60272aa7e735b7ca0c3807c for spl and
3fd70ee6b0bc9fa74b7ef87657b9cc3b0304f689 for zfs.

  pool: tank
 state: ONLINE
status: The pool is formatted using an older on-disk format.  The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
        pool will no longer be accessible on older software versions.
 scan: scrub repaired 0 in 24h37m with 0 errors on Fri Feb 25 02:43:52 2011

        NAME                                         STATE     READ WRITE CKSUM
        tank                                         ONLINE       0     0     0
          raidz1-0                                   ONLINE       0     0     0
            ata-WDC_WD20EARS-00S8B1_WD-WCAVY2304352  ONLINE       0     0     0
            ata-WDC_WD20EARS-00S8B1_WD-WCAVY2449350  ONLINE       0     0     0
            ata-WDC_WD20EARS-00S8B1_WD-WCAVY2454409  ONLINE       0     0     0
            wwn-0x50014ee2aeb3644d                   ONLINE       0     0     0
            ata-WDC_WD20EARS-00S8B1_WD-WCAVY2657307  ONLINE       0     0     0

errors: No known data errors

NAME                                                     USED  AVAIL  REFER
tank                                                    4.88T  2.25T  52.7K
tank/bbgun06                                             272K  2.25T   239K
tank/crowder                                            8.20G  2.25T  8.20G
tank/db48x                                              4.87T  2.25T   171G
tank/db48x/Maildir                                      1.57G  2.25T  1.57G
tank/db48x/archives                                      675G  2.25T  1.13G
tank/db48x/archives/BattleshipPotemkin                  4.58G  2.25T  4.58G
tank/db48x/archives/Cyrano_DeBergerac                   8.32G  2.25T  8.32G
tank/db48x/archives/Friendster                          82.7G  2.25T  82.7G
tank/db48x/archives/GoogleVideo                          528G  2.25T   528G
tank/db48x/archives/National Jukebox                     135K  2.25T   135K  /home/db48x/archives/National Jukebox
tank/db48x/archives/PS3-HyperVisor-Bible                 164M  2.25T   164M
tank/db48x/archives/TripDownMarketStreetrBeforeTheFire  2.39G  2.25T  2.39G
tank/db48x/archives/Vancouver_Open_Data_Catalogue       26.5G  2.25T  18.7G
tank/db48x/archives/WikiLeaks                           16.5G  2.25T  1.40G
tank/db48x/archives/WikiLeaks/complete                  15.1G  2.25T  15.1G
tank/db48x/archives/bbc.closing.sites.archive           1.89G  2.25T  1.89G
tank/db48x/archives/lulupoetry                          1.96G  2.25T  1.96G
tank/db48x/archives/www.astronautix.com                 1.07G  2.25T  1.07G
tank/db48x/caches                                       3.94M  2.25T  3.94M
tank/db48x/logs                                          158M  2.25T   158M
tank/db48x/music                                        47.7G  2.25T  47.7G
tank/db48x/offlineimap-maildir                          2.60G  2.25T  2.60G
tank/db48x/projects                                      206G  2.25T  73.8G
tank/db48x/projects/Inventory                            112G  2.25T   112G
tank/db48x/projects/mdg                                 20.0G  2.25T  20.0G
tank/db48x/secure                                       33.6K  2.25T  33.6K
tank/db48x/video                                        3.71T  2.25T  3.71T
tank/db48x/vm                                           12.5G  2.25T  12.5G
tank/db48x/windows-backup                               67.7G  2.25T  67.7G
tank/jcranmer                                            108K  2.25T  75.9K
tank/makrond                                            36.3M  2.25T  36.3M
tank/olad                                               1008M  2.25T  1007M
tank/osldgoth                                            355M  2.25T   355M
tank/roadkill                                           65.5K  2.25T  65.5K
tank/stephen                                            99.1K  2.25T  68.7K
tank/xp3                                                48.1M  2.25T  48.1M

Here's part of the history:

2011-05-21.00:00:23 zfs snapshot tank/db48x@auto-2011-05-21_00.00
2011-05-21.00:30:41 zfs snapshot tank/db48x@auto-2011-05-21_00.30
2011-05-21.01:00:56 zfs snapshot tank/db48x@auto-2011-05-21_01.00
2011-05-21.01:30:34 zfs snapshot tank/db48x@auto-2011-05-21_01.30
2011-05-21.10:33:46 zfs destroy tank/db48x@auto-2011-05-20_19.30
2011-05-21.10:33:48 zfs destroy tank/db48x@auto-2011-05-20_20.30
2011-05-21.10:33:49 zfs destroy tank/db48x@auto-2011-05-20_21.00
2011-05-21.10:33:51 zfs destroy tank/db48x@auto-2011-05-20_21.30
2011-05-21.10:33:52 zfs destroy tank/db48x@auto-2011-05-20_22.00
2011-05-21.10:33:54 zfs destroy tank/db48x@auto-2011-05-20_22.30
2011-05-21.11:00:08 zfs snapshot tank/db48x@auto-2011-05-21_11.00

You can see it creating snapshots every half hour; then, at some point after
1:30 am, no more snapshots are created. At 10:30 am I woke up and eventually
rebooted the computer to bring it back. After the reboot it destroyed some
of the older snapshots that would have been destroyed hours earlier if the
system had been able to do so.
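The stall is also easy to spot mechanically. Something like the following
(fed with saved `zpool history` lines; the sample entries are inlined here,
and the one-hour threshold is just my choice) flags the point where
snapshot creation stopped:

```shell
# Flag any snapshot entry that follows a gap of more than an hour.
# Timestamps look like "2011-05-21.00:00:23"; this sketch only compares
# times within a single day, which is enough for this log.
awk '/zfs snapshot/ {
    split($1, dt, ".");                  # dt[1]=date, dt[2]=time
    split(dt[2], t, ":");
    secs = t[1]*3600 + t[2]*60 + t[3];   # seconds since midnight
    if (prev != "" && secs - prev > 3600)
        print "gap before: " $0;
    prev = secs;
}' <<'EOF'
2011-05-21.00:00:23 zfs snapshot tank/db48x@auto-2011-05-21_00.00
2011-05-21.00:30:41 zfs snapshot tank/db48x@auto-2011-05-21_00.30
2011-05-21.01:00:56 zfs snapshot tank/db48x@auto-2011-05-21_01.00
2011-05-21.01:30:34 zfs snapshot tank/db48x@auto-2011-05-21_01.30
2011-05-21.11:00:08 zfs snapshot tank/db48x@auto-2011-05-21_11.00
EOF
```

With these entries it prints the 11:00 snapshot as the first one after a
gap, matching the 1:30 am to 11:00 am dead zone described above.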

Is there any other information I can supply?