[zfs-discuss] arc_reclaim pegs CPU and appears to block on _raw_spin_lock?

Andy Sparrow spuggy930 at gmail.com
Fri Apr 27 21:41:45 EDT 2018

Hi folks.

First-time posting to the list, please excuse any missing info - or just
ask & I'll try to provide it.

We're currently using a 4.4.x kernel with ZFS; want to update to a
later kernel for multiple good reasons, and of course, the corresponding
ZFS version (trying 4.15.1 and 0.7.6, respectively).

On the 4.4.x setup, we constrained zfs_arc_max (to 14G on a 256G host) because we
noted that ARC couldn't respond fast enough to memory pressure caused by a
new process randomly spinning up and trying to malloc() tens of gigs of RAM
(ARC would start dumping, but in the meantime the new process would have
exited because of the "failed" malloc - yeah, no swap).
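For reference, this is the usual ZFS-on-Linux way to pin the ARC cap (the
paths are the standard module-parameter locations; the byte value shown is
14 GiB, matching our cap):

```shell
# Persistent cap, applied at module load:
#   /etc/modprobe.d/zfs.conf
#   options zfs zfs_arc_max=15032385536    # 14 GiB in bytes

# Or change it on a live system (takes effect on the next reclaim pass):
echo 15032385536 > /sys/module/zfs/parameters/zfs_arc_max
```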

Initial testing with 4.15.1/0.7.6 seemed to go as expected, and we duly
booted a storage box (24 SSDs organized as 4 RAID-Z1 vdevs of 6 drives
each) with it to see how it did there.

After ~10 days of running, host began alerting for high disk write times,
arc_reclaim was pegged @ 100% CPU, with a bunch of arc_prune threads
consuming 30% CPU apiece.

Looking back through the historical record, arc_reclaim CPU usage seemed to
have been trending up over time; at first these peaks would clear by
themselves, but the overall trend was always upward (and the "floor" kept
rising too).

Eventually, arc_reclaim was hitting 100% regularly, and then later was
almost always at 99-100%. And after sitting there like that for some time,
actual disk write times began to be affected.

Actual disk I/O according to iostat was pretty minimal (<1% util), with
iotop reporting 10-60MB/s read, 10-25MB/s write to the pool. So everything
seemed to be behaving normally, except for the writes that appeared to be
blocked.

Reported disk write times climbed as high as 15 seconds before we rebooted
the box with the earlier kernel/ZFS version...

I believe I've managed to reproduce this issue on a box that we have a
little more flexibility to experiment on, by

1) constraining ARC really low (4G)
2) running a "normal" workload (actually, somewhat higher load than we'd
normally put on a single host).
3) adding multiple rsync & filesystem scans/deletes (to compensate for lack
of historical file deletion)
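A hypothetical sketch of the extra load in (3): lots of small-file copy
churn plus recursive scan/delete sweeps, since the dentry/inode churn is
what drives arc_prune. This version uses a throwaway temp tree so it's safe
to run anywhere; on the real box these would be rsyncs between datasets on
the pool (cp -a stands in for rsync here):

```shell
# Build a tree of small files, copy it, then do a recursive
# scan + delete pass over the copy -- each step stats/creates/
# destroys inodes and dentries, which is what pressures arc_prune.
work=$(mktemp -d)
mkdir -p "$work/src"
i=1
while [ "$i" -le 100 ]; do
    echo data > "$work/src/f$i"
    i=$((i + 1))
done
cp -a "$work/src" "$work/dst"                   # metadata-heavy copy
find "$work/dst" -type f -name 'f1*' -delete    # scan + delete pass
count=$(find "$work/dst" -type f | wc -l)       # files remaining
echo "remaining: $count"
rm -rf "$work"
```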

At first, the host appeared entirely unperturbed by the load from (1) &
(2), even after a day or two. Then we added (3) into the mix and, a few
hours later, arc_reclaim CPU usage ramped up progressively (as did the
arc_prune threads), until eventually arc_reclaim pegged at 100%, with most
of the arc_prune threads showing 30-35% CPU apiece. At which point perf top
showed:

Samples: 2G of event 'cycles:ppp', Event count (approx.) 1043745486113

Overhead  Shared Object                      Symbol

  24.13%  [kernel]                           [k] _raw_spin_lock
  18.08%  [kernel]                           [k] zfs_prune
  10.54%  [kernel]                           [k] d_prune_aliases
  10.37%  [kernel]                           [k] _atomic_dec_and_lock
   7.96%  [kernel]                           [k] igrab
   7.75%  [kernel]                           [k] iput
   0.43%  [kernel]                           [k] queued_spin_lock_slowpath
   0.39%  [kernel]                           [k] osq_lock
   0.38%  [kernel]                           [k] arc_evict_state
   0.28%  [kernel]                           [k] arc_space_return

It seems as though if this additional load is removed before too long,
arc_reclaim can still "recover" - as in the CPU load will reduce to
"reasonable" levels and go back to changing dynamically as required.

However, it also seems as though if it is in this state for too long, it
can/will stay pegged even after much load is removed; after approximately
30 minutes of arc_reclaim being 100% CPU, I killed all the rsync/recursive
find/deletes (leaving only "normal" workload) and arc_reclaim did not
appear to get back on top of things after 10-15 minutes, staying pegged at
100% CPU.

Host has 72 E5-2697 cores; sys load was 2000%, with 4000% idle - so it's
not load that is the concern, rather that stuff appears to be blocking on a
critical resource...

On increasing ARC size to 6G, arc_reclaim immediately vanished from all
flavors of top, ARC filled up to 100%, and arcstat showed ARC stable and
happily maintaining the target size.

Arbitrarily reducing ARC size back to 4G made the symptoms come back;
arcstat showed actual ARC size bobbling between 4.1-4.2G, much CPU on
arc_reclaim and high _raw_spin_lock usage. Looked to be working hard - and
failing - to get back to target size.
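For anyone wanting to reproduce the observation, this is a sketch of how to
watch target vs. actual ARC size. The here-string numbers below are made up
for illustration (roughly matching the 4.2G-vs-4G state above); on a real
box you'd read /proc/spl/kstat/zfs/arcstats instead of the variable:

```shell
# arcstats kstat format: "name type data", one counter per line.
# "size" is actual ARC bytes, "c" is the current target.
arcstats='name type data
size 4 4509715660
c 4 4294967296
c_max 4 4294967296'
actual=$(echo "$arcstats" | awk '$1 == "size" {printf "%.1f", $3 / 2^30}')
target=$(echo "$arcstats" | awk '$1 == "c"    {printf "%.1f", $3 / 2^30}')
echo "actual ${actual} GiB vs target ${target} GiB"
```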

Reducing the ARC size in smaller steps (e.g. to around what
zfs_arc_shrink_shift indicates, 10 currently) and allowing arc_reclaim some
time to stabilize between each reduction appeared to go somewhat better;
some CPU peaks immediately after the reduction, but they seemed not to be
impacting throughput too much.
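The rough arithmetic behind the step size, on my reading of the tunable: a
reclaim pass aims to free on the order of arc_size >> zfs_arc_shrink_shift,
so for a 4 GiB ARC with shift 10 each step down of about that size should
stay within a single pass:

```shell
# Per-pass reclaim estimate: arc_size >> zfs_arc_shrink_shift.
arc_size=$((4 * 1024 * 1024 * 1024))   # 4 GiB in bytes
shrink_shift=10                         # current zfs_arc_shrink_shift
step=$((arc_size >> shrink_shift))
echo "per-pass reclaim: $((step / 1024 / 1024)) MiB"
```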

However, even while arc_reclaim appeared fairly unremarkable - e.g. CPU
around 2-5% and arc_prune threads either not in the top table at all or
showing <1% CPU, _raw_spin_lock was showing ~5-6%.

We've got several other boxes booted with this same kernel/ZFS version
(albeit running a completely different workload on top of a RAID-10 pool)
that have been up for 30 days without anything like this being observed.

So; this behavior doesn't seem quite right...

I could easily be persuaded to increase ARC size - except I'd really like
some warm fuzzies that the same problem won't simply happen again at the
new max, just taking longer to get there.

I've got more data (e.g. adbstats, arcstats) if anyone is interested.

Pool was created with:


And apart from:


pretty much everything is default.

Any suggestions as to how to avoid this/figure out what's non-optimal?

There's usually a 15/85 split in ARC between Recently/Frequently Used -
I'd be perfectly happy to see ARC throw away whatever it needs to get
down to size.
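That split comes from the mru_size / mfu_size counters in arcstats; a quick
sketch of the computation (the byte values below are made up to match our
~15/85 split - on a real box pull them from /proc/spl/kstat/zfs/arcstats):

```shell
# Recently Used (MRU) vs Frequently Used (MFU) share of the ARC.
mru=644245095     # mru_size in bytes (illustrative)
mfu=3650722201    # mfu_size in bytes (illustrative)
mru_pct=$(( mru * 100 / (mru + mfu) ))
echo "MRU ${mru_pct}% / MFU $((100 - mru_pct))%"
```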
