[zfs-discuss] arc_reclaim pegs CPU and appears to block on _raw_spin_lock?

Richard Yao ryao at gentoo.org
Sat Apr 28 14:49:00 EDT 2018


I suggest generating a flame graph when the system is in this state so that we can see what is happening:

https://github.com/brendangregg/FlameGraph/

You will need to make sure that kernel debug symbols are available though. I recall that you need ZFS_DKMS_DISABLE_STRIP='yes' in /etc/default/zfs on systems using DKMS to ensure that the modules are not stripped. If you mention your distribution, someone should know offhand how to ensure that you have debug symbols. Otherwise, the flame graph results will not be that useful.
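
If it helps, the rough sequence is (assuming perf is installed and the FlameGraph scripts are in the current directory; adjust paths and duration to taste):

    # sample all CPUs with stack traces for ~60 seconds while arc_reclaim is pegged
    perf record -F 99 -a -g -- sleep 60
    # fold the stacks and render the SVG
    perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > arc_reclaim.svg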

Also, this probably is worth reporting in the issue tracker.

> On Apr 27, 2018, at 9:41 PM, Andy Sparrow via zfs-discuss <zfs-discuss at list.zfsonlinux.org> wrote:
> 
> 
> Hi folks.
> 
> First-time posting to the list, please excuse any missing info - or just ask & I'll try to provide it.
> 
> We're currently running a 4.4.x kernel with ZFS 0.6.5.8. We want to move to a later kernel for several good reasons, and of course to the corresponding ZFS version (trying 4.15.1 and 0.7.6, respectively).
> 
> In 0.6.5.8, we constrained zfs_arc_max (to 14G on a 256G host) because we found that the ARC couldn't respond fast enough to memory pressure when a new process randomly spun up and tried to malloc() tens of GB of RAM (the ARC would start dumping, but in the meantime the new process had already exited because of the "failed" malloc - yes, there's no swap).
> 
> Initial testing with 4.15.1/0.7.6 seemed to go as expected, and we duly booted a storage box (24 SSDs organized as 4 RAID-Z1 vdevs of 6 drives each) with it to see how it did there. 
> 
> After ~10 days of running, the host began alerting for high disk write times; arc_reclaim was pegged at 100% CPU, with a bunch of arc_prune threads consuming 30% CPU apiece.
> 
> Looking back through the historical record, arc_reclaim CPU usage seemed to have been trending up over time; at first these peaks would clear by themselves, but the overall trend was always upward (and the "floor" kept rising). 
> 
> Eventually, arc_reclaim was hitting 100% regularly, and then later was almost always at 99-100%. And after sitting there like that for some time, actual disk write times began to be affected.
> 
> Actual disk I/O according to iostat was pretty minimal (<1% util), with iotop reporting 10-60MB/s read and 10-25MB/s write to the pool. So everything seemed to be behaving normally, except for what appeared to be blocked reads/writes.
> 
> Reported disk write times climbed as high as 15 seconds before we rebooted the box with the earlier kernel/ZFS version...
> 
> I believe I've managed to reproduce this issue on a box that we have a little more flexibility to experiment on, by 
> 
> 1) constraining ARC really low (4G)
> 2) running a "normal" workload (actually, somewhat higher load than we'd normally put on a single host).
> 3) adding multiple rsync & filesystem scans/deletes, to compensate for the lack of historical file deletion (a rough sketch of this kind of churn follows below).
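> 
> Purely for illustration (paths and options here are made up, not our actual scripts), the extra churn in (3) is the kind of thing you'd get from several of these running concurrently:
> 
>     # copy a tree around while other jobs walk it and delete parts of it
>     rsync -a /tank/data/ /tank/scratch/ &
>     find /tank/scratch -type f -mtime +7 -delete &
>     find /tank/data -type f | wc -l &
>     wait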
> 
> At first, the host appeared entirely unperturbed by the load from (1) & (2), even after a day or two. Then I added (3) into the mix and, a few hours later, arc_reclaim CPU usage ramped up progressively (as did the arc_prune threads), until eventually arc_reclaim pegged at 100% with most of the arc_prune threads showing 30-35% CPU apiece. At that point perf top showed:
> 
> Samples: 2G of event 'cycles:ppp', Event count (approx.) 1043745486113                                                                                                
> Overhead  Shared Object                      Symbol                                                                                                                    
>   24.13%  [kernel]                           [k] _raw_spin_lock
>   18.08%  [kernel]                           [k] zfs_prune
>   10.54%  [kernel]                           [k] d_prune_aliases
>   10.37%  [kernel]                           [k] _atomic_dec_and_lock
>    7.96%  [kernel]                           [k] igrab
>    7.75%  [kernel]                           [k] iput
>    0.43%  [kernel]                           [k] queued_spin_lock_slowpath
>    0.39%  [kernel]                           [k] osq_lock
>    0.38%  [kernel]                           [k] arc_evict_state
>    0.28%  [kernel]                           [k] arc_space_return
> 
> 
> It seems that if this additional load is removed before too long, arc_reclaim can still "recover" - i.e. its CPU usage drops back to "reasonable" levels and goes back to adjusting dynamically as required. 
> 
> However, it also seems that if it stays in this state for too long, it can/will remain pegged even after much of the load is removed; after approximately 30 minutes of arc_reclaim at 100% CPU, I killed all the rsync/recursive find/deletes (leaving only the "normal" workload), and arc_reclaim did not appear to get back on top of things after 10-15 minutes, staying pegged at 100% CPU.
> 
> The host has 72 E5-2697 cores; sys load was 2000%, with 4000% idle - so it's not the load that is the concern, but rather that things appear to be blocking on a critical resource...
> 
> On increasing ARC size to 6G, arc_reclaim immediately vanished from all flavors of top, ARC filled up to 100%, and arcstat showed ARC stable and happily maintaining the target size.
> 
> Arbitrarily reducing ARC size back to 4G made the symptoms come back; arcstat showed actual ARC size bobbling between 4.1-4.2G, much CPU on arc_reclaim and high _raw_spin_lock usage. Looked to be working hard - and failing - to get back to target size.
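> 
> (For anyone following along: on ZFS on Linux the ARC limit can be changed on the fly by writing the new value, in bytes, to the module parameter - e.g. for the 6G step above:
> 
>     echo 6442450944 > /sys/module/zfs/parameters/zfs_arc_max   # 6 GiB
> 
> and 4294967296 for the 4G case.)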
> 
> Reducing the ARC size in smaller steps (e.g. around what zfs_arc_shrink_shift indicates, currently 10) and allowing arc_reclaim some time to stabilize between each reduction appeared to go somewhat better; there were some CPU peaks immediately after each reduction, but they seemed not to impact throughput too much.
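> 
> (For the arithmetic: if I'm reading the module docs right, zfs_arc_shrink_shift is log2 of the fraction of the ARC reclaimed in one pass, so with shift=10 a single pass over a ~4G ARC targets roughly 4G >> 10 = 4M - hence the small step sizes.)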
> 
> However, even while arc_reclaim appeared fairly unremarkable - e.g. CPU around 2-5%, and the arc_prune threads either not in the top table at all or showing <1% CPU - _raw_spin_lock was still showing ~5-6%.
> 
> We've got several other boxes booted with this same kernel/ZFS version (albeit running a completely different workload on top of a RAID-10 pool) that have been up for 30 days without anything like this being observed.
> 
> So, this behavior doesn't seem quite right... 
> 
> I could easily be persuaded to increase the ARC size - except I'd really like some warm fuzzies that the same problem won't simply happen again at the new max, just taking longer to get there.
> 
> I've got more data (e.g. abdstats, arcstats) if anyone is interested.
> 
> Pool was created with:
> 
> ashift=13
> primarycache=all
> xattr=sa
> atime=off
> compress=gzip-2
> 
> And apart from:
> 
> zfs_arc_shrink_shift=10
> zfs_arc_max=14G
> 
> pretty much everything is default.
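> 
> Concretely, the setup amounts to something like the following (device names are placeholders, not our actual layout):
> 
>     # 4 x 6-disk raidz1 with 8K sectors; filesystem properties set at creation
>     zpool create -o ashift=13 \
>         -O compress=gzip-2 -O xattr=sa -O atime=off -O primarycache=all \
>         tank \
>         raidz1 sda sdb sdc sdd sde sdf \
>         raidz1 sdg sdh sdi sdj sdk sdl \
>         raidz1 sdm sdn sdo sdp sdq sdr \
>         raidz1 sds sdt sdu sdv sdw sdx
> 
>     # /etc/modprobe.d/zfs.conf (zfs_arc_max in bytes = 14 GiB)
>     options zfs zfs_arc_max=15032385536 zfs_arc_shrink_shift=10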
> 
> Any suggestions as to how to avoid this/figure out what's non-optimal? 
> 
> There's usually a 15/85 split in the ARC between Recently/Frequently Used - I'd be perfectly happy to see the ARC throw away whatever it needs to get down to size. 
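> 
> For reference, that breakdown corresponds to the mru_size/mfu_size counters (the same numbers arc_summary reports as Recently/Frequently Used cache size), which can be pulled directly with e.g.:
> 
>     grep -E '^(mru|mfu)_size' /proc/spl/kstat/zfs/arcstats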
> 
> 
> Thanks!
> 