Can someone explain this result?

Gordan Bobic gordan.bobic at gmail.com
Mon May 9 18:48:39 EDT 2011


On 05/09/2011 09:02 PM, Brian Behlendorf wrote:
> A couple of thoughts about this.  The second 'find' may counterintuitively
> run slower than the first if you're walking a large enough number of files
> that they cannot all be cached.  I believe this fits with your
> observation that you don't see this for small numbers of files.

With even a few hundred MB of RAM, that's a LOT of inodes - find should 
only be touching inode metadata, not the file contents. I don't think 
cache overflow is the issue here; otherwise ext* file systems would 
suffer the same problem, yet find is consistently much faster on ext* 
than on ZFS. ext* seems to benefit from the caches, whereas ZFS always 
seems just as slow on the second pass as it was on the first.
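
To put a very rough number on that - this is only a back-of-the-envelope 
sketch, and the ~1KB-per-inode figure is an assumption, not a measured 
ZFS value:

  # Hypothetical sizing: how many inodes fit in a few hundred MB of
  # cache, assuming ~1KB of cached metadata per inode (an assumed
  # figure, not an actual znode/dnode footprint).
  cache_bytes = 512 * 1024 * 1024          # ~512MB spare for caching
  bytes_per_inode = 1024                   # assumed per-inode footprint
  print(cache_bytes // bytes_per_inode)    # -> 524288 inodes

So even half a GB of spare RAM should be enough to keep on the order of 
half a million inodes hot, which is why I don't think this is a simple 
capacity problem.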

> The reason is that on the first pass you simply need to allocate enough
> memory to cache the new inode.  On the second pass, however, you not only
> need to allocate memory for the new inode, but may also need to pick an
> inode to drop from the cache and have ZFS free the resources associated
> with it.  Right now we rely on Linux, which uses a simple LRU to decide
> which inode should be dropped.  That means cache hits will be rare if the
> metadata for all the files doesn't fit in cache.  And remember we are
> strictly honoring the zfs_meta_limit now.

I have had the opportunity to compare find performance between ext4 and 
ZFS on identical multi-TB data sets, and I really don't think cache 
churn is the issue - wouldn't it affect both file systems the same way? 
In both cases there were still several GB of RAM free after the find. 
Unless, of course, something about ZFS is specifically limiting the 
number of inodes cached at any one time.
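
For what it's worth, here is a tiny toy model (hypothetical sizes, not 
anything from the ZFS code) of the strict-LRU behaviour Brian describes: 
a sequential walk over more inodes than fit in the cache evicts every 
entry just before it would be reused, so the second pass gets no hits at 
all:

  from collections import OrderedDict

  # Toy LRU cache walked sequentially, the way 'find' walks inodes.
  # Sizes are arbitrary; the point is only that working set > cache
  # means a strict LRU yields zero hits on the repeat pass.
  cache_slots, total_inodes = 1000, 1500
  cache, hits = OrderedDict(), 0

  for _ in range(2):                        # two 'find' passes
      for inode in range(total_inodes):
          if inode in cache:
              hits += 1
              cache.move_to_end(inode)      # refresh LRU position
          else:
              if len(cache) >= cache_slots:
                  cache.popitem(last=False) # evict least recently used
              cache[inode] = True
  print(hits)                               # -> 0

But that access pattern should defeat ext4's inode/dentry caches just as 
badly, which is why the asymmetry between the two file systems still 
looks odd to me.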

> Ideally, the overhead for this will be low, but I haven't had a chance to
> profile it yet.  It's certainly true there may be some lock contention
> here, but I haven't yet measured how bad it is.
>
> Concerning gzip compression performing worse, I have a hunch about that
> as well.  The gzip algorithm (unlike lzjb) requires a fair bit of
> scratch working space.  This memory needs to be contiguously mapped,
> which means we need to use vmalloc().  Now, vmalloc() has always been
> notoriously slow in the kernel, and ZFS already puts considerable
> pressure on it.  So the cost of vmalloc()'ing the needed buffer space
> may be what is slowing things down here.

I've only ever used the default compression, which I assume is lzjb, and 
the first thing I found odd was that performance dropped while CPU usage 
barely increased. I would have expected compression to degrade 
performance only once the CPU becomes the bottleneck.

Similarly, with dedup enabled, I've found that the block hashes were 
being cached very ineffectively, even though I expected them to fit into 
memory (the machine has 8GB of RAM, and with one SHA256 hash per 128KB 
block, all of the block hashes should have fit comfortably in the 
cache). The reason I say that is that there were a lot of reads hitting 
the disks while copying files onto ZFS over the network. I'm not sure 
whether that is related to the above issues, but it really does look 
like either certain things aren't being cached at all, or something else 
is going wrong.
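
As a rough sanity check on that expectation (the data set size and the 
per-entry figures below are assumptions for illustration, not 
measurements from my pool):

  # Back-of-the-envelope dedup table sizing for an assumed 4TB of
  # unique data.  32 bytes is just the raw SHA256 digest; the commonly
  # quoted in-core DDT entry is more like a few hundred bytes per
  # block, which changes the picture considerably.
  data_bytes  = 4 * 1024**4                 # assumed 4TB data set
  block_bytes = 128 * 1024                  # 128KB record size
  blocks = data_bytes // block_bytes        # ~33.5 million blocks
  print(blocks * 32  / 1024**3)             # 1.0  GB of raw hashes
  print(blocks * 320 / 1024**3)             # 10.0 GB at ~320B/entry

If the real per-block bookkeeping is closer to the larger figure, that 
alone might explain why the dedup metadata doesn't stay cached in 8GB, 
but that is speculation on my part.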

Gordan


