Native zfs vs zfs-fuse

devsk devsku at gmail.com
Mon Apr 25 21:22:21 EDT 2011


If I were to pick a starting point for improving the performance, I
would pick filesystem traversal. It's the worst aspect of ZFS on Linux
(native is only slightly worse than zfs-fuse here). Tools and
utilities that run 'find' or 'stat' over the whole or part of the FS
(updatedb, rkhunter, file indexers, etc.) suffer badly. The problem is
that caching doesn't work very well: two successive runs still end up
going to disk, doing random IOs, and taking the same amount of time.
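
For what it's worth, this is easy to demonstrate with nothing more
than two back-to-back runs of 'find' (the path is just a placeholder);
with working metadata caching the second run should be far faster, but
here both take roughly the same time:

# time find /tank/somedir -type f > /dev/null
# time find /tank/somedir -type f > /dev/null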

-devsk

On Apr 25, 3:54 pm, Brian Behlendorf <behlendo... at llnl.gov> wrote:
> Interesting results!  Thanks for posting them to the list.  While I
> personally would have liked to see the native port perform better, I'm
> encouraged by the stability.  We may be getting to the point where it's
> worth spending some time looking into the performance issues.  There
> really isn't any good reason the native implementation can't perform
> very well.  There's also likely a lot of low-hanging fruit to be had,
> since virtually no performance tuning has been done.
>
> With that in mind, I'd like to suggest that if anyone has the time I'd
> be very interested in how changing the following parameters impacts
> performance on your system.  The current default values are the same as
> on OpenSolaris, but that doesn't mean they are the best values for Linux.
> Here are a few things to try; it's by no means an exhaustive list (a
> quick example of how to set them follows the list).
>
> Module options:
>   zfs_vdev_min_pending = 4
>   zfs_vdev_max_pending = 10
>   zfs_vdev_aggregation_limit = 131072
>   zfs_vdev_scheduler = [noop] anticipatory deadline cfq
>   zfs_prefetch_disable = 0
>   zfs_arc_max = 0
>   zfs_arc_meta_limit = 0
>
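> For example (the values here are purely illustrative, not
> recommendations), these can be set persistently via modprobe options
> or, where the module exposes them as writable, adjusted at runtime
> through /sys/module/zfs/parameters:
>
>   # /etc/modprobe.d/zfs.conf -- applied the next time the module loads
>   options zfs zfs_vdev_max_pending=16 zfs_prefetch_disable=1
>
>   echo 16 > /sys/module/zfs/parameters/zfs_vdev_max_pending
>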
> Descriptions of the options can be found in the ZFS Evil Tuning Guide.
> This week I'll make a pass through the code and expose any tuning
> options I've missed.
>
>  http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide
>
> I'd also like to know where performance is the worst.  What specific
> workloads are you running which you expected to be fast but which
> aren't performing well?  I'd like to take a look at the worst
> offenders first.
>
> --
> Thanks,
> Brian
>
> On Sun, 2011-04-24 at 13:39 -0700, devsk wrote:
> > As I had mentioned yesterday, I migrated my pools to native zfs today,
> > and I had a chance to compare the two. Note that this is not a very
> > exhaustive test; it's just one data point.
>
> > The tests use the same hardware, same pools and same software
> > configuration. The ARC size used was 1268MB for both, to limit the RAM
> > usage by ZFS. One thing I noticed was that arc_reclaim kept the overall
> > RAM usage by native ZFS in check, whereas zfs-fuse has no such checks
> > and hence the memory used by the zfs-fuse process is much larger. This
> > clearly hindered native ZFS, because arc_reclaim was seen spinning on
> > the CPU during the tests and blocking the FS operations behind it.
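>
> > (For reference, the native-ZFS side of that cap is presumably just the
> > zfs_arc_max module parameter, in bytes; 1268MB corresponds to
> >   options zfs zfs_arc_max=1329594368
> > in /etc/modprobe.d/zfs.conf.)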
>
> > The only major difference was that the kernel (2.6.38.4) used
> > PREEMPT_VOLUNTARY for the zfs-fuse run, while native ZFS needed PREEMPT
> > off in the kernel config. This could be one of the reasons for the
> > complete lockups (mouse, keyboard, ssh sessions, conky updates frozen)
> > for several seconds while arc_reclaim spun the CPU during the native
> > ZFS bonnie run on the dedup pool.
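>
> > (The preemption model a running kernel was built with can be checked
> > with something like
> >   grep PREEMPT /boot/config-$(uname -r)
> > or, if the config is exposed, zcat /proc/config.gz | grep PREEMPT.)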
>
> > zfs-fuse 0.6.9 and native ZFS master as of 6pm Apr 21, 2011 PST were
> > used for this test.
>
> > I. RAIDZ2 Pool of 5 drives without dedup:
>
> > 1. Bonnie, the useless test: I asked it to use 25GB of storage but the
> > results show some crazy numbers. It looks like bonnie is not writing
> > random data but highly compressible data. Unfortunately, the FS I used
> > for testing had compression enabled, and I did not want to change
> > anything in the setup while comparing. In hindsight, I should have
> > created a new 30GB FS just for testing. But since the setup was the
> > same for native and fuse, the numbers may still be comparable. It's
> > clear that native ZFS is burning through the data using a higher
> > amount of CPU; the zfs-fuse numbers look more sane.
>
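> > (For reference, a bonnie++ run of this shape looks something like
> >   # bonnie++ -d /pool/testfs -s 25000 -u root
> > with -s giving the file size in MB and -d the directory under test;
> > the flags shown are illustrative, not the exact ones used.)
>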
> > Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
> > Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
> > Machine        Size K/sec %CP K/sec  %CP K/sec  %CP K/sec %CP K/sec %CP  /sec %CP
> > native-zfs   25000M   213  89 934310  66 609655  72   547  99 1606875  95 250.3   3
> > zfs-fuse     25000M    47  15 227311  13 185077  13 +++++ +++  749087  13 283.1   1
> > Latency native      39345us     461ms     400ms   23241us     112ms     213ms
> > Latency fuse          236ms    1136ms     944ms   18779us     104ms     205ms
> > Version  1.96       ------Sequential Create------ --------Random Create--------
> > native-zfs          -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
> >               files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
> > native           16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
> > fuse             16 12921  15 +++++ +++ 18969  13 11634  15 +++++ +++ 18717  16
> > Latency native      66302us     246us     199us   12334us      26us      82us
> > Latency fuse        10231us     663us     651us   31224us      36us     195us
> > Finish Time native  2m27s
> > Finish Time fuse    6m37s
>
> > 2. Scrub times: zfs-fuse won this one clearly. This was a
> > disappointment for me, although not as bad as it looked initially.
>
> > zfs-fuse   : 1h26m
> > native zfs : 1h42m
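>
> > (The scrubs were started with 'zpool scrub <poolname>'; the elapsed
> > time is reported by 'zpool status <poolname>' once the scrub
> > completes.)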
>
> > 3. Write random data: 1GB of /dev/urandom is first copied to /var/tmp/
> > (tmpfs) and then from there it is copied to ZFS.
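>
> > (The staging step would have been something along the lines of
> >   # dd if=/dev/urandom of=/var/tmp/tempfile bs=1M count=1000
> > so that the timed copy reads entirely from tmpfs.)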
>
> > zfs-fuse :
>
> > # time dd if=/var/tmp/tempfile of=tempfile bs=1M
> > 1000+0 records in
> > 1000+0 records out
> > 1048576000 bytes (1.0 GB) copied, 4.23787 s, 247 MB/s
>
> > native zfs:
> > # time dd if=/var/tmp/tempfile of=tempfile bs=1M
> > 1000+0 records in
> > 1000+0 records out
> > 1048576000 bytes (1.0 GB) copied, 3.09814 s, 338 MB/s
>
> > Native ZFS wins this one.
>
> > II. Similar tests on RAIDZ1 of 3 drives with dedup turned on.
>
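> > (Dedup is a per-dataset property, enabled with something like
> >   # zfs set dedup=on <pool>/<fs>
> > and 'zpool list' shows the resulting dedup ratio for the pool.)
>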
> > 1. Bonnie, the not-so-useless test: Because the dedup bottleneck is
> > unaffected by compression, this was a fair comparison, and native ZFS
> > lost this one BIG TIME. It took more than twice as long to finish.
> > There were times when the system became completely unresponsive while
> > running under native ZFS; arc_reclaim was seen hogging the CPU when
> > the system did come back and the conky screen refreshed.
>
> > Note the latencies in native.
>
> > Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
> > Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
> > Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP  K/sec %CP  /sec %CP
> > native       25000M   212  86 16151   1  6292   1   527  98 340504  19   4.7   0
> > zfs-fuse     25000M    47  17 16396   1 14958   1 +++++ +++ 363892   6 511.1   1
> > Latency native      38016us   11629ms   18148ms   42935us     764ms    1924ms
> > Latency fuse          223ms   28442ms   57132ms   11621us   12908ms     225ms
> > Version  1.96       ------Sequential Create------ --------Random Create--------
> > zfs-fuse        -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
> >               files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
> > native           16   128   0 +++++ +++   237   0 23613  91 27131  23 +++++ +++
> > fuse             16 10106  10 +++++ +++ 17083  14 11414  15 +++++ +++ 16112  12
> > Latency native         126s     361us   67626ms   11721us     222ms    5676us
> > Latency fuse        44075us     692us    1596us   28914us      36us     105ms
> > Finish Time native  128m25s
> > Finish Time fuse    57m12s
>
> > 2. Scrub time: Native ZFS wins this by a small margin.
>
> > zfs-fuse : 8h0m
> > native   : 7h56m
>
> > III. Some arbitrary native ZFS numbers (I should have remembered to
> > record these on zfs-fuse as well...hindsight 20/20!):
>
> > 1. Copy random data to dedup pool:
>
> > # time dd if=/var/tmp/tempfile of=tempfile bs=1M
> > 1000+0 records in
> > 1000+0 records out
> > 1048576000 bytes (1.0 GB) copied, 53.2026 s, 19.7 MB/s
> > real    0m53.208s
>
> > 2. Read a large file from dedup pool.
>
> > $ time dd if=movie.vob of=/dev/null bs=1M
> > 6353+1 records in
> > 6353+1 records out
> > 6662520832 bytes (6.7 GB) copied, 68.1495 s, 97.8 MB/s
> > real    1m8.150s
>
> > 3. Read a large file from non-dedup RAIDZ2 pool:
>
> > $ time dd if=movie.vob of=/dev/null bs=1M
> > 6353+1 records in
> > 6353+1 records out
> > 6662520832 bytes (6.7 GB) copied, 30.7714 s, 217 MB/s
> > real    0m30.772s
>
> > These numbers are definitely lower than what I remember from zfs-fuse,
> > but I did not record them.
>
> > IV. Notes for people planning to migrate:
>
> > 1. Make sure to create a 30GB FS BEFORE migration just for comparison
> > purposes, and make sure compression is turned off on this FS (see the
> > example after this list).
> > 2. Run bonnie++ with storage size twice the RAM size.
> > 3. PREEMPT_NONE may kill your desktop responsiveness during heavy FS
> > stress. That's a loss if your ZFS server system is your desktop as
> > well. Be aware of this.
> > 4. Run a more exhaustive set of tests. I was in such a hurry to see
> > how native ZFS performs with my RAIDZ2/RAIDZ1 pools that I did not
> > record exhaustive tests before migration.
>
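> > (For point 1, something along these lines would do -- the name and
> > quota are only an example:
> >   # zfs create -o compression=off -o quota=30G pool/bench
> > so that the bonnie++ numbers aren't inflated by compression.)
>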
> > Overall, I am somewhat disappointed because I (unreasonably) expected
> > much better performance than zfs-fuse. This experience created a new
> > sense of respect for the zfs-fuse project...:-) But I am hopeful that
> > the performance will come with time as the project matures. This is a
> > great start as it is!
>
> > -devsk
>
> > PS: Each drive in the RAIDZ is capable of doing 120+MB/s sequential
> > reads and writes. These speeds are never hit by native ZFS, but
> > zfs-fuse hits them occasionally during sequential reads.
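>
> > (Raw per-drive sequential read speed of that sort can be sanity-checked
> > against the bare device, outside ZFS, with something like
> >   # dd if=/dev/sdX of=/dev/null bs=1M count=2000
> > where sdX is one of the member drives.)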


