Native zfs vs zfs-fuse

devsk devsku at gmail.com
Sun Apr 24 16:46:53 EDT 2011



As I had mentioned yesterday, I migrated my pools to native zfs today,
so I had a chance to compare the two. Note that this is not a very
exhaustive test; it's just one data point.

The tests use the same hardware, the same pools and the same software
configuration. The ARC size was set to 1268MB for both to limit the RAM
usage by ZFS. One thing I noticed was that arc_reclaim kept the
overall RAM usage by native ZFS in check, whereas zfs-fuse doesn't have
such checks, so the memory used by the zfs-fuse process is much larger.
This stricter accounting clearly hindered native ZFS, because arc_reclaim
was seen spinning on the CPU during tests, blocking the FS operations
behind it.
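
For reference, the 1268MB cap was applied roughly like this on each side;
treat it as a sketch, since the zfs-fuse option name may differ between
versions, so double-check yours:

# modprobe zfs zfs_arc_max=1329594368      (native ZFS: bytes, = 1268MB)
# zfs-fuse --max-arc-size 1268             (zfs-fuse: MB)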

The only major configuration difference was that the kernel (2.6.38.4)
used PREEMPT_VOLUNTARY for the zfs-fuse run, while native ZFS needed
preemption turned off (PREEMPT_NONE) in the kernel config. This could be
one of the reasons for the complete lockups (mouse, keyboard, ssh
sessions, conky updates) lasting several seconds while arc_reclaim spun
on the CPU during the native ZFS bonnie run on the dedup pool.
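
You can check which preemption model your kernel was built with, e.g.
(the config file location depends on your distro):

$ grep PREEMPT /boot/config-$(uname -r)
$ zcat /proc/config.gz | grep PREEMPT      (if CONFIG_IKCONFIG_PROC is enabled)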

zfs-fuse 0.6.9 and native ZFS master as of 6pm Apr 21, 2011 PST were
used for this test.

I. RAIDZ2 Pool of 5 drives without dedup:

1. Bonnie, the useless test: I asked it to use 25GB of storage, but the
results show some crazy numbers. It looks like bonnie is not writing
random data but highly compressible data. Unfortunately, the FS I used
for testing had compression enabled, and I did not want to change
anything in the setup while comparing. In hindsight, I should have
created a new 30GB FS just for testing. But since the setup was the same
for native and fuse, the numbers may still be comparable. It is clear
that native ZFS is burning through the data while using a larger amount
of CPU; the zfs-fuse numbers look more sane.
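
A bonnie++ run of this size looks roughly like this (the mount point and
user are placeholders, not my exact command line; 25000 is the 25GB size
in MB, and -u is needed because bonnie++ refuses to run as root):

$ bonnie++ -d /pool/testfs -s 25000 -u someuser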

Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
native-zfs   25000M   213  89 934310  66 609655  72   547  99 1606875  95 250.3   3
zfs-fuse     25000M    47  15 227311  13 185077  13 +++++ +++  749087  13 283.1   1
Latency native      39345us     461ms     400ms   23241us     112ms     213ms
Latency fuse          236ms    1136ms     944ms   18779us     104ms     205ms
Version  1.96       ------Sequential Create------ --------Random Create--------
native-zfs          -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
native           16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
fuse             16 12921  15 +++++ +++ 18969  13 11634  15 +++++ +++ 18717  16
Latency native      66302us     246us     199us   12334us      26us      82us
Latency fuse        10231us     663us     651us   31224us      36us     195us
Finish Time native  2m27s
Finish Time fuse    6m37s

2. Scrub times: zfs-fuse won this one clearly. This was a
disappointment for me, although not as bad as it looked initially.

zfs-fuse   : 1h26m
native zfs : 1h42m
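
I started and timed the scrubs with the usual commands (the pool name
below is a placeholder):

# zpool scrub tank
# zpool status tank      (shows progress, and elapsed time once it completes)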

3. Write random data: 1GB of /dev/urandom is first copied to /var/tmp/
(tmpfs) and then from there it is copied to ZFS.
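
The staging file itself was created with something like:

# dd if=/dev/urandom of=/var/tmp/tempfile bs=1M count=1000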

zfs-fuse :

# time dd if=/var/tmp/tempfile of=tempfile bs=1M
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 4.23787 s, 247 MB/s

native zfs:
# time dd if=/var/tmp/tempfile of=tempfile bs=1M
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 3.09814 s, 338 MB/s

Native ZFS wins this one.

II. Similar tests on a RAIDZ1 pool of 3 drives with dedup turned on.
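
Dedup here is nothing special, just the standard property on the pool
(the pool name below is a placeholder):

# zfs set dedup=on tank3
# zpool list tank3       (the DEDUP column shows the current dedup ratio)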

1. Bonnie, the not-so-useless test: Because the dedup bottleneck is
unaffected by compression, this was a fair comparison, and native ZFS
lost this one, BIG TIME. It took more than twice the time to finish.
There were times when the system became completely unresponsive while
running under native ZFS; arc_reclaim was seen hogging the CPU whenever
the system did come back and the conky screen refreshed.

Note the latencies in native.

Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
native       25000M   212  86  16151   1   6292   1   527  98  340504  19   4.7   0
zfs-fuse     25000M    47  17  16396   1  14958   1 +++++ +++  363892   6 511.1   1
Latency native      38016us   11629ms   18148ms   42935us     764ms    1924ms
Latency fuse          223ms   28442ms   57132ms   11621us   12908ms     225ms
Version  1.96       ------Sequential Create------ --------Random Create--------
zfs-fuse            -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
native           16   128   0 +++++ +++   237   0 23613  91 27131  23 +++++ +++
fuse             16 10106  10 +++++ +++ 17083  14 11414  15 +++++ +++ 16112  12
Latency native         126s     361us   67626ms   11721us     222ms    5676us
Latency fuse        44075us     692us    1596us   28914us      37us     105ms
Finish Time native  128m25s
Finish Time fuse    57m12s

2. Scrub time: Native ZFS wins this by a small margin.

zfs-fuse : 8h0m
native   : 7h56m

III. Some arbitrary native ZFS numbers (I should have remembered to
record these for zfs-fuse as well... hindsight is 20/20!):

1. Copy random data to dedup pool:

# time dd if=/var/tmp/tempfile of=tempfile bs=1M
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 53.2026 s, 19.7 MB/s
real    0m53.208s

2. Read a large file from dedup pool.

$ time dd if=movie.vob of=/dev/null bs=1M
6353+1 records in
6353+1 records out
6662520832 bytes (6.7 GB) copied, 68.1495 s, 97.8 MB/s
real    1m8.150s

3. Read a large file from non-dedup RAIDZ2 pool:

$ time dd if=movie.vob of=/dev/null bs=1M
6353+1 records in
6353+1 records out
6662520832 bytes (6.7 GB) copied, 30.7714 s, 217 MB/s
real    0m30.772s

These numbers are definitely lower than what I remember from zfs-fuse,
but I did not record the zfs-fuse numbers.

IV. Notes for people planning to migrate:

1. Make sure to create a 30GB FS BEFORE migration just for comparison
purposes, and make sure compression is turned off on this FS (see the
sketch after this list).
2. Run bonnie++ with a data size twice the RAM size.
3. PREEMPT_NONE may kill your desktop responsiveness during heavy FS
stress. That's a loss if your ZFS server system is your desktop as
well. Be aware of this.
4. Run a more exhaustive set of tests. I was in such a hurry to see how
native ZFS performs with my RAIDZ2/RAIDZ1 pools that I did not record
exhaustive tests before migration.
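
A minimal sketch of (1), with a placeholder pool name (the quota caps
the FS at 30GB and keeps compression out of the picture):

# zfs create -o compression=off -o quota=30G tank/bench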

Overall, I am sort of disappointed because I expected (unreasonably)
much better performance than zfs-fuse. This experience created a new
sense of respect for the zfs-fuse project...:-) But I am hopeful that
the performance will come with time and maturity of the project. This
is a great start as is!

-devsk

PS: Each drive in the RAIDZ is capable of doing 120+MB/s sequential
reads and writes. These speeds were never hit by native ZFS, but
zfs-fuse hit them occasionally during sequential reads.


