[zfs-discuss] maximum number of vdevs

Richard Yao ryao at gentoo.org
Fri Apr 15 16:10:29 EDT 2016

On 04/15/2016 03:34 PM, Richard Yao wrote:
> On 04/15/2016 02:38 AM, Gordan Bobic via zfs-discuss wrote:
>> On Fri, Apr 15, 2016 at 5:45 AM, Richard Yao <ryao at gentoo.org> wrote:
>>> On Apr 12, 2016, at 10:08 AM, Gordan Bobic via zfs-discuss <
>>> zfs-discuss at list.zfsonlinux.org> wrote:
>>> I never really understood the fascination with unreasonably huge pools.
>>> A naive calculation suggests that you would need a 100Gbps network
>>> interface for zfs send of a full pool to have any hope of finishing within
>>> 48 hours when you are using 10-disk raidz2 vdevs of 8TB disks. It would be
>>> potentially longer if there is compression (although this will change in
>>> the future when data is sent compressed as it is on disk). However if
>>> incremental replication is used, this becomes much less of a problem,
>>> especially if we are discussing applications such as data archival.
>> Re: compression, there is no reason why zfs send stream cannot be piped
>> through lz4, gzip/pigz, bzip2/pbzip2, xz or another similar compressor
>> (depending on the abundance of CPU)
>>> Getting zfs replication to operate at 100Gbps would be a pain. There is no
>>> way that piping data into ssh and then over the network as is usually done
>>> as part of replication would operate at 100Gbps.
>> You don't have to do it over ssh (even with hardware acceleration, that
>> kind of throughput using AES128 will be unlikely), you can do it over
>> netcat.
> The rate of a composition of two functions is roughly (A * B) / (A + B)
> where A is the rate of one and B is the rate of the other. Composing
> the same function whose rate is R with itself N times gives R / N. This
> operation is both commutative and associative. Therefore, reordering
> the stages makes no difference to the final result.
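
One way to see where that formula comes from, sketched under the
assumption that the stages run strictly one after another with no
overlap: the time spent per unit of data adds up across stages. The
symbols R_{AB} and R_N are labels introduced here for illustration.

```latex
% Assumes the stages run back to back on the same data, with no overlap.
\[
\frac{1}{R_{AB}} = \frac{1}{A} + \frac{1}{B}
\quad\Longrightarrow\quad
R_{AB} = \frac{A B}{A + B},
\qquad
\frac{1}{R_N} = \underbrace{\frac{1}{R} + \dots + \frac{1}{R}}_{N\ \text{times}} = \frac{N}{R}
\quad\Longrightarrow\quad
R_N = \frac{R}{N}.
\]
```

Because the reciprocals are simply summed, the order of the stages does
not affect the result.
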
> From the numbers in Yann Collet's github readme, we have that memcpy
> runs at 4200MB/sec and LZ4 runs at 1850MB/sec on a Core i5-4300U
> @1.9GHz. Also, LZ4 compression will average a factor of 2.101:
> https://github.com/Cyan4973/lz4/blob/master/README.md
> Using the equation(s) for calculating the performance of composite
> functions with those numbers, what you suggested requires that we must
> iterate 4 times for memcpy at 4200MB/sec, once for LZ4 compression at
> 1850MB/sec, once for encryption at 2000MB/sec, and once for that last
> network copy at 2.101 * 4200MB/sec.
> 4200MB/sec / 3 = 1400MB/sec
> (1400MB/sec * 1850MB/sec) / (1400MB/sec + 1850MB/sec) = 800MB/sec
> (800MB/sec * 2.101 * 4200MB/sec) / (800MB/sec + 2.101 * 4200 MB/sec) =
> 730 MB/sec
> That would effectively transfer 1500MB/sec, which is greater than
> 10Gb/sec, but the network link itself would be at 730MB/sec, which is
> not as fast as things could go on 10Gbps (~2600MB/sec after LZ4
> compression). Neither is anywhere near 100Gbps (~12000MB/sec before
> compression, ~25000MB/sec after compression).

I mistakenly did this calculation for fastcat, which uses the splice
syscall to avoid a copy:


The calculation for netcat would require 1 extra copy and make the
numbers 620MB/sec on the wire and 1300MB/sec compressed.

The write into the file descriptor is 1 copy and each thing through
which you pipe is 2 copies unless one does splice like fastcat does.
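
For anyone who wants to redo the arithmetic, here is a minimal Python
sketch. It follows the accounting used in the quoted calculation (N
memcpy passes, one LZ4 pass, then a final copy at 2.101 * 4200MB/sec)
and takes the benchmark figures as given. The names series_rate and
pipeline are made up for this example.

```python
def series_rate(*rates):
    """Combined rate of stages that run one after another on the same data.

    Each stage costs 1/rate seconds per MB, and those costs add up.
    """
    return 1.0 / sum(1.0 / r for r in rates)


MEMCPY = 4200.0  # MB/sec, memcpy figure quoted from the LZ4 README
LZ4 = 1850.0     # MB/sec, LZ4 compression figure from the same README
RATIO = 2.101    # average LZ4 compression ratio from the same README


def pipeline(copies):
    """Return (wire MB/sec, effective MB/sec) for `copies` memcpy passes,
    one LZ4 pass and a final copy of the compressed stream."""
    rates = [MEMCPY] * copies + [LZ4, RATIO * MEMCPY]
    wire = series_rate(*rates)
    return wire, wire * RATIO


fastcat_wire, fastcat_effective = pipeline(3)  # splice avoids one copy
netcat_wire, netcat_effective = pipeline(4)    # one extra copy
```

With 3 copies this lands at roughly 730MB/sec on the wire and with 4
copies at roughly 620MB/sec, in line with the figures above. The small
differences come from rounding at intermediate steps.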

> If we assume the on-disk data is LZ4 compressed as well as improvements
> such as compressed ARC and compressed send/receive are available, we can
> not only ignore in-kernel decompression, but we would have compression
> for free. That gives us the overhead of a copy into userspace and a copy
> into the kernel network stack:
> 4200MB/sec / 2 = 2100MB/sec

This number is also for fastcat, not netcat. The number for netcat
should be 1400MB/sec because it would require 3 copies.

> There is also the overhead of checksum verification, but given that it is
> processed in parallel (and I lack numbers for it), I am ignoring it as a
> simplification (e.g. assuming checksum=off). That is not strictly right,
> but I do not expect parallel checksum verification to be bottlenecking
> us yet.
> This definitely can do 10Gbps on modern hardware, assuming that the disk
> IO can keep up, the driver does a good job of extracting it and the
> network IO does not slow us down. However, we can push it a little
> further by writing a program that passes the fd for the network socket
> directly to the kernel so that we only do 1 copy. That would make
> throughput 4200MB/sec (8800MB/sec effective). Everything thus
> far assumed that the network code does not add any copies beyond the
> copy into the file descriptor.

This number is correct. Also, my idea of writing a program to do this is
unnecessary on the send side where you can use bash shell redirection:


I did a quick test of this:

# In shell A:
nc -l -p 21150

# In shell B:
echo Hello Single Copy Network Write > /dev/tcp/

It unfortunately does not work for receiving:


If someone finds a sane way to start a process with a network socket as
stdin, please share how. If the method is documented, it would be
possible to make send/recv faster in situations where writing to the
network socket is not a security risk (e.g. point to point links).
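
One possible approach, offered as an untested-against-ZFS sketch rather
than a documented method: a small Python wrapper can accept the TCP
connection and hand the connected socket to a child process as its
stdin. The function names and the echo stand-in are made up for this
example, and whether this actually saves a copy on the receive side is
an assumption that would need measuring.

```python
import socket
import subprocess
import sys
import threading


def run_with_socket_stdin(listener, argv):
    """Accept one TCP connection and run argv with that socket as its stdin."""
    conn, _peer = listener.accept()
    with conn:
        # subprocess uses conn.fileno() and installs that descriptor as
        # fd 0 in the child, so the child reads from the socket itself
        # rather than from a pipe fed by a relay process.
        done = subprocess.run(argv, stdin=conn, stdout=subprocess.PIPE)
    return done.stdout


def demo():
    """Loopback round trip with a stand-in child that echoes its stdin."""
    listener = socket.create_server(("127.0.0.1", 0))  # port 0: any free port
    port = listener.getsockname()[1]

    def client():
        with socket.create_connection(("127.0.0.1", port)) as s:
            s.sendall(b"Hello Single Copy Network Read\n")

    sender = threading.Thread(target=client)
    sender.start()
    # Stand-in for the real receiver: copy stdin to stdout unchanged.
    echo = [sys.executable, "-c",
            "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read())"]
    out = run_with_socket_stdin(listener, echo)
    sender.join()
    listener.close()
    return out
```

In principle the argv could be a zfs recv command line instead of the
echo stand-in, but that has not been tried here. As with the send side,
there is no authentication or encryption, so this only makes sense on
links where that is acceptable.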

> This is definitely not enough for 100Gbps, but it is getting there. If
> we get less modest hardware than Yann Collet's Core i5 and ensure it has
> sufficient disk IO, we would have a chance of attaining 100Gbps with
> further improvements.
> This of course assumes no bottlenecks are present in the ZFS driver's
> ability to marshall buffers from the disks into the file descriptor.
> That is not a terrible assumption given that ZFS can take advantage of
> multicore and multiple in-flight IOs, although a module parameter or two
> might need to be changed for the traversal to keep the ZIO pipeline busy enough.
> As for the network IO, I am going to assume that TCP's receive window
> and congestion control are properly tuned. Then I can abuse some numbers
> from cloudflare to make an educated guess on the capability of network
> hardware (excluding concerns about TCP receive windows, latencies and
> overzealous TCP congestion control):
> https://blog.cloudflare.com/how-to-receive-a-million-packets/
> A userland process sending packets bottlenecks at about 370,000 32-byte
> packets per second on their 2GHz Xeon, which when only 1 core is in use,
> is roughly equivalent to Yann Collet's hardware. There is also plenty of
> syscall overhead and other factors, but let's use that number and assume
> we can send 9000 byte packets at that rate in the hope that things
> balance out. That gives us 9000 bytes / packet * 370,000 packets per
> second = ~3,300 MB/sec or 6,900MB/sec once compression is considered.
> The best Intel cores would probably only yield something like a 50%
> improvement, which is far from enough for 100Gbps, but might reach
> 40Gbps. Getting to 100Gbps would probably need us to make changes to how
> the data is shuffled between ZFS and the network stack to do zero copy
> and might require multiqueue too. At that point, I would expect the
> performance of checksum calculations to be a factor.
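
The packet-rate estimate above can be reproduced with a few lines of
Python. This is only a sketch of the arithmetic. It keeps the same
optimistic assumption that 9000-byte packets can be sent at the rate
measured for 32-byte ones, and the function name is made up for this
example.

```python
def packet_throughput_mb(packets_per_sec, bytes_per_packet):
    """Payload throughput in MB/sec for a given packet rate and size."""
    return packets_per_sec * bytes_per_packet / 1e6


PACKETS_PER_SEC = 370_000  # userland send rate from the cloudflare post
JUMBO_BYTES = 9000         # assumed jumbo frame payload size
LZ4_RATIO = 2.101          # average LZ4 compression ratio quoted earlier

wire_mb = packet_throughput_mb(PACKETS_PER_SEC, JUMBO_BYTES)
effective_mb = wire_mb * LZ4_RATIO
```

The wire figure comes out at 3330MB/sec. The effective figure is a
little higher than the ~6,900MB/sec quoted above because that number
rounds before multiplying.
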
> This also only discusses reads on the send side, where overhead from
> raidz parity and/or write amplification from mirroring are non-issues.
> Such things would need to be considered on the receive side, where
> processing the receive stream at 100Gbps would probably be harder than
> sending it was. We would definitely want to accelerate checksum and
> parity calculations with vector instructions in that case. I will avoid
> getting started on theoretical performance so this email does not become
> much longer.
> Anyway, this is just me using some benchmark numbers and a couple of
> equations to suggest that replication performance can be pushed far
> higher than people typically experience. Reality will likely differ
> (especially for LZ4 compression of incompressible data, where
> performance is amazing), but I think the boundary of what is possible is
> higher than you think. :)
>>> The single threaded copy into the pipe buffer, then into userspace and
>>> then to network buffers alone would require a memcpy that is able to run at
>>> 300Gbps, which is faster than any single CPU core can do memcpy to system
>>> memory. If we were to assume a secure point to point link between systems,
>>> then you could write your own program against libzfs_core to make the ZFS
>>> driver write directly to the network socket to avoid the double copy, but I
>>> suspect that you would need driver (and maybe also kernel) changes to get
>>> it to be able to saturate 100Gbps.
>>> A point to point link would have low latencies, so at least the TCP
>>> receive window would not be a problem. A 5 microsecond network latency
>>> would be enough for a 64KB receive window to keep a 100Gbps connection
>>> saturated. With normal network topologies, things are not quite so nice
>>> when you look at the TCP receive window sizes. You would want a TCP receive
>>> window size of ~128MB or more for a 10ms network latency. Possibly more if
>>> latencies vary like on normal networks. Other things like packet loss would
>>> trigger painful effects of TCP congestion control so the congestion control
>>> algorithm would likely need adjustment too.
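
The receive window sizes above follow from the bandwidth-delay product.
A minimal Python sketch, assuming the quoted latency is the delay that
the window has to cover (the function name is made up for this
example):

```python
def window_bytes(link_gbps, latency_sec):
    """Bytes that must be in flight to keep the link full
    (the bandwidth-delay product)."""
    bytes_per_sec = link_gbps * 1e9 / 8
    return bytes_per_sec * latency_sec


point_to_point = window_bytes(100, 5e-6)  # 5 microsecond latency
routed = window_bytes(100, 10e-3)         # 10 millisecond latency
```

The point to point case needs a little under 64KB in flight, while the
10ms case needs about 125MB, which is where the ~128MB figure comes
from.
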
>> That kind of network bandwidth is hard to achieve even with infiniband.
>> Now, granted, 1st gen infiniband kit is dirt cheap (40Gbit/s) if you can
>> live with the 15m distance limit, but even that is a far cry from 300Gbit/s
> 100GbE can do it while infiniband has long supported 80Gbps/112Gbps via
> bonding. Getting the network stack and ZFS to do 100Gbps simultaneously
> would be a pain. With 9KB jumbo ethernet frames, you would need the
> Linux network stack to do ~1.4 million packets per second at the same
> time ZFS is processing 100Gbps of data without the combination of the
> two requiring more resources than the machine has. I do not think it is
> unattainable, but as I said in my previous email, attaining this level
> of performance would be a pain.
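
The packet rate needed for 100Gbps can be checked the same way. This
Python sketch ignores ethernet and TCP/IP header overhead, and the
function name is made up for this example.

```python
def packets_per_second(link_gbps, frame_bytes):
    """Packets per second needed to fill a link with frames of one size."""
    bytes_per_sec = link_gbps * 1e9 / 8
    return bytes_per_sec / frame_bytes


needed_pps = packets_per_second(100, 9000)  # 9KB jumbo frames
```

That gives roughly 1.4 million packets per second, several times the
370,000 packets per second that a single userland sender managed in the
cloudflare numbers.
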
>>> It strikes me that any such system is going to end in tears eventually,
>>> either when hardware needs to be refreshed, or when there is a catastrophic
>>> failure - which on a long enough timeline is a statistical certainty. Just
>>> consider the restore from backup time if you lose the whole pool for any
>>> reason.
>> It sounds like it'd pretty much _begin_ in tears, let alone end in them,
>> hence my original remark about it being a terrible idea.
> I do not think there is anything wrong with putting 300 disks into a ZFS
> pool. Multiple people have put 300 disks or more into a system and it
> works fine. Whether or not a cluster is better is a discussion of the
> point where clustering makes more sense than a single system image machine.
> The benefits of a clustered filesystem for large data sets over ZFS are:
> - Potential for scaling performance beyond what a single system image
> machine can do (HP)
> - Potential for transparent failover when a node fails (HA)
> Usually when you cluster, you do it for either of those two or for both.
> If you need neither, you typically do not cluster.
> The benefits of ZFS for large data sets over clusters are:
> - Strong data integrity
> - Lower costs in equipment, power and rackspace
> - Simplified administration
> There are applications that gain more from ZFS' benefits than they gain
> from a cluster's benefits, even when we are talking about sizes that are
> usually the domain of clusters, such as 7PB. Archival is the obvious
> example, although there are others where the performance characteristics
> are sufficient.
> There are also applications that need the benefits of a cluster more
> than they need the benefits of ZFS, such as providing a beowulf cluster
> with access to unified storage at obscene levels of I/O, like what LLNL
> does. In that case, LLNL uses lustre on top of ZFS. That allows lustre
> to reuse ZFS' checksums to provide strong data integrity guarantees with
> the high performance and high availability of a clustered filesystem.
