[zfs-discuss] maximum number of vdevs

Richard Yao ryao at gentoo.org
Fri Apr 15 15:34:29 EDT 2016

On 04/15/2016 02:38 AM, Gordan Bobic via zfs-discuss wrote:
> On Fri, Apr 15, 2016 at 5:45 AM, Richard Yao <ryao at gentoo.org> wrote:
>> On Apr 12, 2016, at 10:08 AM, Gordan Bobic via zfs-discuss <
>> zfs-discuss at list.zfsonlinux.org> wrote:
>> I never really understood the fascination with unreasonably huge pools.
>> A naive calculation suggests that you would need a 100Gbps network
>> interface for zfs send of a full pool to have any hope of finishing within
>> 48 hours when you are using 10-disk raidz2 vdevs of 8TB disks. It would be
>> potentially longer if there is compression (although this will change in
>> the future when data is sent compressed as it is on disk). However if
>> incremental replication is used, this becomes much less of a problem,
>> especially if we are discussing applications such as data archival.
> Re: compression, there is no reason why the zfs send stream cannot be
> piped through lz4, gzip/pigz, bzip2/pbzip2, xz or another similar
> compressor (depending on the abundance of CPU).
>> Getting zfs replication to operate at 100Gbps would be a pain. There is no
>> way that piping data into ssh and then over the network as is usually done
>> as part of replication would operate at 100Gbps.
> You don't have to do it over ssh (even with hardware acceleration, that
> kind of throughput using AES128 will be unlikely), you can do it over
> netcat.

The rate of a composition of two pipeline stages is roughly
(A * B) / (A + B), where A is the rate of one stage and B is the rate
of the other. Composing the same stage whose rate is R with itself N
times gives R / N. This operation is both commutative and associative,
so reordering the stages makes no difference to the final result.
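That rule can be sketched in a few lines of Python (a small
illustrative helper, not anything from the ZFS code base):

```python
from functools import reduce

def compose_rate(a, b):
    """Effective rate of two pipeline stages running at rates a and b."""
    return (a * b) / (a + b)

def pipeline_rate(rates):
    """Effective rate of any number of stages, in any order."""
    return reduce(compose_rate, rates)

# Composing a stage of rate R with itself N times gives R / N:
assert pipeline_rate([4200, 4200, 4200]) == 4200 / 3

# Commutative and associative, so ordering does not matter:
assert pipeline_rate([1400, 1850]) == pipeline_rate([1850, 1400])
```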

From the numbers in Yann Collet's GitHub README, we have that memcpy
runs at 4200MB/sec and LZ4 runs at 1850MB/sec on a Core i5-4300U
@1.9GHz. Also, LZ4 compression averages a ratio of 2.101.

Using the equation for composing stages with those numbers, what you
suggested involves four memcpy-speed copies in total: three at
4200MB/sec on uncompressed data, plus a final network copy that moves
already-compressed data and so runs at an effective 2.101 * 4200MB/sec,
with one LZ4 compression pass at 1850MB/sec in between. (With netcat
rather than ssh there is no encryption stage; hardware-accelerated AES
would add one more pass at roughly 2000MB/sec.)

4200MB/sec / 3 = 1400MB/sec
(1400MB/sec * 1850MB/sec) / (1400MB/sec + 1850MB/sec) ≈ 800MB/sec
(800MB/sec * 2.101 * 4200MB/sec) / (800MB/sec + 2.101 * 4200MB/sec) ≈
730MB/sec

That would effectively transfer ~1500MB/sec of uncompressed data, which
is greater than 10Gbps (1250MB/sec), but the network link itself would
carry only 730MB/sec, which is not as fast as things could go on 10Gbps
(~2600MB/sec effective after LZ4 compression). Neither is anywhere near
100Gbps (~12500MB/sec before compression, ~26000MB/sec effective after
compression).
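For anyone who wants to replay that arithmetic, here is the same
calculation in Python (rates taken from the README figures above, not
measured here):

```python
# Rates are the quoted README figures, not fresh measurements.
memcpy = 4200.0   # MB/s, single-core memcpy
lz4 = 1850.0      # MB/s, LZ4 compression
ratio = 2.101     # average LZ4 compression factor

def compose(a, b):
    # effective rate of two pipeline stages running at rates a and b
    return (a * b) / (a + b)

three_copies = memcpy / 3                 # 1400 MB/s
with_lz4 = compose(three_copies, lz4)     # ~800 MB/s
# The final copy moves compressed data, so relative to the
# uncompressed stream it runs at ratio * memcpy:
wire = compose(with_lz4, ratio * memcpy)  # ~730 MB/s on the wire
effective = wire * ratio                  # ~1530 MB/s of logical data
```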

If we assume the on-disk data is LZ4-compressed and that improvements
such as compressed ARC and compressed send/receive are available, we
not only avoid any in-kernel decompression, but get compression for
free. That leaves the overhead of a copy into userspace and a copy
into the kernel network stack:

4200MB/sec / 2 = 2100MB/sec

There is also the overhead of checksum verification, but given that it
is processed in parallel (and I lack numbers for it), I am ignoring it
as a simplification (i.e. assuming checksum=off). That is not strictly
right, but I do not expect parallel checksum verification to be the
bottleneck yet.

This definitely can do 10Gbps on modern hardware, assuming that the
disk IO can keep up, the driver does a good job of extracting the data
and the network IO does not slow us down. However, we can push it a
little further by writing a program that hands the network socket's fd
directly to the kernel so that we only do one copy. That would make
throughput 4200MB/sec (~8800MB/sec effective). Everything thus far has
assumed that the network code does not add any copies beyond the copy
into the file descriptor.

This is definitely not enough for 100Gbps, but it is getting there. If
we get less modest hardware than Yann Collet's Core i5 and ensure it
has sufficient disk IO, we would have a chance of attaining 100Gbps
with further improvements.

This of course assumes no bottlenecks are present in the ZFS driver's
ability to marshal buffers from the disks into the file descriptor.
That is not a terrible assumption given that ZFS can take advantage of
multiple cores and multiple in-flight IOs, although a module parameter
or two might need to be changed to keep the ZIO pipeline busy enough.

As for the network IO, I am going to assume that TCP's receive window
and congestion control are properly tuned. Then I can abuse some
numbers from Cloudflare to make an educated guess at the capability of
the network hardware (excluding concerns about TCP receive windows,
latencies and overzealous TCP congestion control).

A userland process sending packets bottlenecks at about 370,000 32-byte
packets per second on their 2GHz Xeon, which, when only one core is in
use, is roughly equivalent to Yann Collet's hardware. There is also
plenty of syscall overhead and other factors, but let's use that number
and assume we can send 9000-byte packets at that rate in the hope that
things balance out. That gives us 9000 bytes/packet * 370,000
packets/sec = ~3,300MB/sec, or ~6,900MB/sec once compression is
considered.
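The packet-rate arithmetic, with the figures carried over from above
(the 9000-byte frame size and the single-sender 370,000 pps figure are
assumptions from the text, not measurements made here):

```python
# Assumed figures: 370,000 packets/sec for a single userland sender
# (the Cloudflare number quoted above), 9000-byte jumbo frames.
pps = 370_000
frame_bytes = 9_000
ratio = 2.101  # average LZ4 compression factor

wire_mb = pps * frame_bytes / 1e6  # ≈3,330 MB/s on the wire
logical_mb = wire_mb * ratio       # ≈7,000 MB/s of uncompressed data
```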

The best Intel cores would probably only yield something like a 50%
improvement, which is far from enough for 100Gbps, but might reach
40Gbps. Getting to 100Gbps would probably need us to make changes to how
the data is shuffled between ZFS and the network stack to do zero copy
and might require multiqueue too. At that point, I would expect the
performance of checksum calculations to be a factor.

This also only discusses reads on the send side, where overhead from
raidz parity and/or write amplification from mirroring are non-issues.
Such things would need to be considered on the receive side, where
processing the receive stream at 100Gbps would probably be harder than
sending it was. We would definitely want to accelerate checksum and
parity calculations with vector instructions in that case. I will avoid
getting started on theoretical performance so this email does not become
much longer.

Anyway, this is just me using some benchmark numbers and a couple of
equations to suggest that replication performance can be pushed far
higher than people typically experience. Reality will likely differ
(especially for LZ4 compression of incompressible data, where
performance is amazing), but I think the boundary of what is possible is
higher than you think. :)

>> The single threaded copy into the pipe buffer, then into userspace and
>> then to network buffers alone would require a memcpy that is able to run at
>> 300Gbps, which is faster than any single CPU core can do memcpy to system
>> memory. If we were to assume a secure point-to-point link between systems,
>> then you could write your own program against libzfs_core to make the ZFS
>> driver write directly to the network socket to avoid the double copy, but I
>> suspect that you would need driver (and maybe also kernel) changes to get
>> it to be able to saturate 100Gbps.
>> A point to point link would have low latencies, so at least the TCP
>> receive window would not be a problem. A 5 microsecond network latency
>> would be enough for a 64KB receive window to keep a 100Gbps connection
>> saturated. With normal network topologies, things are not quite so nice
>> when you look at the TCP receive window sizes. You would want a TCP receive
>> window size of ~128MB or more for a 10ms network latency. Possibly more if
>> latencies vary like on normal networks. Other things like packet loss would
>> trigger painful effects of TCP congestion control so the congestion control
>> algorithm would likely need adjustment too.
> That kind of network bandwidth is hard to achieve even with InfiniBand.
> Now, granted, 1st gen infiniband kit is dirt cheap (40Gbit/s) if you can
> live with the 15m distance limit, but even that is a far cry from 300Gbit/s

100GbE can do it, while InfiniBand has long supported 80Gbps/112Gbps
via bonding. Getting the network stack and ZFS to do 100Gbps simultaneously
would be a pain. With 9KB jumbo ethernet frames, you would need the
Linux network stack to do ~1.4 million packets per second at the same
time ZFS is processing 100Gbps of data without the combination of the
two requiring more resources than the machine has. I do not think it is
unattainable, but as I said in my previous email, attaining this level
of performance would be a pain.
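A quick check of that packet-rate figure (9KB jumbo frames assumed, as
above):

```python
# Packet rate implied by a saturated 100Gbps link with 9KB jumbo frames.
link_bps = 100e9                    # bits per second
frame_bits = 9_000 * 8              # bits per jumbo frame
pps_needed = link_bps / frame_bits  # ≈1.39 million packets/sec
```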

>> It strikes me that any such system is going to end in tears eventually,
>> either when hardware needs to be refreshed, or when there is a catastrophic
>> failure - which on a long enough timeline is a statistical certainty. Just
>> consider the restore from backup time if you lose the whole pool for any
>> reason.
> It sounds like it'd pretty much _begin_ in tears, let alone end in them,
> hence my original remark about it being a terrible idea.

I do not think there is anything wrong with putting 300 disks into a
ZFS pool. Multiple people have put 300 disks or more into a system and
it works fine. Whether a cluster is better comes down to the point at
which clustering makes more sense than a single-system-image machine.

The benefits of a clustered filesystem for large data sets over ZFS are:

- Potential for scaling performance beyond what a single system image
machine can do (HP)
- Potential for transparent failover when a node fails (HA)

Usually when you cluster, you do it for either of those two or for both.
If you need neither, you typically do not cluster.

The benefits of ZFS for large data sets over clusters are:

- Strong data integrity
- Lower costs in equipment, power and rackspace
- Simplified administration

There are applications that gain more from ZFS' benefits than they gain
from a cluster's benefits, even when we are talking about sizes that are
usually the domain of clusters, such as 7PB. Archival is the obvious
example, although there are others where the performance characteristics
are sufficient.

There are also applications that need the benefits of a cluster more
than they need the benefits of ZFS, such as providing a beowulf cluster
with access to unified storage at obscene levels of I/O, like what LLNL
does. In that case, LLNL uses lustre on top of ZFS. That allows lustre
to reuse ZFS' checksums to provide strong data integrity guarantees with
the high performance and high availability of a clustered filesystem.

