[zfs-discuss] maximum number of vdevs

Steve Costaras stevecs at chaven.com
Fri Apr 15 04:19:28 EDT 2016


Just to comment that large ZFS deployments are NOT bad.  It comes down
to your usage requirements.  For example, I have many 400TB+ ZFS pools
on single servers.  This is usually done with either 16 or 24 vdevs per
system (depending on the number of bays in each disk tray); each vdev
takes one drive from each physical tray, across multiple trays, which
increases local performance and gives some protection against physical
hardware failures.  Data is relatively static; RPO is 24 hours; RTO is
2 weeks for a single box, which is handled by two LTO-6 drives per
system and Bareos as the backup system.
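
To picture the layout, here is a minimal sketch (device names and the
raidz2 level are purely illustrative, not my exact config): each vdev
takes one bay position from every tray, so losing an entire tray costs
each vdev only a single member.

  # Hypothetical: 6 trays, vdev N is built from bay N of every tray.
  # Substitute real /dev/disk/by-path or vdev alias names as appropriate.
  zpool create tank \
    raidz2 t1-bay1 t2-bay1 t3-bay1 t4-bay1 t5-bay1 t6-bay1 \
    raidz2 t1-bay2 t2-bay2 t3-bay2 t4-bay2 t5-bay2 t6-bay2
  # ...and so on, up to bay 16 or 24 depending on the tray model.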

These systems do have 10Gbit network connections, and as pointed out,
40Gbit Ethernet is cheap on the server side (switch gear is more
expensive, but available on, say, Nexus 5K/7Ks) if you need to do
network replication in addition to the local tape.  Since the data is
relatively static, that is not much of a concern.
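
With a 24-hour RPO and mostly static data, the replication side can be
a simple nightly incremental send.  A rough sketch (pool, dataset and
host names here are made up), after an initial full send has seeded the
remote side:

  # Nightly: snapshot, then ship only the delta since yesterday.
  YESTERDAY=$(date -d yesterday +%F)
  TODAY=$(date +%F)
  zfs snapshot -r tank/data@$TODAY
  zfs send -R -i tank/data@$YESTERDAY tank/data@$TODAY \
      | ssh backuphost zfs receive -Fdu backup

Only the daily churn crosses the wire, which is why the pool size
itself matters much less than the change rate.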

So, just to chime in: the OP really needs to understand his use case;
there is nothing inherently wrong with a solution like that if it fits
that deployment's usage model.

Steve



On 4/15/2016 1:38 AM, Gordan Bobic via zfs-discuss wrote:
> 
> 
> On Fri, Apr 15, 2016 at 5:45 AM, Richard Yao <ryao at gentoo.org
> <mailto:ryao at gentoo.org>> wrote:
> 
> 
>     On Apr 12, 2016, at 10:08 AM, Gordan Bobic via zfs-discuss
>     <zfs-discuss at list.zfsonlinux.org
>     <mailto:zfs-discuss at list.zfsonlinux.org>> wrote:
> 
>>     I never really understood the fascination with unreasonably huge
>>     pools.
> 
>     A naive calculation suggests that you would need a 100Gbps network
>     interface for zfs send of a full pool to have any hope of finishing
>     within 48 hours when you are using 10-disk raidz2 vdevs of 8TB
>     disks. It could take even longer if there is compression
>     (although this will change in the future when data is sent
>     compressed, as it is on disk). However, if incremental replication is
>     used, this becomes much less of a problem, especially if we are
>     discussing applications such as data archival.
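
To put rough numbers on Richard's 48-hour figure (my back-of-the-envelope
arithmetic, not his):

  100 Gbit/s                    ~= 12.5 GB/s
  12.5 GB/s * 48 h (172,800 s)  ~= 2.2 PB moved at full line rate
  10-disk raidz2 of 8TB disks    = ~64 TB usable per vdev
  2.2 PB / 64 TB                ~= ~34 such vdevs

So a full send of a pool in that size class needs the 100Gbps link
saturated for the whole two days; anything slower blows the window.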
> 
> 
> Re: compression, there is no reason why the zfs send stream cannot be
> piped through lz4, gzip/pigz, bzip2/pbzip2, xz or another similar
> compressor (depending on the abundance of CPU).
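
For example, a minimal sketch of that kind of pipeline (snapshot, pool
and host names are hypothetical; lz4 picked for speed, pigz/pbzip2/xz
trade more CPU for a better ratio):

  zfs send tank/data@snap | lz4 -c \
      | ssh backuphost 'lz4 -dc | zfs receive -F backup/data'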
>  
> 
> 
>     Getting zfs replication to operate at 100Gbps would be a pain. There
>     is no way that piping data into ssh and then over the network, as is
>     usually done for replication, would operate at 100Gbps.
> 
> 
> You don't have to do it over ssh (even with hardware acceleration, that
> kind of throughput using AES-128 is unlikely); you can do it over
> netcat.
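
A minimal netcat sketch (port, host and dataset names are made up, and
the exact listen flags vary between netcat variants); note this sends
the stream in the clear, so it only belongs on a trusted or
point-to-point link:

  # Receiver, started first:
  nc -l 9090 | zfs receive -F backup/data

  # Sender:
  zfs send tank/data@snap | nc backuphost 9090

People often drop an mbuffer stage into each side of that pipe to
smooth out the bursts, though that does not remove the per-copy memcpy
cost Richard describes below.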
>  
> 
>     The single-threaded copy into the pipe buffer, then into userspace,
>     and then into network buffers would alone require a memcpy able to
>     run at 300Gbps, which is faster than any single CPU core can do
>     memcpy to system memory. If we were to assume a secure
>     point-to-point link between systems, then you could write your own
>     program against libzfs_core to make the ZFS driver write directly
>     to the network socket and avoid the double copy, but I suspect that
>     you would need driver (and maybe also kernel) changes to get it to
>     be able to saturate 100Gbps.
> 
>     A point-to-point link would have low latencies, so at least the TCP
>     receive window would not be a problem. At a 5 microsecond network
>     latency, a 64KB receive window would be enough to keep a 100Gbps
>     connection saturated. With normal network topologies, things are
>     not quite so nice when you look at the TCP receive window sizes. You
>     would want a TCP receive window of ~128MB or more for a 10ms
>     network latency, and possibly more if latencies vary, as they do on
>     normal networks. Other things like packet loss would trigger the
>     painful effects of TCP congestion control, so the congestion control
>     algorithm would likely need adjustment too.
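
Those window sizes fall straight out of the bandwidth-delay product
(my arithmetic, treating the quoted latencies as round-trip times):

  window >= bandwidth * RTT
  100 Gbit/s * 5 us  = 12.5 GB/s * 0.000005 s ~= 62.5 KB  (~64KB)
  100 Gbit/s * 10 ms = 12.5 GB/s * 0.010 s    ~= 125 MB   (~128MB)

On Linux the autotuning ceiling would have to be raised to let the
window actually grow that large; a sketch with illustrative values:

  sysctl -w net.core.rmem_max=134217728
  sysctl -w net.ipv4.tcp_rmem="4096 87380 134217728"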
> 
> 
> That kind of network bandwidth is hard to achieve even with
> InfiniBand. Now, granted, first-gen InfiniBand kit is dirt cheap
> (40Gbit/s) if you can live with the 15m distance limit, but even that
> is a far cry from 300Gbit/s.
>  
> 
> 
>>     It strikes me that any such system is going to end in tears
>>     eventually, either when hardware needs to be refreshed, or when
>>     there is a catastrophic failure - which on a long enough timeline
>>     is a statistical certainty. Just consider the restore-from-backup
>>     time if you lose the whole pool for any reason.
> 
> 
> It sounds like it'd pretty much _begin_ in tears, let alone end in
> them, hence my original remark about it being a terrible idea.
> 
