[zfs-discuss] Clarifying questions

Lists lists at benjamindsmith.com
Tue Jan 7 14:07:35 EST 2014


Gordon,

On 01/07/2014 02:02 AM, Gordan Bobic wrote:
>
>         1a) If (1) is correct, we could in theory have a perpetual
>     loop on the master creating snapshots, sending them, then
>     destroying the snapshots before repeating, creating a
>     near-real-time off-site copy? We're thinking disaster recovery
>     ability...
>
>
> You would be better off with lsyncd for continuous asynchronous 
> replication.

My concerns with lsyncd are:
     A) The long duration of the initial rsync before the "near real 
time" replication kicks in. In the best of situations this already takes 
over a day, *just to crawl through all the files*, so applying security 
updates and rebooting means that async replication is effectively down 
for at least 24 hours.
     B) Simple resource exhaustion. Our first test of lsyncd, in a fairly 
limited scenario, required heavy increases in kernel parameters. (I 
forget which at the moment.)

I'm hoping ZFS + snapshots + send/receive would perform significantly 
better, since my understanding is that ZFS already knows the delta 
between a snapshot and the previous one, so it never has to crawl the 
filesystem to find the changes.
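
Roughly what I have in mind is a loop like the following (just a sketch; 
the dataset name tank/files, the DR host name and the snapshot names are 
all made up):

    # one-time full copy of an initial snapshot
    zfs snapshot tank/files@base
    zfs send tank/files@base | ssh dr-host zfs receive -F tank/files

    # repeated: send only the blocks changed since the last common
    # snapshot, then drop the older snapshot on the source
    zfs snapshot tank/files@2014-01-07-1400
    zfs send -i tank/files@base tank/files@2014-01-07-1400 | \
        ssh dr-host zfs receive -F tank/files
    zfs destroy tank/files@base

If that holds up, no directory crawl should ever be needed, which is 
exactly the part that hurts with rsync/lsyncd.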

>
>         1b) For the case where we are having to fail to the off site
>     disaster recovery data set: although it's not openly said as such,
>     it would seem that we would want to zfs rollback to the most
>     recent snapshot sent to the DR ZFS server.
>
>
> You could roll back to any snapshot you sent to it. Or you could fail 
> over to it and use the data set that is as current as was plausibly 
> possible if you are using lsyncd.
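
For reference, the fail-over step on the DR box would then be something 
like this (same hypothetical tank/files dataset and snapshot name as in 
my sketch above):

    # see which snapshots actually made it across
    zfs list -t snapshot -r tank/files

    # discard anything newer and fall back to the last fully received one
    zfs rollback -r tank/files@2014-01-07-1400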

Although you confirm that zfs send -i / receive would give me the 
rsync-like "send over only the changes" behaviour I'm after, you then 
immediately suggest an rsync-based tool instead. Rsync is an extremely 
valuable tool, but it has the caveats mentioned above (see the lsyncd 
issues, point A) for routine offsite backups and/or near-realtime 
replication.

Why do you prefer using rsync?

>     2) Is there any problem sending a filesystem from a larger pool to
>     a smaller pool if the receiving pool has plenty of disk space in
>     its pool?
>
>
> Not a problem, as long as what you are sending will fit.

Thanks

>     3) Is scrubbing weekly often enough? How do I get notified if it
>     fails? I'm thinking of mdadm's monitor mode for a RAID 1/5 array,
>     where it sends an overly polite email message that the array is
>     degraded...
>
>
> You write a monitoring script to do whatever you want based on zpool 
> status output.

Can do, was hoping there was something "built in".
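
Something along these lines in cron will probably do (a rough sketch; 
the mail command and recipient are placeholders):

    #!/bin/sh
    # zpool status -x only reports pools with problems; it prints
    # "all pools are healthy" when there is nothing to complain about
    STATUS="$(zpool status -x)"
    if [ "$STATUS" != "all pools are healthy" ]; then
        echo "$STATUS" | mail -s "zpool problem on $(hostname)" root
    fi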

>
>     5) Since fragmentation is an issue with COW filesystems, am I
>     right in guessing that SSDs suffer much less in comparison to
>     spinning rust in performance degradation as fragmentation
>     increases? I don't see any way to "defrag" a ZFS pool / filesystem.
>
>
>
> Yes. For example, the dedupe feature (which massively increases 
> fragmentation) is only really usable on SSDs if you expect more than 
> about 10MB/s/vdev of throughput (more realistically 2.5-3MB/s)

Dedupe seems like a massive pain in the neck. It would offer virtually 
no benefit for our use case anyway, even if it worked perfectly.

>     6) After purchasing the hardware, I read that it's best to have a
>     "power of two" + 1 drives for RAIDZ1 performance - in other words,
>     3 or 5 drives (I don't have 9 bays). I chose 4 because we have 8
>     bays in the 2U server purchased, and I want to be able to add
>     capacity without downtime: add 4x $BIGGER TB drives, then replace
>     the current 4 TB drives. Is this performance "sweet spot" at a
>     power of two (+1) drives enough of a reason to change my plans?
>
>
> Depends on what you are storing. The smaller the transaction commit 
> size (i.e. the block size that will be committed, between ashift and 
> 128KB) the more space overheads and performance degradation you are 
> likely to experience. With spinning rust, you probably need all the 
> help you can get.
>
> With 4TB disks, 3+1 is, IMO, a woefully inadequate level of 
> redundancy, to the point where statistically you are likely to lose 
> some data during resilvering every time you have a disk failure. 2+2 
> RAIDZ2 is a much saner option.

Our file stores are redundant at the application level. The ZFS server 
in question would be implemented as a 3rd store, augmenting two file 
store servers currently running EXT4 on RAID, so if the ZFS pool 
actually did fail during a resilver, the effect would be minimal.

I'm more interested in maintaining write performance today without 
sabotaging future expandability. 2x 4 drives is easy; 2x 5 drives is 
only doable with eSATA or something.

Following your lead, I could do RAIDZ2 with 6 drives total instead of 4, 
but then adding another vdev of similar proportions would require an 
eSATA solution or something. Clumsy...
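
For reference, the 2x 4 drive route (2+2 RAIDZ2 per vdev, as you 
suggest) would look roughly like this - device names are placeholders:

    # today: one 4-drive RAIDZ2 vdev (2 data + 2 parity)
    zpool create tank raidz2 sda sdb sdc sdd

    # later: fill the remaining 4 bays with a second RAIDZ2 vdev,
    # added online with no downtime
    zpool add tank raidz2 sde sdf sdg sdh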

Looks like it's time to do some performance testing...

Thanks

Ben
