[zfs-discuss] Volume takes 177% of requested size

Uwe Sauter uwe.sauter.de at gmail.com
Wed Dec 31 05:17:55 EST 2014

Hi Richard,

On Wednesday, December 31, 2014 1:10:21 AM UTC+1, Richard wrote:
> On Sat, 2014-12-27 at 09:06 +0100, Bernd P. Ziller wrote: 
> > It's the zvol's blocksize of 512 bytes and the resulting metadata 
> > overhead. (Plus the write overhead for raidz?) 
> > I guess someone with deeper insight can explain this further. :) 
> There are two issues in play in the original question: 
> 1) By default, ZFS volumes have a refreservation so you "can't" run out 
> of disk space. (Quotes because this functionality is actually broken by 
> the #2 below.) When you take a snapshot of a volume with a 
> refreservation, ZFS has to reserve enough space that you can overwrite 
> the entire volume and still keep the old data in the snapshot. 
> Creating the volumes with the -s (sparse) flag skips creating the 
> refreservation, so no space is reserved anywhere. 
Yes, that worked. Setting refreservation=none on a non-sparse volume works, 
too, like Fajar suggested.

> 2) When ZFS dataset blocks hit the raidz layer, they get split up into 
> blocks that are sized based on the disk block size (ashift) and then 
> have parity blocks added. So a 128k dataset block on a 3-disk raidz1 
> gets split into 64k of data per disk, plus 64k of overhead, for a total 
> of 192k. 
This is normal behavior for any RAID-like implementation. But the overhead 
for any kind of redundancy (besides copies > 1) should not be visible at the 
file system level but only at the pool level. And this overhead is a constant 
ratio, only depending on the number of disks and the RAID / redundancy level.

> But if you have non-optimal topologies and/or ashift=12 (4k disks) 
> and/or 512b zvol blocks, this can get absurdly wasteful. For example, a 
> 512b dataset block turns into one 4k block of data on one disk, plus 4k 
> of parity. Thus, 512b of zvol data turned into 8k of raw disk usage, a 
> 16x increase. On a 3-disk raidz1, one expects a 1.5x increase, as in my 
> first example. 
True, but that should not matter at the file system level. At that level I 
should only see the penalty of storing only 512b in a 4k block, an 8x increase. 
As stated above, the overhead for RAID should not matter at the file system level.
Even so, that leaves me wondering why there is a difference between the 
77% overhead that I see and the 700% wasted by a 512b on 4k setup.
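
To make the arithmetic in Richard's two examples easier to check, here is a 
rough Python sketch of how much raw space one block takes on a raidz vdev. 
It is a simplified model, not something taken from this thread: it assumes 
data is rounded up to whole sectors of 2^ashift bytes, that each row of data 
sectors gets nparity parity sectors, and that the total is padded to a 
multiple of nparity + 1 sectors. The function name raidz_asize and its 
parameters are my own.

```python
def raidz_asize(psize, ashift, ndisks, nparity=1):
    """Approximate raw bytes allocated for one block on a raidz vdev.

    Simplified model (an assumption, see the text above), not the
    authoritative ZFS implementation.
    """
    sector = 1 << ashift
    data_sectors = -(-psize // sector)          # round up to whole sectors
    data_disks = ndisks - nparity
    rows = -(-data_sectors // data_disks)       # each row needs its own parity
    total = data_sectors + nparity * rows
    pad = nparity + 1
    total = -(-total // pad) * pad              # pad to a multiple of nparity + 1
    return total * sector


if __name__ == "__main__":
    # (block size in bytes, ashift) pairs on a 3-disk raidz1
    for psize, ashift in [(128 * 1024, 9), (512, 12), (8192, 12)]:
        raw = raidz_asize(psize, ashift, ndisks=3)
        print(f"{psize:>7} B block, ashift={ashift}: "
              f"{raw} B raw, {raw / psize:.1f}x")
```

Under these assumptions the 128k block comes out at 192k (1.5x) and the 
512b block on ashift=12 at 8k (16x), matching the numbers quoted above.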


> Using larger zvol block sizes clearly minimizes the overhead. It also 
> increases the opportunity for compression. At 512b block sizes, there's 
> never any opportunity for compression, except for all-zero blocks. 
> That's because no matter how much you reduce the logical block (unless 
> you can get all the way to 0 bytes), you still have to write something 
> to disk, and on disk, it will be 512b anyway, so it makes no sense to 
> bother writing compressed data that just has to be uncompressed on 
> reads. 

Can't argue against that.
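
The point about compression can be put into numbers with a small Python 
sketch. It only counts whole sectors before and after compression. It 
ignores parity, metadata and any minimum-savings threshold ZFS may apply, 
and the function names are made up for illustration.

```python
def sectors_on_disk(nbytes, ashift):
    """Whole sectors of 2**ashift bytes needed to store nbytes."""
    sector = 1 << ashift
    return -(-nbytes // sector)  # ceiling division


def sectors_saved(block_bytes, compressed_bytes, ashift):
    """Sectors saved by writing the compressed form instead of the raw block."""
    return (sectors_on_disk(block_bytes, ashift)
            - sectors_on_disk(compressed_bytes, ashift))


if __name__ == "__main__":
    # 512b block on 512b sectors: still one sector, however well it compresses
    print(sectors_saved(512, 200, ashift=9))
    # all-zero block: nothing has to be written at all
    print(sectors_saved(512, 0, ashift=9))
    # 128k block compressed 2:1 on 512b sectors
    print(sectors_saved(128 * 1024, 64 * 1024, ashift=9))
```

With a 512b block there is only one sector to begin with, so anything short 
of zero bytes saves nothing, while a 128k block has 256 sectors to give up.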

> Thus, the solution for the original use case (where performance is 
> non-critical and it's read-mostly) is: 
> zfs create storage/Backup/disk -s -b 128k -V 1953514584k 

> For non-archive applications where performance is critical, the typical 
> recommendation is that one should match the zvol block size to the 
> filesystem block size (e.g. 4k for NTFS, unless you change it from the 
> default, which can't be done on a C: drive). 

Those are just recommendations for performance; they don't say anything about 
the use case where you want to back up a complete disk into a ZVOL. And by 
now I think that it is a bad idea to do so as long as ZFS lacks the 
ability to control the inner sector size of a ZVOL. Just think what might 
happen if you back up a 4k disk into a 512b ZVOL... all partitions within 
would shrink by a factor of 8, and most of the data would no longer be 
addressable... that's horror for someone who wants to recover data.
And if you were to just save every partition in its own ZVOL, the problem 
would persist. Today, file systems check the size of the underlying block 
device at creation time, but if you change that afterwards.... ouch.
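
To illustrate the factor of 8: partition tables record start and length in 
sectors (LBAs), not bytes, so the byte range depends on the sector size the 
reader assumes. A minimal Python sketch, with a made-up partition entry 
(start LBA 256, 1,000,000 sectors) and a hypothetical helper byte_range:

```python
def byte_range(start_lba, sector_count, sector_size):
    """Byte offsets (start, end) of a partition for a given sector size."""
    start = start_lba * sector_size
    return start, start + sector_count * sector_size


if __name__ == "__main__":
    # Hypothetical partition entry as written on a 4k-sector disk
    start_lba, sector_count = 256, 1_000_000

    native = byte_range(start_lba, sector_count, 4096)
    misread = byte_range(start_lba, sector_count, 512)

    print("read with 4096b sectors:", native)
    print("read with  512b sectors:", misread)
    # Both the start offset and the length shrink by 4096 / 512 = 8
    print("length ratio:",
          (native[1] - native[0]) // (misread[1] - misread[0]))
```

The same table entry points at a different and much smaller byte range once 
the block device reports 512b sectors, which is the situation I am worried 
about.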

Any thoughts on that "disk recovery" use case?

Happy new year,



> An alternative school of 
> thought--as given by my Nexenta reseller--is that keeping metadata in RAM 
> outweighs some performance impact from read-modify-write; they 
> recommend 64k blocksize/recordsize as a default for everything. 
> -- 
> Richard 
