[zfs-discuss] raidz1 or raidz2

Niels de Carpentier zfs at decarpentier.com
Sun Nov 4 17:22:08 EST 2012


Steve Costaras wrote:
>
> As for space 'wastage' there is some, but it does not appear as bad as you are making it (check with zdb -ddddd <dataset>). Your model only applies if you have variable block sizes
> SMALLER than your set recordsize, i.e. using say a recordsize of 8KiB as a default and storing files <=128KiB across your vdev. This 'partiality' is always there in different forms.
> If you set your recordsize to 128KiB (max for zfs currently) then you will just leave blocks empty on the drives beyond what you need to store the file (you will waste the
> last remainder, 128KiB - (Filesize % Recordsize)). Whether this is bad or not (in any case above) depends on the average stored file size. ZFS will try (if you let it by setting a
> small recordsize) to use all member disks to spread out I/O load. This is not ideal in all cases, and that's when you should set your array (vdev) and recordsize up appropriately.

That's filesystem-level overhead, which is unrelated to the number of disks in a raidz. The issue here is different: when the system writes a 128KiB block, it spreads it over the data disks in the vdev.
With 4 data disks it writes 32KiB (8 sectors) to each one. With 5 data disks the block no longer splits evenly: 128KiB over 5 disks is 25.6KiB, or 6.4 sectors, per disk, and since each disk
has to receive a whole number of sectors, that rounds up to 7 sectors (28KiB) per disk. You end up writing 140KiB to store a 128KiB block, so you waste space and lose performance.
That is why 2, 4 or 8 data disks are recommended.
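
A rough back-of-the-envelope in Python (my own sketch, assuming 4KiB sectors and ignoring parity and padding):

    # Bytes allocated for one 128KiB record when every data disk must
    # receive the same whole number of 4KiB sectors (parity ignored).
    SECTOR = 4 * 1024
    RECORDSIZE = 128 * 1024

    def bytes_written(data_disks):
        sectors_per_disk = -(-RECORDSIZE // (SECTOR * data_disks))  # ceiling division
        return sectors_per_disk * SECTOR * data_disks

    for n in (2, 4, 5, 8):
        print("%d data disks: %d KiB written for a 128 KiB record"
              % (n, bytes_written(n) // 1024))

    # 2, 4 and 8 data disks come out at exactly 128 KiB; 5 data disks at 140 KiB.
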
>
> Not really, as in this case the issues were due to the drive's firmware which we (general end users) don't have any control over.    How can you reliability code something in software if
> the drive may or may not respond (normal case in ms, worst case 10 minutes)?    This is where time limited recovery comes in.   Unfortunately not all desktop/commercial drives support this
> as a user settable option like the enterprise class drives.   All we can do is hang the bus waiting and then try and time out the command (scsi device layer).   Though since the drive
> itself is at fault here and in large arrays behind other abstraction units (expanders, hba's, etc) it can continue to hang that bus (not respond to software's attempt to reset the
> hba/bus).    The problem really is not the software but the crappy drives themselves.

Yes, the problem is well known. But getting a SCSI timeout is not the only option; you can also just wait until the drive finally responds (it will, eventually).
In that case the system sees the failed read, reconstructs the sector from the other disks and writes it back. The drive then remaps the sector and the problem is solved
(until another sector goes bad, of course). So increasing the SCSI timeouts, so that the system never actually times out, can solve the issue.
The disadvantage is that the system can hang for quite a while, but only once, and you also avoid the bus reset that can cause problems.
If instead you get a SCSI timeout and retry the read, the whole thing starts again and the sector never gets rewritten.
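
For what it's worth, on Linux the per-device SCSI command timeout can be raised through sysfs; a minimal sketch (the disk name and the 600 second value are just examples, and it needs root):

    # Raise the SCSI command timeout for one disk so a long internal
    # error-recovery read does not trigger a timeout and bus reset.
    import sys

    def set_scsi_timeout(disk, seconds):
        path = "/sys/block/%s/device/timeout" % disk  # timeout in seconds
        with open(path, "r") as f:
            old = f.read().strip()
        with open(path, "w") as f:
            f.write("%d\n" % seconds)
        print("%s: timeout %ss -> %ds" % (disk, old, seconds))

    if __name__ == "__main__":
        # e.g.  python set_timeout.py sda 600
        set_scsi_timeout(sys.argv[1], int(sys.argv[2]))
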

Niels
