[zfs-discuss] Improving read IOPS on RAIDZ

Phil Harman phil.harman at gmail.com
Fri Apr 29 09:20:54 EDT 2016


> On 29 Apr 2016, at 12:23, Hans Henrik Happe via zfs-discuss <zfs-discuss at list.zfsonlinux.org> wrote:
> 
> I don't know why this has turned into a "ZFS is so good that there is no room for improvements" thread.

It’s not so much “ZFS is so good that there is no room for improvements” as “ZFS had certain design parameters, and running without checksums was not one of them. We’d rather continue to sleep well at night, confident that ZFS does a great job protecting us from bit rot and other forms of silent data corruption, than try to please all of the people all of the time, because we’d never do that kinda stuff anyway.”

> Anyway, I think the issue is that the ARC tracks ZFS blocks (stripes), so making a special read path for checksum=off is not just about how reads are done, but also about how they are cached. That would be far from a trivial change.

To step outside the current elegant variable length record-based design would, indeed, be a “far from a trivial change”. Why break something beautiful for a less than beautiful use case? From 20+ years at Sun, I know how that conversation goes: “oh, we thought you were serious about protecting your data… perhaps you could try FAT32?”
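
To make Hans Henrik's ARC point concrete: the ARC caches whole ZFS records, one buffer per block, so a sub-record fast path needs new caching machinery too. A rough Python sketch of record-granular caching, using hypothetical names rather than the OpenZFS data structures:

    # One cached buffer per block, sized to the full record (hypothetical
    # structures; the real ARC is far more involved).
    arc = {}   # block id -> bytes of the whole record

    def cached_read(block_id, offset, length, read_full_record):
        buf = arc.get(block_id)
        if buf is None:
            buf = read_full_record(block_id)   # today: whole record, checksum verified
            arc[block_id] = buf                # cache the whole record
        return buf[offset:offset + length]

    # A checksum=off fast path that pulls only 4 KiB from one disk has nothing
    # record-sized to cache here: either allocate the full record and leave
    # holes in it (wasting RAM), or add per-range validity tracking (the costly
    # scatter-gather bookkeeping discussed below).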

> It would be nice to have cheap storage with no "write hole" issue, checksumming at scrub time + self-healing, and good read IOPS. This could be had with read checksumming off. With an alternate checksumming scheme I think we could have it all.

So, you’d like self-healing some of the time, but not all of the time (e.g. I presume it’s ok when reading a whole record)? What would your partial-record ARC look like? How would you manage holes in cached blocks within a record (e.g. would you waste RAM by allocating space for the whole record, or use some form of costly scatter-gather algorithm)? How would you manage atomic updates of records and avoid write holes? And if an app should happen to do a read-modify-write, are you happy to do a partial read, then a full read and then a full write?
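
To put rough numbers on the read-modify-write case, here is a back-of-the-envelope sketch; the 128 KiB recordsize and 6-wide RAIDZ2 layout are assumptions for illustration, not figures from this thread:

    # I/O accounting for a 4 KiB read-modify-write inside one 128 KiB record
    # on an assumed 6-wide RAIDZ2 (4 data + 2 parity columns).
    KIB = 1024
    recordsize   = 128 * KIB
    data_disks   = 4
    parity_disks = 2
    app_io       = 4 * KIB

    column = recordsize // data_disks            # 32 KiB per data column

    # Today (checksum verified on read): the whole record is read so the block
    # checksum can be checked, then rewritten copy-on-write with fresh parity.
    read_today  = recordsize
    write_today = recordsize + column * parity_disks

    # Hypothetical "verify only at scrub" read path: the 4 KiB read touches one
    # disk, but the modify-write still needs the full record back in memory to
    # recompute parity and the new block checksum before writing it out.
    read_hypo  = app_io + recordsize
    write_hypo = write_today

    print(f"today:        read {read_today // KIB} KiB, write {write_today // KIB} KiB")
    print(f"hypothetical: read {read_hypo // KIB} KiB, write {write_hypo // KIB} KiB")

The hypothetical path saves the record-sized read only while the workload stays read-only; the moment it modifies the record, the full read comes back anyway.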

No, life is generally “A, B or C - pick any two” to one extent or another.

Phil

> Cheers,
> Hans Henrik
> 
> On 29-04-2016 11:55, Phil Harman wrote:
>> ZFS came from a culture at Sun that stated: "correctness is a constraint; performance is a goal".
>> 
>> ZFS's number one feature is that it has checksums. The second is that ZFS is self-healing (if you allow it to be). Yes, there's a lot more (snapshots, clones, compression, hybrid pools, send/receive, etc.), but these were the primary design goals.
>> 
>> When it comes to RAIDZ, ZFS's combined volume manager / filesystem and copy-on-write architecture also neatly sidesteps the RAID5 "write hole" issue.
>> 
>> But here's a "fact": quick, safe, cheap - pick any two.
>> 
>> If you want lots of quick, safe 4K random IO, you should probably choose 3-way mirrors, though multiple small RAIDZn vdevs and/or L2ARC and/or ZIL and/or an all-SSD pool might be suitable alternatives.
>> 
>> As cost is obviously more important to you than correctness, perhaps you should just layer ZFS over a RAID controller or metadisk driver? At least then ZFS can still tell you when your data has been corrupted (even if you don't want it to fix it for you).
>> 
>> 
>>> On 29 Apr 2016, at 10:10, Hans Henrik Happe via zfs-discuss <zfs-discuss at list.zfsonlinux.org> wrote:
>>> 
>>> On 29-04-2016 04:52, Edward Ned Harvey (zfsonlinux) wrote:
>>>>> From: zfs-discuss [mailto:zfs-discuss-bounces at list.zfsonlinux.org] On Behalf
>>>>> Of Hans Henrik Happe via zfs-discuss
>>>>> 
>>>>> The intention of my question was to find out if ZFS could be changed to
>>>>> only access one disk when doing small reads on RAIDZ.
>>>> 
>>>> If my understanding is correct that raidz distributes data more like raid-1e, then you don't have to change anything to get the desired behavior. It is already that way by default.
>>> 
>>> It is a bit like RAID1E, but even more complicated. Read [1]. The layout is not the problem. It's that the checksum (not parity) is per block (stripe). Reading data that is on one disk only results in reading from all data disks for that block to verify the checksum.
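
A minimal sketch of the read path described above, with hypothetical names (this is not the OpenZFS code): the checksum in the block pointer covers the whole logical record, so every data column has to be read before any byte of it can be verified or healed.

    import hashlib
    from dataclasses import dataclass

    @dataclass
    class BlockPointer:
        columns: list      # (disk_index, disk_offset, size) for each data column
        checksum: bytes    # fletcher4/sha256 computed over the *whole* record

    def reconstruct_from_parity(disks, bp):
        # Placeholder for the self-healing path (rebuild from parity and
        # rewrite the bad column); details omitted in this sketch.
        raise NotImplementedError

    def read_small(disks, bp, offset, length):
        # Even a 4 KiB request reassembles the full record from all data
        # columns, because the checksum can only be verified over the record.
        record = b"".join(disks[d].read(off, size) for d, off, size in bp.columns)
        if hashlib.sha256(record).digest() != bp.checksum:
            record = reconstruct_from_parity(disks, bp)
        return record[offset:offset + length]

    # With a per-sector checksum, as proposed above, the same request could hit
    # a single data disk, but the record-level verify/heal would be lost on that read.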
>>> 
>>>> Do you have any reason to believe it's not already that way?
>>> 
>>> Again, read [1] and [2].
>>> 
>>>>> It's a fact, that due to the RAIDZ design, ZFS cannot perform as well as
>>>>> regular RAID5/6, where reads basically work like RAID0. This can be a
>>>> 
>>>> I mean... You say "it's a fact," but the thing you're asserting to be fact completely disagrees with everything I know and believe. I'm still not seeing any reason to believe your presumption is correct.
>>> 
>>> I'm sorry that my facts, which are not presumptions, do not match your beliefs. Where are your references?
>>> 
>>> Cheers,
>>> Hans Henrik
>>> 
>>> [1] http://blog.delphix.com/matt/2014/06/06/zfs-raidz-stripe-width/
>>> [2] https://blogs.oracle.com/roch/entry/when_to_and_not_to


