[zfs-discuss] very poor io performance

Uncle Stoat stoatwblr at gmail.com
Tue Sep 24 13:07:45 EDT 2013


On 24/09/13 10:54, Zbigniew 'zibi' Jarosik wrote:
>> ZIL is only going to be used for synchronous operations. It will not be
>> touched for anything else. For the sake of testing you can try setting
>> sync=disabled on the FS and see if that helps. If it doesn't, ZIL won't
>> help either.
>>
> There is no way to force ZFS to use SSD as a cache for all writes?

Firstly, you need to understand this statement:

"ZIL is not a write cache. It is a write-intent store."

In other words: ZIL is where pending writes are stored in case of disaster.

Under normal circumstances, writes to the main disks come out of RAM.  
The ZIL is only read when recovering from a power failure or system 
crash, to replay transactions that were committed to the log but not yet 
written to the main pool, preserving data integrity.

This is what gives ZFS its completely atomic write structure. Writes are 
either completed or not completed, and during the recovery phase the pool 
can be rolled back to the last known-good transaction in the event of ZIL 
corruption. In the absolute worst case a few writes are lost, but the 
filesystem integrity is maintained.

The ZIL is conceptually similar to (but goes well beyond) full data 
journalling: an external journal device and a dedicated ZIL device have 
broadly similar functions. If no dedicated ZIL device is present, the ZIL 
is striped across the pool's vdev devices.
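
For reference, a dedicated log device can be added to an existing pool 
along these lines (the pool and device names here are placeholders, not 
my layout):

   zpool add tank log mirror /dev/disk/by-id/ssd-one-part1 /dev/disk/by-id/ssd-two-part1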


Having said that:

Yes, you can force all writes to be "synchronous" - see the ZFS man page 
- "zfs set sync=[standard|always|disabled] [FILESYSTEM]"

In this context, all that it means is that pending async writes are 
written to ZIL along with sync ones.
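
As a minimal sketch (the "tank/data" filesystem name is just a placeholder):

   zfs set sync=always tank/data       # push async writes through the ZIL as well
   zfs get sync tank/data              # confirm the current value
   zfs set sync=standard tank/data     # revert to the default behaviour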

This gives you greater protection for async writes, but it won't make 
writing to disk any faster and it will shorten the lifespan of the ZIL 
device - although most ZIL devices have more than enough endurance to 
handle the additional load.

With an async-heavy write load, setting sync=always will slow things down 
slightly, but experiments I've run over the last few days whilst rsyncing 
between ZFS filesystems show that it makes almost zero difference.

Whatever makes rsync perform badly on a ZFS filesystem has something to do 
with the way rsync interacts with the disk, and it makes rsync a poor 
choice for populating an empty directory tree.

If you want to speed things up when first populating a filesystem, use 
"cp -Rn /path/to/source/. /path/to/target/.", followed by "rsync -a 
--size-only --delete-during /path/to/source/. /path/to/target/." to sync 
up the inode times.
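
As a concrete sketch of that two-step approach (the paths are placeholders):

   # 1. bulk copy, skipping anything that already exists at the target
   cp -Rn /path/to/source/. /path/to/target/.
   # 2. let rsync fix up ownership, permissions and timestamps without re-copying data
   rsync -a --size-only --delete-during /path/to/source/. /path/to/target/.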

rsync -T /run/shm (or /dev/shm) will speed up transfers a lot over plain 
rsync, but it's still nowhere near as fast as the method I've just 
described, and it relies on /dev/shm being larger than any of the files 
being synchronised. I don't think the --delay-updates flag would give any 
appreciable rsync speedup and I didn't try it (past experience shows that 
it's good for putting everything in place but not for the initial 
transfer).
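
For completeness, the tmpfs-staging variant looks like this (paths are 
placeholders; -T is rsync's --temp-dir option):

   # stage each file in tmpfs first; /run/shm must be larger than the biggest file
   rsync -a -T /run/shm /path/to/source/. /path/to/target/.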


On my system (19-drive raidz3 using a mix of 7200 and 5400rpm drives, 
plus 256GB cache and 512MB ZIL), averaged across about 1TB of data at 
about 500MB per file:

rsync from one ZFS filesystem to another inside the same rpool = 30MB/s
rsync using -T /run/shm = 60MB/s
cp -Rn = 140-200MB/s - which is close to the theoretical maximum for a 
single vdev using this kind of drive.


Using "rsync -v" and observing output, plus filesystem/disk activity 
showed some odd behaviour: Opening each file resulted in a 5-10 second 
pause before transfer started. The first couple of Mb would go by slowly 
and then the next 300-400 would flash past in 2-3 seconds (writing into 
/run/shm) and then the temporary 500Mb file in /run/shm  would flush to 
disk in 2-3 seconds. Rinse and repeat.
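
(Disk activity can be watched in a second terminal with something like 
the following; the pool name is a placeholder:)

   zpool iostat -v tank 1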

Based on that behaviour, I suspect that rsync is trying to walk the 
entire directory tree on both sides to check for changes each time it 
opens a new file, which would explain the looong pause when opening each 
file (presumably ZFS doesn't like tree walks?)

There was no measurable difference in performance with secondarycache 
set to "all" or "metadata", and toggling l2arc_noprefetch made no 
difference either.
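
For reference, those settings are toggled like this (the filesystem name 
is a placeholder; the module parameter path is as found on ZFS on Linux):

   zfs set secondarycache=metadata tank/data            # or secondarycache=all
   echo 0 > /sys/module/zfs/parameters/l2arc_noprefetch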


Notes:

If deduplication is switched on, write performance will start out "ok" 
and rapidly degrade, because deduplication _will_ turn vdev writes into 
random I/O instead of a lot of sequential operations.

Memory requirements grow rapidly on deduplicated filesystems, and if the 
dedup tables get larger than the designated ARC size, performance will be 
appalling. If you have enough RAM, increase the ARC. If not, add more RAM 
and then increase the ARC.
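
A rough way to check and adjust that on ZFS on Linux (the pool name and 
the 8GiB figure are just examples):

   zpool status -D tank                                       # dedup table (DDT) entry counts and sizes
   echo 8589934592 > /sys/module/zfs/parameters/zfs_arc_max   # raise the ARC ceiling to 8GiB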

If you're using a deduplicated ZFS filesystem for something like a BackupPC 
repository - which is full of hardlinks _and_ duplicate blocks - you are 
going to make ZFS extremely memory-hungry.

Regards
Alan


>> If you are seeing better performance out of XFS you are either falling foul
>> of caching issues or the CoW nature of ZFS. Given that your use-case is
>> rsync, the bottleneck is likely related to metadata operations rather than
>> data operations. Try setting the following options for the zfs kernel
>> module:
>>
>> zfs_arc_max <= 1/3 of your RAM
>> zfs_arc_meta_limit = approx. 75-90% of your zfs_arc_max
>> l2arc_noprefetch=0
>>
> OK, i'll try it today evening.
>
>
>> How are you checking what your L2ARC usage is to be able to make the claim
>> that 40GB of it is used?
>>
> from zpool iostat -v 1:
>
>
>                                                        capacity     operations    bandwidth
> pool                                                 alloc   free   read  write   read  write
> ---------------------------------------------------  -----  -----  -----  -----  -----  -----
> tank                                                 7.94T   185G    947      0   697K      0
>    raidz1                                            7.94T   185G    947      0   697K      0
>      scsi-SATA_WDC_WD30EFRX-68_WD-WMC1T0868519           -      -    250      0  1.71M      0
>      scsi-SATA_WDC_WD30EFRX-68_WD-WMC1T2316319           -      -    260      0  1.82M      0
>      scsi-SATA_WDC_WD30EFRX-68_WD-WMC1T2335838           -      -    306      0  1.90M      0
> logs                                                     -      -      -      -      -      -
>    mirror                                                0  7.94G      0      0      0      0
>      scsi-SATA_PLEXTOR_PX-128MP02326108896-part1         -      -      0      0      0      0
>      scsi-SATA_KINGSTON_SH103S50026B723700A029-part1     -      -      0      0      0      0
> cache                                                    -      -      -      -      -      -
>    scsi-SATA_PLEXTOR_PX-128MP02326108896-part2       40.0G  7.98M      8      0  57.9K  16.0K
>
>
>
>
>> You may also find that you have better results from your L2ARC on a
>> metadata-heavy load like rsync if you set your secondarycache=metadata,
>> especially if you are running L2ARC on zram.
>>
> L2ARC on SSD.
>
>
