[zfs-discuss] very poor io performance
stoatwblr at gmail.com
Tue Sep 24 13:07:45 EDT 2013
On 24/09/13 10:54, Zbigniew 'zibi' Jarosik wrote:
>> ZIL is only going to be used for synchronous operations. It will not be
>> touched for anything else. For the sake of testing you can try setting
>> sync=disabled on the FS and see if that helps. If it doesn't, ZIL won't
>> help either.
> There is no way to force ZFS to use SSD as a cache for all writes?
Firstly, you need to understand this statement:
"ZIL is not a write cache. It is a write-intent store."
In other words: ZIL is where pending writes are stored in case of disaster.
Under normal circumstances writes to the main disks come out of RAM.
ZIL is only read in the event of recovering from a power failure or
system crash, to flush out semi-complete transactions and give data
consistency.
This is what gives ZFS its completely atomic write structure. Writes are
either completed or not completed - and can be rolled back during the
recovery phase to find the last-known-good state in the event of ZIL
corruption. In the absolute worst case a few writes are lost, but
filesystem integrity is maintained.
ZIL is conceptually similar to (but goes well beyond) full data
journalling. An external journal device and a ZIL device have broadly
similar functions; if no dedicated ZIL device is present, the ZIL is
striped across the pool's vdevs.
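If you do decide to add a dedicated ZIL device, it goes in as a log
vdev. A minimal sketch (the pool name and device paths here are
hypothetical):

    # attach a mirrored log (SLOG) device to an existing pool
    zpool add tank log mirror \
        /dev/disk/by-id/ata-SSD_A-part1 \
        /dev/disk/by-id/ata-SSD_B-part1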
Having said that:
Yes you can force all writes to be "synchronous" - see the ZFS man page
- "zfs set sync=[standard|always|disabled] [FILESYSTEM]"
In this context, all that it means is that pending async writes are
written to ZIL along with sync ones.
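For example (the dataset name is hypothetical):

    zfs set sync=always tank/scratch    # push async writes through the ZIL too
    zfs get sync tank/scratch           # confirm the setting took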
This gives you greater protection for async writes, but it won't mean
that writing to disk is any faster, and it will result in decreased
lifespan for the ZIL device - although most ZIL devices have more than
enough endurance to handle the additional load.
In the case of an async-heavy write load, setting sync=always
will slow things down slightly, but experiments I've run over the last
few days whilst rsyncing between ZFS filesystems show that it makes
almost zero difference.
Whatever makes rsync perform badly on ZFS filesystems is something to do
with the way rsync interacts with the disk, and it makes rsync a poor
choice for populating an empty directory tree.
If you want to speed things up when first populating a filesystem, use
"cp -Rn /path/to/source/. /path/to/target/.", followed by "rsync -a
--size-only --delete-during /path/to/source/. /path/to/target/." to sync
up the inode times.
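Spelled out as a runnable pair of commands (the paths are placeholders):

    # 1. fast bulk copy; -n refuses to overwrite anything already present
    cp -Rn /path/to/source/. /path/to/target/.
    # 2. catch-up pass; --size-only skips re-checking files whose size matches
    rsync -a --size-only --delete-during /path/to/source/. /path/to/target/.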
rsync -T /run/shm (or /dev/shm) will speed up transfers a lot over
normal rsync, but it's still nowhere near as fast as the method I've
just described, and it relies on /dev/shm being larger than any of the
files being synchronised. I don't think using the --delay-updates flag
would result in any appreciable rsync speedup, and I didn't try it (past
experience shows that it's good for putting everything in place but not
for the initial transfer).
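For completeness, that invocation looks like this (paths are
placeholders again):

    # stage each file in tmpfs, then move it into place on the ZFS target
    rsync -a -T /run/shm /path/to/source/. /path/to/target/.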
On my system (19-drive raidz3 using a mix of 7200 and 5400rpm drives,
plus a 256GB cache and a 512MB ZIL), averaged across about 1TB of data
at about 500MB per file:

rsync from one ZFS filesystem to another inside the same rpool = 30MB/s
using -T /run/shm = 60MB/s
using cp -Rn = 140-200MB/s - which is close to the theoretical maximum
for a single vdev using this kind of drive.
Using "rsync -v" and observing output, plus filesystem/disk activity
showed some odd behaviour: Opening each file resulted in a 5-10 second
pause before transfer started. The first couple of MB would go by slowly
and then the next 300-400 would flash past in 2-3 seconds (writing into
/run/shm), and then the temporary 500MB file in /run/shm would flush to
disk in 2-3 seconds. Rinse and repeat.
Based on that behaviour, I suspect that rsync is trying to walk the
entire directory tree on both sides to check for changes each time it
opens a new file, which would explain the looong pause when opening each
file (presumably ZFS doesn't like tree walks?)
There was no measurable difference between performance with
secondarycache set to "all" or "metadata", and toggling
l2arc_noprefetch made no difference either.
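For reference, the two knobs in question (the dataset name is
hypothetical):

    zfs set secondarycache=metadata tank/data    # keep only metadata in L2ARC
    echo 1 > /sys/module/zfs/parameters/l2arc_noprefetch    # don't cache prefetch streams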
If deduplication is switched on, write performance will start out "ok"
and rapidly degrade, because deduplication _will_ result in vdev writes
containing random io, instead of a lot of sequential operations.
Memory requirements grow quickly on deduplicated filesystems - the dedup
tables scale with the number of unique blocks - and if they get larger
than the designated ARC size, performance will be appalling. If you have
enough ram then increase the ARC. If not, add more ram and then increase
the ARC.
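You can check how big the dedup tables actually are before throwing RAM
at them:

    zpool status -D tank    # DDT entry counts and in-core/on-disk sizes
    zdb -DD tank            # detailed DDT histogram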
If you're using a dedupe ZFS partition for something like a BackupPC
repository - which is full of hardlinks _and_ duplicate blocks - you are
going to make ZFS extremely memory hungry.
>> If you are seeing better performance out of XFS you are either falling foul
>> of caching issues or the CoW nature of ZFS. Given that your use-case is
>> rsync, the bottleneck is likely related to metadata operations rather than
>> data operations. Try setting the following options for the zfs kernel module:
>> zfs_arc_max <= 1/3 of your RAM
>> zfs_arc_meta_limit = approx. 75-90% of your zfs_arc_max
> OK, i'll try it today evening.
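For anyone following along, those options go in /etc/modprobe.d/zfs.conf
and take effect on the next module load/reboot. The numbers below are a
hypothetical example for a machine with 24GB of RAM:

    # /etc/modprobe.d/zfs.conf - sizes in bytes; scale these to your own RAM
    options zfs zfs_arc_max=8589934592 zfs_arc_meta_limit=6871947673
    # 8589934592 = 8GB (1/3 of 24GB); 6871947673 = ~80% of zfs_arc_max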
>> How are you checking what your L2ARC usage is to be able to make the claim
>> that 40GB of it is used?
> from zpool iostat -v 1:
>                                                      operations     bandwidth
> pool                                              alloc   free   read  write   read  write
> ------------------------------------------------  -----  -----  -----  -----  -----  -----
> tank                                              7.94T   185G    947      0   697K      0
>   raidz1                                          7.94T   185G    947      0   697K      0
>     scsi-SATA_WDC_WD30EFRX-68_WD-WMC1T0868519         -      -    250      0  1.71M      0
>     scsi-SATA_WDC_WD30EFRX-68_WD-WMC1T2316319         -      -    260      0  1.82M      0
>     scsi-SATA_WDC_WD30EFRX-68_WD-WMC1T2335838         -      -    306      0  1.90M      0
> logs                                                  -      -      -      -      -      -
>   mirror                                              0  7.94G      0      0      0      0
>     scsi-SATA_PLEXTOR_PX-128MP02326108896-part1       -      -      0      0      0      0
>     scsi-SATA_KINGSTON_SH103S50026B723700A029-part1   -      -      0      0      0      0
> cache                                                 -      -      -      -      -      -
>   scsi-SATA_PLEXTOR_PX-128MP02326108896-part2      40.0G  7.98M      8      0  57.9K  16.0K
>> You may also find that you have better results from your L2ARC on a
>> metadata-heavy load like rsync if you set your secondarycache=metadata,
>> especially if you are running L2ARC on zram.
> L2ARC on SSD.
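Incidentally, the ARC kstats break the L2ARC numbers out directly, which
is a bit easier to read than zpool iostat:

    grep '^l2_' /proc/spl/kstat/zfs/arcstats    # l2_size, l2_hits, l2_misses, etc.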