[zfs-discuss] zfs databse optimization

Amir Christopher Najmi amir.c.najmi at gmail.com
Thu Dec 29 12:29:26 EST 2016


Hi Gordan,

Great to hear from you! If you prepare any notes or slides for the meet-up, I'd be delighted to see them.

If I read you correctly, Matt's advice is more applicable to raidz vdevs, but for mirrored vdevs, regardless of compression settings, there is still a significant benefit from matching the ZFS max record size to the DB record size?



> On Dec 29, 2016, at 11:37 AM, Gordan Bobic via zfs-discuss <zfs-discuss at list.zfsonlinux.org> wrote:
> 
>> On Thu, Dec 29, 2016 at 4:03 PM, Chris via zfs-discuss <zfs-discuss at list.zfsonlinux.org> wrote:
>> Hi,
>> 
>> I might be running a database on ZFS for a project, what sort of performance optimizations should I plan to make or look into?
> 
> Funnily enough, I'm planning a talk on exactly this subject at the next London MySQL meetup. :-)
>  
>> 
>> I've been reading tuning guides and whatever information I can find about ZFS and databases for planning and preparation, but I'm not sure I understand some of the suggestions, and some of them don't seem consistent with how I understand ZFS to behave.
>> 
>> Consider a straightforward case, a database server (postgres, mariadb, mysql) is running a database on top of a ZFS dataset on local storage. 
>> 
>> The basic stuff:
>> 0) the pool geometry - multiple mirror vdevs are usually better for performance in random IO.
>> 1) primarycache - set this to "metadata" because presumably the database server is where you want to be doing the caching, read-ahead, etc.
> 
> For MySQL/MariaDB, yes, innodb_buffer_pool will be doing all the caching for you.
> Note that the PostgreSQL folks have until _very_ recently been (perhaps erroneously) advising relying on the OS cache rather than shared_buffers. I believe they revised that recently, so the advice is now the same as for MySQL (don't rely on the OS to do the DB's caching).
>  
>> 2) compression - lz4, databases tend to be very compressible, on anything more than a dual-core this is probably free IOPS.
>> 3) atime - relatime=on is the default, but I don't care about atime at all on a database, so probably set atime=off.
>> 4) dedupe - don't touch this
> 
> So far so good.
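A minimal sketch of points 0-4 above as commands (the pool/dataset name "tank/db" is an assumption for illustration, not from the thread):

```shell
# Create a dataset for the database files on an existing pool
zfs create tank/db

# 1) cache only metadata in ARC; let the DB's buffer pool cache data
zfs set primarycache=metadata tank/db
# 2) cheap, usually worthwhile compression
zfs set compression=lz4 tank/db
# 3) no atime updates on a database dataset
zfs set atime=off tank/db
# 4) leave dedup at its default (off) - just verify it
zfs get dedup tank/db
```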
>  
>> 
>> The confusing one:
>> 5) recordsize - I've read the OpenZFS tuning guide, which suggests using recordsize=8k for databases, but then I've also read Matt Ahrens' article about databases, raidz, stripe size, and recordsize:
>> 
>> https://www.delphix.com/blog/delphix-engineering/zfs-raidz-stripe-width-or-how-i-learned-stop-worrying-and-love-raidz
>> 
>> And it seems like if I'm using 4k sector disks then I don't want to touch the recordsize, I want to let compression and ZFS sort out everything on its own. Is that correct? Does anyone have any experience with this?
> 
> It depends. If your data is by and large append-only and your transactions are large (many inserts before a commit), you may well want to use a large recordsize to achieve better compression. In reality, DB workloads are usually not that well behaved and you will get a lot of updates, deletes, and page merges and overflows.
> 
> PostgreSQL uses 8KB pages, InnoDB defaults to 16KB (this can be changed up to 64KB, IIRC, but last I checked it required rebuilding the MySQL package, and when you are upgrading, the upgrade package will have to be built with the same page size). You should set the recordsize to match the DB's page size. Note that there are other storage engines available for MySQL which will have different characteristics and benefit from different page sizes.
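As a sketch of the page-size matching described above (dataset names are hypothetical):

```shell
# InnoDB default page size
zfs set recordsize=16k tank/mysql
# PostgreSQL page size
zfs set recordsize=8k tank/pgdata
```

Note that recordsize only applies to files created after it is set, so do this before loading data.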
> 
> Additionally, DO NOT use InnoDB compression: it is both slower and typically somewhat less efficient, and it is likely to upset your carefully set up block alignment.
> 
> Note that a record is stored within a single vdev, not striped across vdevs, so with mirrored vdevs you don't need to worry about a record spanning vdevs and upsetting alignment.
> 
> Other settings you may want to consider include:
> ZFS:
> logbias=throughput
> 
> MySQL:
> innodb_doublewrite=0 (ZFS is CoW, ergo you cannot get a torn page write, ergo no need to double-write pages to catch partial writes due to a crash)
> innodb_checksum_algorithm=none (ZFS will do the checksumming (and auto-healing) for you, no need to waste CPU cycles on InnoDB doing it as well)
> innodb_flush_method=fsync (default, since you can't use O_DIRECT on ZFS)
> Set transaction isolation level to "read committed". Default is "repeatable read", which has some concurrency performance implications. Make sure your application handles data in a way that is safe for "read committed" isolation level first, of course.
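Pulling the MySQL-side suggestions above together, a sketch of a my.cnf fragment (values exactly as discussed; verify the option names against your MySQL/MariaDB version before relying on them):

```ini
[mysqld]
innodb_doublewrite = 0
innodb_checksum_algorithm = none
innodb_flush_method = fsync
transaction-isolation = READ-COMMITTED
```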
>  
>> 
>> Another case I've considered is exporting a zvol over iscsi from the zfs storage server to a database server. In that case, the default volblocksize is 8k, but again, if I'm planning to use lz4 compression on this workload, is it better to set a larger default volblocksize?
> 
> 
> The above holds. Set it to the DB's page size for optimal results.
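A sketch for the zvol-over-iSCSI case (the size and name are hypothetical):

```shell
# volblocksize=8k to match PostgreSQL pages; use 16k for InnoDB.
# volblocksize is fixed at creation time and cannot be changed later.
zfs create -V 100G -o volblocksize=8k -o compression=lz4 tank/dbvol
```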
> 
>  
>> Also, if the database server and storage server are different machines, it's probably a good idea to set primarycache=all, because the storage server doesn't know about the database, uses its own memory pool, and keeping as much data cached in memory as possible on the storage server will be better for overall database performance.
> 
> 
> 
> If you have RAM to burn on the storage server, maybe - but then you should be running the DB locally on that server. :-)
> For optimal results, unless your OS cache is substantially bigger than (some multiple of) your DB's buffer pool, you should be using the buffer pool rather than the OS cache. Caches further away are only effective if they are some multiple of the size of the cache before them. So you are not likely to gain anything from having primarycache=all on the iSCSI target: unless the cache on that machine is significantly bigger than your DB's buffer pool, the same data will be in both (and it will therefore never be hit in the cache that is further away).
> 
> Gordan
> 
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss at list.zfsonlinux.org
> http://list.zfsonlinux.org/cgi-bin/mailman/listinfo/zfs-discuss

