ZFS : Tunning and Optimisation for MySQL / MariaDB

MySQL on ZFS - Performance

The recordsize property gives the maximum size of a logical block in a ZFS dataset. The default recordsize is 128 KB. Many database engines prefer smaller blocks, such as 4 KB or 8 KB. It makes sense to change the recordsize on datasets dedicated to such files. [8]

The most important tuning you can perform for an application is the dataset block size. If an application consistently writes blocks of a certain size, recordsize should match the block size used by the application. This becomes really important with databases [8]

Write amplification happens when changing a small amount of data requires writing a large amount of data. Suppose you must change 8 KB in the middle of a 128 KB block. ZFS must read the 128 KB, modify 8 KB somewhere in it, calculate a new checksum, and write the new 128 KB block. [8]

Newer versions of OpenZFS also contain a redundant_meta-data property, which defaults to all. This is the original behavior from previous versions of ZFS. However, this property can also be set to most, which causes ZFS to reduce the number of copies of some types of metadata that it keeps. [8]

Depending on your needs and workload, allowing the database engine to manage caching might be better. ZFS defaults to cach- ing much or all of the data from your database in the ARC, while the database engine keeps its own cache, resulting in wasteful double caching. Setting the primarycache property to metadata rather than the default all tells ZFS to avoid caching actual data in the ARC. The secondarycache property similarly controls the L2ARC. [8]

ZFS TUNING AND OPTIMISATION

sync=disabled
atime=off
compression=lz4
logbias=throughput
primarycache=metadata
recordsize=16k
xattr=sa
redundant_metadata=most
  • sync=disabled : disable all explicit disk flushing to any one file system. The background commit thread will still flush the written data to disk every 5 seconds, but flushing the data once per 5 seconds instead of for each transaction

zfs set sync=disabled

In technical terms, sync=disabled tells ZFS “when an application requests that you sync() before returning, lie to it.” If you don’t have applications explicitly calling sync(), this doesn’t result in any difference at all. If you do, it tremendously increases write performance… but, remember, it does so by lying to applications that specifically request that a set of data be safely committed to disk before they do anything else. TL;DR: don’t do this unless you’re absolutely sure you don’t give a crap about your applications’ data consistency safeguards! [4]

  • compression=lz4 : LZ4 compression is very fast. So fast, in fact, that if you have a modern processor, you are likely to be able to decompress the data much faster than your SSD can read the compressed data.

  • logbias=throughput : “If logbias is set to throughput, ZFS does not use the pool’s separate log devices. Instead, ZFS optimizes synchronous operations for global pool throughput and efficient use of resources. The default value is latency. For most configurations, the default value is recommended. Using the logbias=throughput value might improve performance for writing database files.”

  • primarycache=metadata Since InnoDB already does it’s own caching via the Buffer Pool, there is no benefit to also caching the data in the page cache, or in the case of ZFS, ARC. Since O_DIRECT doesn’t disable caching on ZFS, we can disable caching of data in the ARC by setting primarycache to only cache metadata (default is “all”).

  • recordsize=16k : InnoDB uses 16KB pages. Using this parameter, we tell ZFS to limit the block size to 16KB. This prevents multiple pages from being written in one larger block, as that would necessitate reading the entire larger block for future updates to any one page in that block.

  • xattr=sa: This instructs ZFS to store extended file attributes in the file’s inode instead of a hidden directory, since storing it in the inode is more efficient. This is only really relevant on systems that use SELinux, since this uses extended attributes for storing security contexts of files.

  • autoreplace=on : automatically activate hot spare drives if another drive fails [7]

  • redundant_metadata=most : ZFS stores an extra copy of all metadata by default, beyond the redundancy provided by mirroring. Because we’re prioritizing performance for a write-intensive workload, we lower this level of redundancy

command:

sudo zfs set  sync=disabled \
atime=off \
compression=lz4 \
logbias=throughput \
primarycache=metadata \
recordsize=16k \
xattr=sa \
redundant_metadata=most  default/containers/NAME
  • default : zfs pool name
  • NAME : lxc container name

Querying ZFS Properties

sudo zfs get all default/containers/NAME

MYSQL TUNING AND OPTIMIZATION FOR ZFS

# tunning for zfs 
innodb_log_write_ahead_size = 16384
innodb_doublewrite = 0
innodb_use_native_aio = 0
innodb_use_atomic_writes = 0
innodb_compression_default = OFF

innodb_flush_log_at_trx_commit=0 # disable disk flushing
sync_binlog = 0 # disable disk flushing

innodb_file_per_table = ON
  • innodb_log_write_ahead_size=16384 Optimise the interaction between the storage subsystem and MySQL / MariaDB. InnoDB defaults to 16KB page size for tablespaces, using the innodb_log_write_ahead_size option. That reduces the requirement to 16KB blocks for the InnoDB tablespaces and logs, and no fixed block size for the replication logs. If we are keeping the InnoDB logs in the data directory (as is the default), we should set this to what we set the ZFS recordsize=16K to.

  • innodb_doublewrite = 0: InnoDB double-write has one purpose – to prevent torn pages from corrupting the data in case of application crash. Because commits on ZFS are atomic, and we have aligned the InnoDB page size and innodb_log_write_ahead_size with ZFS recordsize, a torn page cannot occur – either the entire block is written, or the entire block is lost.

  • innodb_flush_neighbors=0: When it is time to flush out a page to disk, InnoDB also flushes all of the nearby pages as well. This is beneficial on spinning rust and with traditional file systems, this typically just wastes disk I/O. It is better to disable it on ZFS, even when using mechanical disks.

  • innodb_use_native_aio=0: On Linux, the ZFS driver’s AIO implementation is a compatibility shim.

  • innodb_use_atomic_writes=0: fully use of ZFS native AIO

  • innodb_compression_default=OFF: ZFS LZ4 compression is incredibly fast, so we should leave compression to ZFS and not use InnoDB’s built in page compression. Leaving compression to ZFS doesn’t disrupt the block alignment.

  • innodb_flush_log_at_trx_commit=0: disable disk flushing to temporarily boost write performance without restarting mysqld

  • sync_binlog=0: disable disk flushing

  • innodb_file_per_table=ON: store tables in individual files, for much easier backup, recovery, or relocation: [7]

  • innodb_io_capacity=1000: increase target IOPS above the defaults. [7]

  • innodb_io_capacity_max=2500: increase max IOPS above the defaults. [7]

SECTOR SIZE

Most modern disks have 4KB hardware sectors, even of they emulate 512 byte sectors. 4KB writes is the most commonly used block size on most file systems. it is usually a good idea to explicitly instruct it to create a pool with 4KB sector size using the ashift parameter:

zpool create -o ashift=12 tank

Tuning ZFS for Database Products

Benchmark