In this article, we take a look at ZFS, one of the modern file systems focused on reliability and performance. The Zettabyte File System (ZFS) is an advanced file system originally developed by Sun Microsystems. The feature that most clearly distinguishes it from others is that it combines very large storage capacity with strong protection against data corruption.
Introduction to ZFS File System: What is ZFS?
ZFS was originally developed by Sun Microsystems as part of Solaris. After the OpenSolaris project was launched, ZFS was published as open-source software.
With Oracle's acquisition of Sun Microsystems in 2010, publication of the ZFS source code stopped and the project was effectively turned back into closed source.
In response, the Illumos project was launched to continue open-source Solaris, and Matt Ahrens, one of the architects of the ZFS file system, together with a group of other developers, started the OpenZFS project for the continued development of ZFS.
Today, OpenZFS can be used on Unix-based systems (Solaris, OpenSolaris) as well as on Unix-like kernels (FreeBSD, Linux distributions).
ZFS Architecture
To understand the ZFS file format well, it is necessary to understand its architecture. ZFS combines the traditional volume-manager and file-system layers and uses a copy-on-write mechanism.
With copy-on-write, as long as a copy and its source have not changed, there is no need to allocate new space for the copy: both refer to the same blocks. The actual copying only happens when the copied data is modified.
The ZFS file system is quite different from traditional file systems and RAID arrays. The first cornerstones to understand in this structure are the zpool, the vdev, and the device.
zpool
At the top of the ZFS structure is the zpool. Each zpool contains vdevs, which in turn contain devices. Physically, there can be more than one zpool on a computer, but each zpool is completely separate from the others; a vdev cannot be shared between zpools.
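As a minimal sketch of that hierarchy (the pool name and /dev/sd* device names are hypothetical and will differ on your system), a pool is created from vdevs and inspected with the standard commands:

    # create a pool named "tank" from a single mirror vdev containing two devices
    zpool create tank mirror /dev/sda /dev/sdb

    # show the hierarchy: pool (tank) -> vdev (mirror-0) -> devices (sda, sdb)
    zpool status tank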
Redundancy in ZFS exists at the vdev level, not at the zpool level; there is no redundancy across vdevs at all. If a storage vdev or a SPECIAL vdev is lost, the whole zpool is lost with it.
In modern zpool layouts, the loss of a CACHE or LOG vdev can be survived, although a LOG vdev that fails at the wrong moment, for example during a brief power failure or crash, may lose the small amount of data it was holding.
Contrary to what is often assumed, writes in ZFS are not striped rigidly across the pool. A zpool is not a classic RAID0 implementation; it is closer to a JBOD with a more sophisticated, adaptable distribution mechanism.
Written data is normally distributed among the existing vdevs according to their free space, so that, in theory, all vdevs fill up at roughly the same time.
In the latest versions of ZFS, vdev utilization is also taken into account: if one vdev is much busier or more heavily loaded than the others, it can be temporarily skipped for new writes.
Thanks to this allocation mechanism built into modern ZFS, latency can be kept down during periods of heavier load.
However, this is not a reason to mix classic hard drives with today's fast solid-state drives (SSDs) in the same pool. A pool built from such a mismatched set will tend to perform as slowly as its slowest device.
vdev
A vdev (virtual device) is what a zpool is made of. Each vdev in a zpool consists of one or more devices. Although most vdevs are used for plain storage, there are also support vdev classes: CACHE, LOG, and SPECIAL. Each vdev can use one of five topologies: a single device, RAIDz1, RAIDz2, RAIDz3, or a mirror.
Three of these topologies, RAIDz1, RAIDz2, and RAIDz3, are what storage experts call "diagonal parity RAID". The 1, 2, and 3 indicate how many parity blocks are allocated to each data stripe.
Instead of placing the parity on dedicated disks, a RAIDz vdev distributes the parity blocks evenly across all the disks in the vdev. A RAIDz vdev can lose as many disks as it has parity blocks; if it loses one more, the vdev, and the zpool with it, is lost.
In vdevs used for mirroring, every block is stored on every device in the vdev. Although two-wide mirrors are the most common, a mirror vdev can contain any number of devices.
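As a hedged sketch (device names again hypothetical), the common mirror and RAIDz topologies look like this at creation time:

    # a pool made of two 2-way mirror vdevs (four disks in total)
    zpool create mirrpool mirror /dev/sda /dev/sdb mirror /dev/sdc /dev/sdd

    # a pool with a single six-disk RAIDz2 vdev (survives the loss of any two disks)
    zpool create rzpool raidz2 /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj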
In large deployments, three-way mirrors are commonly used for higher performance and better fault tolerance. A mirror vdev survives any failure as long as even one of its devices remains healthy.
Single-device vdevs, by contrast, are genuinely dangerous. If that one device fails in a storage or SPECIAL vdev, the entire zpool is lost with it.
CACHE, LOG, and SPECIAL vdevs can be built with any of the topologies above, but keep in mind that losing a SPECIAL vdev means losing the pool, so a redundant topology is strongly recommended for it.
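For illustration only (device names hypothetical), support vdevs are attached to an existing pool with zpool add; note the mirrored SPECIAL vdev, since losing it loses the pool:

    # add a CACHE vdev (L2ARC); no redundancy needed, losing it is harmless
    zpool add tank cache /dev/nvme0n1

    # add a mirrored LOG vdev for the ZIL
    zpool add tank log mirror /dev/nvme1n1 /dev/nvme2n1

    # add a mirrored SPECIAL vdev for metadata
    zpool add tank special mirror /dev/sdk /dev/sdl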
Device
This is the most straightforward of the ZFS terms mentioned so far. Unlike zpools and vdevs, a device is simply a random-access block device, as the name suggests. As a reminder, vdevs are made up of devices, and zpools are made up of vdevs.
Hard drives (HDDs) and solid-state drives (SSDs) are the block devices most often used as the building blocks of vdevs, but anything that presents itself to the system as a random-access block device will work. Hardware RAID arrays, for example, can be used as individual devices.
A plain raw file is the most important alternative block device a vdev can be built from. Pools made of sparse files are a convenient way to practice zpool commands and to see how much usable space a given topology would provide.
Say you are going to build a server from eight 10 TB disks but are not yet sure which topology is best. A RAIDz2 vdev consisting of those eight 10 TB disks offers roughly 50 TiB of usable capacity.
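That capacity figure can be checked without buying any hardware, using the sparse-file trick mentioned above; a rough sketch (paths and pool name are invented):

    # eight 10 TB sparse files standing in for real disks
    for i in 0 1 2 3 4 5 6 7; do truncate -s 10T /tmp/disk$i.raw; done

    # build a throwaway RAIDz2 pool from them and read off the usable capacity
    zpool create testpool raidz2 /tmp/disk0.raw /tmp/disk1.raw /tmp/disk2.raw /tmp/disk3.raw \
        /tmp/disk4.raw /tmp/disk5.raw /tmp/disk6.raw /tmp/disk7.raw
    zfs list testpool

    # clean up
    zpool destroy testpool
    rm /tmp/disk?.raw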
One class of devices deserves separate mention: the hot spare. Unlike normal devices, hot-spare devices do not belong to a single vdev but to the entire pool.
If a device in the pool fails, a hot spare attaches itself to the degraded vdev and stands in until a replacement disk is added to the pool. Once attached to the degraded virtual device, the spare begins rebuilding copies of the lost data from the remaining redundancy.
In conventional RAID this step is called "rebuilding"; in the ZFS file system it is called "resilvering".
One important point to keep in mind is that spare devices do not replace failed devices permanently.
Their only purpose is to cover the period during which the vdev is degraded. Once the failed device is actually replaced, the spare detaches itself from the vdev and returns to the pool.
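A minimal sketch of hot-spare handling (device names hypothetical):

    # add a pool-wide hot spare
    zpool add tank spare /dev/sdm

    # after a disk failure, zpool status shows the spare attached and resilvering
    zpool status tank

    # once the failed disk has been physically replaced, the spare detaches again
    zpool replace tank /dev/sdc /dev/sdn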
Blocks, Data Sets and Sectors
The next set of things to understand in order to make sense of the rather complex ZFS file system has more to do with how storage is organized than with hardware.
Data Sets (Dataset)
A ZFS dataset looks like "just another folder", similar to a folder in a normal file system. As with a conventionally mounted file system, however, each dataset in ZFS has its own properties.
One of those properties is the quota you can set on a dataset. For example, if you run zfs set quota=100G poolname/datasetname, the folder mounted on the system at /poolname/datasetname will not accept more than 100 GiB of data.
There is one more thing to note about this example. Datasets in ZFS form a hierarchy, and there is no leading slash in dataset names: the name starts with the pool name, so a dataset is referred to in the form poolname/parent/child.
By default, the mount point of each dataset follows this hierarchy with a leading "/": the pool is mounted at /pool, a parent dataset at /pool/parent, and a child dataset at /pool/parent/child. Although the hierarchy always exists, the point at which a dataset is mounted can be changed.
If we run zfs set mountpoint=/lol pool/parent/child, the dataset pool/parent/child will instead be mounted on the system at /lol.
In addition to datasets, zvols are worth mentioning. A zvol basically resembles a dataset, except that it does not contain a file system: it is simply exposed as a block device.
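A quick sketch of creating a zvol (the size, pool, and names are invented for the example):

    # create a 10 GiB zvol; volblocksize plays the role recordsize plays for datasets
    zfs create -V 10G -o volblocksize=16K pool/myzvol

    # it shows up as a plain block device that can be formatted or exported over iSCSI
    ls -l /dev/zvol/pool/myzvol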
Blocks
In a ZFS pool, all data, including metadata, is stored in blocks. The recordsize property sets a maximum block size for each dataset.
The recordsize can be changed at any time, but blocks that have already been written keep their existing size; the new value applies only to blocks written afterwards.
If the default configuration has not been changed, the recordsize is 128 KiB, a compromise that gives neither the best nor the worst performance.
The recordsize can be set anywhere from 4K to 1M. For special cases it can be raised even further, but this is rarely done.
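A hedged example of tuning recordsize per dataset (dataset names invented; the right value depends entirely on the workload):

    # large sequential files (backups, media) usually like big records
    zfs set recordsize=1M pool/media

    # databases doing small random I/O usually like small records
    zfs set recordsize=16K pool/db

    # check the values currently in effect
    zfs get recordsize pool/media pool/db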
A block only ever contains data from a single file; you cannot pack two files into one block. Each file generally consists of multiple blocks, depending on its size.
If a file is smaller than the recordsize, it is stored in a correspondingly smaller block. A block holding a 2 KiB file, for example, occupies only a single 4 KiB sector on disk.
If a file is large enough to need several blocks, all of those blocks are recordsize in length, including the last one, which may therefore contain some unused space.
Zvols do not have a recordsize property; they use the roughly equivalent volblocksize property instead.
Sectors
A sector is the smallest physical unit that can be written to or read from the underlying device. For many years most storage drives used 512-byte sectors; today most disks use 4 KiB sectors, and many SSDs use 8 KiB sectors. In ZFS the sector size can be set manually through the property called ashift.
Mathematically, ashift is a binary exponent representing the sector size. With ashift set to 9, the sector size is 2^9, i.e. 512 bytes.
When a new vdev is added, ZFS in theory queries the operating system for the details of the new block devices and sets ashift automatically based on that information. Unfortunately, there are still disks that report misleading sector sizes in order to stay compatible with Windows XP, an operating system that is by now around 18 years old.
In such cases, the person managing the ZFS configuration needs to know the real sector size and set ashift accordingly. If ashift is set much lower than it should be, the number of physical reads and writes rises astronomically.
Writing 512-byte "sectors" to a device whose real sector size is 4 KiB means writing the first piece, then reading the whole 4 KiB sector back, modifying it, and rewriting it for the next piece, and so on for every 512-byte write: a classic read-modify-write penalty.
To give a real-world example, this kind of mismatched tuning hurts even Samsung EVO SSDs, which should use ashift=13 but misreport their sector size, very badly.
Unless it is overridden by the ZFS administrator, such a disk ends up with ashift=9 and is punished with so much write amplification that it appears slower than a plain hard drive.
Contrary to popular belief, setting ashift too high does little harm. There is no real performance penalty, and the increase in free-space overhead is tiny, usually zero if compression is enabled. Even if your disks genuinely use 512-byte sectors, setting ashift=12, or even 13 for future-proofing, is recommended.
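A minimal sketch of pinning ashift explicitly at pool-creation time (device names hypothetical):

    # ashift=12 means 2^12 = 4096-byte sectors
    zpool create -o ashift=12 tank mirror /dev/sda /dev/sdb

    # the pool property can be read back...
    zpool get ashift tank

    # ...and the value baked into each vdev is visible in the pool configuration
    zdb -C tank | grep ashift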
Ashift is not per pool, as one might assume, but per vdev. If you get it wrong while adding a vdev to a pool, you have irreversibly contaminated that pool with a low-performing virtual device.
In that case there is usually no option but to destroy the pool and rebuild it. Even removing the vdev will not save you from a badly wrong ashift setting.
Copy-on-Write Structure
In fact, one of the main things underlying the elegance of the ZFS file system is its copy-on-write structure. The concept itself is quite simple.
In an ordinary traditional file system, when you modify a file, the data is changed in place. A copy-on-write file system also tells you it has changed the data in place, but that is not what actually happens.
Instead, the new version of the block you changed is written to free space, then the metadata is updated so that the link to the old block is removed and the newly written block is linked in.
This cannot be interrupted half-way, because unlinking the old block and linking the new one happens in a single atomic operation.
If the power goes out just before that operation completes, you simply keep the old version of the block; either way, the file system remains consistent.
In ZFS, copy-on-write is used not only at the file-system layer but also at the disk-management layer. This means the partial-write corruption and inconsistency that a conventional RAID can suffer when the system crashes mid-write do not affect ZFS after a reboot. The on-disk state is always consistent.
ZIL – ZFS Intent Log
There are two kinds of write operations in ZFS. Measured by workload, most are asynchronous: the file system gathers them up and commits them in batches, which significantly reduces fragmentation and increases throughput.
Synchronous writes are a different matter entirely. When an application issues a sync write, it is asking the file system to commit the data to non-volatile storage and to block any further progress until that is done. Synchronous writes must therefore be committed to disk immediately, whatever the cost in throughput or fragmentation.
ZFS handles synchronous writes differently from classic file systems. As hinted above, synchronously written data is not immediately committed to its final location in normal storage; instead it is first committed to a special area called the ZFS Intent Log (ZIL).
One point to note here is that these writes are also kept in memory; from there they are later flushed to main storage as normal TXGs (transaction groups), together with the asynchronous writes.
In normal operation the ZIL is written to and never read back. After the writes recorded in the ZIL have been committed to main storage from RAM as part of a normal TXG, they are unlinked from the ZIL. The ZIL is only read while a pool is being imported.
If ZFS is interrupted by a system error or power failure, the data in the ZIL is read back during the next pool import (for example after a reboot), collected into a TXG, committed to main storage, and only then unlinked from the ZIL.
A support vdev class exists for exactly this: the SLOG (Secondary LOG) device, also known simply as the LOG vdev. Its only function is to give the pool a separate vdev on which to store the ZIL, instead of keeping the ZIL on the main storage vdevs.
This does not change how the ZIL behaves: it works the same way whether it lives on the main storage vdevs or on a dedicated LOG vdev. But if the LOG vdev is a very fast device, synchronous writes complete much more quickly.
Adding a LOG vdev to a pool does not directly speed up asynchronous writes; even if you force every write through the ZIL with zfs set sync=always, the data is still committed to main storage in TXGs exactly as before. The direct benefit is lower latency on synchronous writes: because the LOG device is fast, sync calls return much sooner.
In an environment with a lot of synchronous writes, a LOG vdev can also indirectly speed up uncached reads and asynchronous writes: moving ZIL writes onto a separate LOG vdev takes some IOPS load off the main storage, so overall read and write performance improves somewhat.
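A short sketch of the knobs involved (pool and dataset names invented); the LOG vdev itself was shown being added in the vdev section above:

    # per-dataset sync policy: standard (default), always, or disabled
    zfs set sync=always tank/databases
    zfs get sync tank/databases

    # watch the LOG vdev absorbing synchronous writes
    zpool iostat -v tank 5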
Snapshots
Copy-on-write is also the foundation of ZFS snapshots and incremental asynchronous replication. A live file system has a pointer tree marking the location of every data record; taking a snapshot simply makes another copy of that pointer tree.
When an existing record in the live file system is overwritten, ZFS writes the new data block to unused space and then unlinks the older version of the block from the live file system.
If a snapshot still points to the old block, that block stays exactly where it is. It is not actually freed until every snapshot referencing it has been destroyed.
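A minimal sketch of everyday snapshot use (names invented):

    # take a snapshot of a dataset
    zfs snapshot pool/data@before-upgrade

    # list snapshots along with the space held only by them
    zfs list -t snapshot -o name,used,referenced

    # roll the dataset back if something goes wrong
    zfs rollback pool/data@before-upgrade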
Replication
Having discussed snapshots under the previous heading, this is a good point to explain what replication is. As mentioned, a snapshot is really just a pointer tree over a set of records, so when you run zfs send on a snapshot, both that tree and the records it points to are transmitted.
When the stream reaches the target and is received successfully, the actual block contents and the pointer tree that references them are written into a dataset on the target.
It becomes really interesting when a second zfs send is done. Suppose you have two systems: after sending the snapshot poolname/datasetname@1 to the target, you take a new snapshot, poolname/datasetname@2, on the source. The target now holds only datasetname@1, while the source pool holds both datasetname@1 and datasetname@2. Since source and target share a common snapshot (datasetname@1), an incremental zfs send can be built on top of it: zfs send -i poolname/datasetname@1 poolname/datasetname@2.
When that command runs, the two pointer trees are compared. Only the pointers that exist solely in @2 reference new blocks, so only the contents of those blocks need to be transmitted.
On a remote system, an incremental stream is piped and received just like a full send. The new records in the stream are written first, then the block pointers are added, and at that point @2 is ready to use on the target system.
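A hedged sketch of the whole pipeline over SSH (host and dataset names invented):

    # initial full replication of the first snapshot
    zfs send poolname/datasetname@1 | ssh backuphost zfs recv backuppool/datasetname

    # later: send only the blocks that changed between @1 and @2
    zfs send -i poolname/datasetname@1 poolname/datasetname@2 | ssh backuphost zfs recv backuppool/datasetname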
ZFS asynchronous incremental replication is a far more advanced technique than older, non-snapshot-based tools such as rsync. In both cases, essentially only changed data travels over the wire.
The difference is that rsync has to read the data on both sides in order to checksum and compare it, whereas ZFS replication only has to compare the pointer trees; it does not need to read anything beyond the blocks it is actually going to send.
Compression (Inline Compression)
Copy-on-write does not just enable the features covered in the previous sections; it also makes inline compression easier and more effective.
Compression is awkward on traditional file systems, where modifications happen in place: the modified data has to fit into exactly the same space as the old data.
Consider a file that contains a run of 1 MiB of zeroes (0x00000000) somewhere in the middle: that run compresses trivially, down to a single disk sector.
But what happens if we replace that 1 MiB of zeroes with incompressible data such as JPEG content or random noise? That 1 MiB of data needs 256 4 KiB sectors, yet the compressed "hole" in the middle of the file occupies only a single sector on disk, so the new data cannot possibly fit in place.
ZFS has no such problem, because changed records are always written to unused space. The original block occupies only a single 4 KiB sector, the newly written record occupies 256 of them, and that is perfectly fine.
The modified chunk of data in the "middle" of the file is simply written to unused space, whether its size changed or not. For ZFS this is just another day at the office.
By default, inline compression is off in ZFS. ZLE, gzip (levels 1-9), and LZ4 are supported, alongside the original LZJB.
- LZ4: A very fast compression and decompression algorithm that remains efficient even on weak processors.
- GZIP: A compression algorithm familiar to almost every Linux and Unix-like operating system user. It has levels 1 through 9, and both the compression ratio and the CPU usage rise as the level increases. GZIP can pay off handsomely on text-based data, but on other kinds of data the higher levels can easily run into CPU bottlenecks, so use them with care.
- LZJB: The original compression algorithm used by ZFS. It is now deprecated and not recommended; LZ4 outperforms it in every respect.
- ZLE (Zero Level Encoding): An algorithm that leaves normal data untouched and compresses only long runs of zeroes. It effectively ignores incompressible datasets (MP4, JPEG, or data already compressed with a stronger algorithm) and only squeezes out the blank space found in records.
Of these algorithms we recommend LZ4 for practically any purpose. When LZ4 runs into incompressible data the performance penalty is tiny, and on typical data the performance gain is what matters.
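A minimal sketch of turning compression on and checking what it achieves (dataset name invented; only newly written blocks are affected):

    # enable LZ4 on a dataset
    zfs set compression=lz4 pool/data

    # inspect the algorithm in use and the achieved ratio
    zfs get compression,compressratio pool/data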
In a copy test performed in 2015 on a VM image containing a fresh Windows installation (only the operating system, no applications), LZ4 compression made the copy about 27 percent faster.
ARC (Adaptive Replacement Cache)
ZFS is the only modern file system that does not rely on the operating system's page cache to keep copies of recently read blocks in RAM; it caches them in memory itself, using its own mechanism, the ARC.
This caching mechanism is sometimes criticized, because ZFS cannot react as quickly as the kernel itself when its dedicated memory needs to be reclaimed: new allocations may fail for a while if the memory held by the ARC is needed elsewhere. But there is a good reason ZFS does it this way.
Practically every well-known operating system (including Windows, BSD, Linux, and macOS) implements its page cache with the LRU algorithm. In LRU, each time a cached block is read it moves to the top of the queue, and blocks at the bottom of the queue are pushed out to make room for new entries.
For a typical user this does not matter much, but on a large system churning through a lot of data it does: LRU can evict the most frequently used blocks in favour of blocks that will never be read from the cache again.
The ARC is a less naive algorithm. Essentially, the more often cached blocks are read, the harder they become to evict from the cache.
Even after a block is evicted, it is tracked for a while; if an evicted block turns out to be needed and is read again, it likewise becomes harder to evict in the future.
In the end, comparing reads served from cache against reads that have to go to disk, the cache hit rate is generally higher. On large systems this matters a great deal.
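On Linux with OpenZFS the hit rate can be read straight from the kernel counters; a rough sketch (field names can vary between versions, and the arcstat tool may be packaged separately):

    # raw ARC hit/miss counters exported by the kernel module
    grep -E '^(hits|misses) ' /proc/spl/kstat/zfs/arcstats

    # rolling summary every 5 seconds, if the arcstat utility is installed
    arcstat 5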
The higher the hit rate, the fewer concurrent requests reach the disk, which also means much lower latency for the requests that do. A less busy disk can serve the requests that cannot be cached faster and more efficiently.
If you have read this article to the end, you now have the basics of the ZFS file system: its copy-on-write design, vdevs, zpools, datasets, sectors, and blocks.