I'd like to have a conversation about how we can better support future
Shingled Magnetic Recording (SMR) hard drives. One area on which I'd
like to work with people in the block device layer is how to expose the
proposed extensions to the SCSI protocol to better support SMR
drives[1].

[1] http://www.digitalpreservation.gov/meetings/documents/storage13/DaveAnderson_Standards.pdf

I believe we need to make these extensions available both to underlying
file systems and to userspace programs accessing the block device
directly. This includes how we expose information about the geometry
of the SMR zones, how to get the current location of an SMR zone's
write pointer, how to reset a zone's write pointer (thus clearing all
of the data previously written to that zone), etc.

It may also be interesting to consider how we might expose the nature
of the SMR device to userspace programs which are accessing the device
through a file system, which may itself have been modified to make it
more SMR friendly.

I've enclosed some initial thinking that I've done about how to modify
ext4 to make its metadata write patterns much more SMR friendly. If
this is combined with changes to ext4's block allocation algorithms, it
should be possible to turn ext4 into an SMR-ready file system with
relatively little effort.

Cheers,

					- Ted

SMR-Friendly Journal for Ext4
Version 0.11
January 9, 2014

Goal
====

The goal is to make the write patterns used by the ext4 journal and its
metadata more friendly for hard drives using Shingled Magnetic
Recording (SMR) by significantly reducing the random writes seen by the
SMR drive. It primarily targets drives which provide either
Drive-Managed or Cooperatively Managed SMR.

By removing the need for random writes, this proposal can also improve
the performance of ext4 on flash storage devices that have a more
simplistic Flash Translation Layer (FTL), such as those found on SD and
eMMC devices.

Non-Goals
---------

This proposal does not address how data blocks are allocated. Nor does
it address files which are modified after they are first created (i.e.,
a random read/write workload); we assume here that for many use cases,
files which are modified after they are first created using a random
write pattern are rarer than files which are written once and then not
modified until they are replaced or deleted.

Background
==========

Shingled Magnetic Recording
---------------------------

Drives using SMR technology (sometimes called shingled drives) are
broken up into zones or bands, which will typically be 32-256 MB in
size[1]. Each band has a write pointer, and it is possible to write to
each band by appending to it, but once written, it cannot be rewritten,
except by resetting the write pointer to the beginning of the band and
erasing the contents of the entire band.

[1] Storage systems for Shingled Disks, Garth Gibson, SDC 2012
presentation.

For more details about why drive vendors are moving to SMR, and details
regarding the different access models that have been proposed for SMR
drives, please see [2].

[2] Shingled Magnetic Recording: Areal Density Increase Requires New
Data Management, by Tim Feldman and Garth Gibson, ;login:, June 2013,
Vol. 38, No. 3, pg. 22.

The Ext4 Journal
----------------

The ext4 file system uses a physical block journal. This means when a
metadata block is modified, the entire metadata block is written to the
journal before the transaction is committed. Before the transaction is
committed, the block may not be written to its final location on disk.
Once the commit block is written, then dirty metadata blocks may get
written back to disk by Linux's buffer cache, which manages the
writeback of dirty buffers.

The journal is treated as a circular buffer, with modified metadata
blocks and commit blocks appended to the end of the circular buffer.
When all of the blocks associated with the oldest commit in the journal
have been written back to disk, the commit can be retired, and the
journal superblock can be updated to move the pointer to the head of
the journal to the first commit that still has dirty buffers associated
with it which are pending writeback. (The process of retiring the
oldest commits is called "checkpointing" in the ext4 journal
implementation.)

To recover from a system crash, the kernel or the file system
consistency check program starts from the beginning of the journal,
writing blocks found in the journal to their appropriate location on
disk.

For more information about the ext4 journal, please see [3].

[3] "Journaling the Linux ext2fs Filesystem," by Stephen Tweedie, in
the Proceedings of Linux Expo '98.

Design
======

The key insight in making ext4's metadata updates more SMR friendly is
that the writes to the journal are ideal from the perspective of writes
to a shingled disk, or to a flash device with a simplistic FTL, such as
those found on many eMMC devices in mobile handsets. It is the writes
after the journal commit, when the updates to the allocation bitmaps,
the inode table, and directory blocks are written back to their final
locations, that are random and thus less optimal from the perspective
of a Flash Translation Layer or the SMR drive's management layer. So
we apply the Smith and Dale technique[4]:

Patient: Doctor, it hurts when I do _this_.
Doctor Kronkheit: Don't _do_ that.

[4] Doctor Kronkheit and His Only Living Patient, Joe Smith and Charlie
Dale, 1920's American vaudeville comedy team.

The simplest implementation of this design does not require making any
on-disk format changes. We simply suppress the writeback of the dirty
metadata blocks to the file system. Instead we keep a journal map in
memory, which maps metadata block numbers (or data block numbers if
data journaling is enabled) to a block number in the journal.

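
As a rough illustration only, the journal map could be modeled along
the following lines. This is a Python sketch rather than jbd2 code; the
names JournalMap, record, and lookup are made up for this example, and
it assumes that every logged block is tagged with the ID of the commit
that wrote it, so that a newer copy of a block supersedes an older one.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class JournalMap:
    """Hypothetical in-memory map: fs block number -> newest journal copy."""

    # fs_block -> (commit_id, journal_block); layout is an assumption
    entries: Dict[int, Tuple[int, int]] = field(default_factory=dict)

    def record(self, commit_id: int, fs_block: int, journal_block: int) -> None:
        """Note that commit `commit_id` logged `fs_block` at `journal_block`."""
        current = self.entries.get(fs_block)
        # A copy from a newer (or the same) commit replaces the older one.
        if current is None or commit_id >= current[0]:
            self.entries[fs_block] = (commit_id, journal_block)

    def lookup(self, fs_block: int) -> Optional[int]:
        """Return the journal block holding the newest copy, or None."""
        entry = self.entries.get(fs_block)
        return None if entry is None else entry[1]


jmap = JournalMap()
jmap.record(commit_id=1, fs_block=1000, journal_block=5)
jmap.record(commit_id=2, fs_block=1000, journal_block=9)  # supersedes commit 1
```

A lookup that returns None means the block has no copy in the journal,
so the caller would fall back to the block's normal location on disk.
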
The journal is not truncated when the file system is unmounted, and so
there is no difference between mounting a file system which has been
cleanly unmounted and mounting one after a system crash. In both
cases, the ext4 file system will scan the journal, and create an
in-memory data structure which maps metadata block locations to their
location in the journal. When a metadata block (or a data block, if
data journaling is enabled) needs to be read, if the block number is
found in the journal map, the block is read from the journal instead of
from its "real" location on disk.

Eventually, we will run out of room in the journal, and so we will need
to retire commits from the head of the journal. For each block
referenced in the commit at the head of the journal, if it has since
been updated in a newer commit, then no action will be needed. For a
block that has not been updated in a newer commit, there are two
choices. The checkpoint operation could either copy the block to the
tail of the journal, or write the block back to its final /
"permanent" location on disk. The latter is preferable if it is
unlikely that the block will be needed again, or if space is needed in
the journal for other metadata blocks. On the other hand, writing the
block to its final location on disk will entail a random write, which
will be especially expensive on SMR disks. Some experimentation may be
needed to determine the best heuristics to use.

Avoiding Updating the Journal Superblock
----------------------------------------

The basic scheme described above does not require any format changes.
However, while it eliminates most of the random writes associated with
the file system metadata, the journal superblock must be updated each
time the journal layer performs a "checkpoint" operation to retire the
oldest commits from the head of the journal, so that the starting point
of the journal can be identified.

This can be avoided by modifying the commit block to include the head
of the journal at the time of the commit, and then by requiring that
the first block of each zone be a jbd2 control block. Since each
control block contains the sequence number, the mount operation simply
needs to scan the first block in each zone to find the control block
with the highest commit ID, and then parse the journal until the last
valid commit block is found. Once the tail of the journal has been
identified, the last commit block will contain a pointer to the head of
the journal.

Applicability to other storage technologies
===========================================

This scheme was originally designed to improve ext4's performance on
SMR devices. However, it may also be helpful for flash based devices,
since it reduces the write load caused by metadata blocks: very often a
particular metadata block will be updated in multiple commits. Even on
a hard drive, the reduction in writes and seek traffic may be
worthwhile.

Although we will need to benchmark this new scheme, this modified
journaling scheme should be at least as efficient as the current
mechanism used in the ext4/jbd2 implementation. If this is true, it
may make sense to make this the default.

Other advantages
================

One other advantage of this scheme is that it enables true read-only
mounts and file system consistency checks. Currently, if the system
had been uncleanly shut down, the journal must be replayed before the
file system can be mounted, even if it is being mounted read-only.

Similarly, it is currently not possible to run a read-only file system
consistency check. Even if e2fsck is run with the -n option, if the
journal has not been truncated as part of a clean unmount, the device
must be opened in read/write mode to replay the journal before the
consistency check can proceed.

There are situations, such as after a file system has been marked as
containing inconsistencies and there is some question about whether
this was caused by hardware errors, where avoiding any changes to the
disk might help reduce further data loss caused by continuing hardware
problems.

Conclusion
==========

In this proposal we have examined a proposed modification to ext4's
journaling system which should significantly reduce the number of
random writes to the storage device, thus making ext4 much more SMR
friendly. Essentially, we transform the ext4 journal into a construct
which functions much like a log structured file system, at least for
metadata updates. Since the changes required to the ext4 file system
are minimal, it should allow rapid deployment of a modified file system
which can much more efficiently support SMR drives.

Acknowledgements
================

Thanks to Andreas Dilger and Lukas Czerner, who have reviewed and made
valuable suggestions and comments on previous versions of this design
idea, on the weekly ext4 conference call as well as on the ext4 mailing
list.