[LSF/MM TOPIC] Storage: SMR drives

"Theodore Ts'o" <tytso@xxxxxxx> · Wed, 08 Jan 2014 11:35:25 -0500

I'd like to have a conversation about how we can better support future
Shingled Magnetic Recording (SMR) hard drives.  One area for which I'd
like to work with people in the block device layer is how to expose the
proposed extensions to the SCSI protocol to better support SMR drives[1].

[1]  http://www.digitalpreservation.gov/meetings/documents/storage13/DaveAnderson_Standards.pdf

I believe we need to make these extensions available both to underlying
file systems and to userspace programs accessing the block device
directly.  This includes how we expose information about the geometry of
the SMR zones, how to get the current location of an SMR zone's write
pointer, how to reset a zone's write pointer (thus clearing all of the
previously written data to that zone), etc.

It may also be interesting to consider how we might expose the nature of
the SMR device to userspace programs which are accessing the device
through a file system, which may itself have been modified to be more
SMR friendly.

I've enclosed some initial thoughts on how to modify
ext4 to make its metadata write patterns much more SMR friendly.  If
this is combined with changes to ext4's block allocation algorithms, it
should be possible to turn ext4 into an SMR-ready file system with
relatively little effort.

Cheers,

                                                - Ted

                     SMR-Friendly Journal for Ext4
                              Version 0.11
                            January 9, 2014

Goal
====

The goal is to make the write patterns used by the ext4 journal and its
metadata more friendly for hard drives using Shingled Magnetic Recording
(SMR) by significantly reducing random writes seen by the SMR drive.  It
is primarily targeting drives which provide either Drive-Managed
or Cooperatively Managed SMR.

By removing the need for random writes, this proposal can also improve
the performance of ext4 on flash storage devices that have a more
simplistic Flash Translation Layer (FTL), such as those found on SD and
eMMC devices.

Non-Goals
---------

This proposal does not address how data blocks are allocated.

Nor does it address files which are modified after they are first
created (i.e., a random read/write workload); we assume here that for
many use cases, files which are modified after creation using a random
write pattern are rarer than files which are written once and then not
modified until they are replaced or deleted.

Background
==========

Shingled Magnetic Recording
---------------------------

Drives using SMR technology (sometimes called shingled drives) are
broken up into zones or bands, which will typically be 32-256 MB in
size[1].  Each band has a write pointer, and it is possible to write to
each band by appending to it, but once written, it cannot be rewritten,
except by resetting the write pointer to the beginning of the band and
erasing the contents of the entire band.

[1] Storage systems for Shingled Disks, Garth Gibson, SDC 2012
presentation.
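
As a rough illustration of the band model described above, here is a toy
in-memory sketch.  The class and method names (Zone, append, write_at,
reset) are hypothetical and are not taken from any proposed SCSI
extension; the sketch also models the strictest case, where writes that
do not land on the write pointer are rejected outright, which a
Drive-Managed device would not do.

```python
class Zone:
    """Toy model of one SMR zone (band): append-only with a write pointer."""

    def __init__(self, start_block, num_blocks):
        self.start_block = start_block      # first block of the zone
        self.num_blocks = num_blocks        # zone capacity in blocks
        self.write_pointer = start_block    # next block that may be written
        self.blocks = {}                    # block number -> data

    def append(self, data):
        """Write one block at the write pointer and advance it."""
        if self.write_pointer >= self.start_block + self.num_blocks:
            raise IOError("zone is full; reset it before writing again")
        blk = self.write_pointer
        self.blocks[blk] = data
        self.write_pointer += 1
        return blk

    def write_at(self, blk, data):
        """Only a write that lands exactly on the write pointer is allowed."""
        if blk != self.write_pointer:
            raise IOError("non-sequential write rejected")
        return self.append(data)

    def reset(self):
        """Rewind the write pointer, discarding everything in the zone."""
        self.blocks.clear()
        self.write_pointer = self.start_block
```

Rewriting a single block in the middle of a zone therefore means
resetting the zone and appending all of its contents again, which is why
random metadata writes are so costly on this kind of device.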

For more details about why drive vendors are moving to SMR, and details
regarding the different access models that have been proposed for SMR drives,
please see [2].

[2] Shingled Magnetic Recording: Areal Density Increase Requires New
Data Management, by Tim Feldman and Garth Gibson, ;login:, June 2013,
Vol. 38, No. 3, pg. 22.

The Ext4 Journal
----------------

The ext4 file system uses a physical block journal.  This means when a
metadata block is modified, the entire metadata block is written to the
journal before the transaction is committed.  Until the transaction is
committed, the block must not be written to its final location on disk.
Once the commit block is written, then dirty metadata blocks may get
written back to disk by Linux's buffer cache, which manages the
writeback of dirty buffers.

The journal is treated as a circular buffer, with modified metadata
blocks and commit blocks appended to the end of the circular buffer.
When all of the blocks associated with the oldest commit in the
journal have been written back to disk, that commit can be retired, and
the journal superblock can be updated to move the pointer to the head of
the journal to the first commit that still has dirty buffers associated
with it which are pending writeback.  (The process of retiring the
oldest commits is called "checkpointing" in the ext4 journal
implementation.)

To recover from a system crash, the kernel or the file system
consistency check program starts from the beginning of the journal,
writing blocks found in the journal to their appropriate location on
disk.
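
The replay step can be sketched roughly as follows.  This is a toy model,
not the jbd2 on-disk format: the record tuples and the replay() function
are hypothetical, revoke records and checksums are ignored, and the
sketch assumes that blocks belonging to a transaction with no commit
record are discarded.

```python
def replay(journal, disk):
    """Apply every fully committed transaction, oldest first.

    journal: list of records, each ("block", fs_block, data) or ("commit",).
    disk:    dict mapping file system block number -> data (updated in place).
    """
    pending = {}
    for record in journal:
        if record[0] == "block":
            _, fs_block, data = record
            pending[fs_block] = data
        elif record[0] == "commit":
            disk.update(pending)   # the transaction is durable: apply it
            pending = {}
    # Anything still in `pending` belongs to a transaction that never
    # committed, so it is deliberately dropped.
    return disk
```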

For more information about the ext4 journal, please see [3].

[3]  "Journaling the Linux ext2fs Filesystem," by Stephen Tweedie, in
the Proceedings of Linux Expo '98.

Design
======

The key insight in making ext4's metadata updates more SMR friendly is
that the writes to the journal are already ideal from the perspective of
a shingled disk, or of a flash device with a simplistic FTL such as
those found in many eMMC devices in mobile handsets.  It is the writes
that happen after the journal commit (the updates to the allocation
bitmaps, the inode table, and directory blocks) that are random, and
therefore less optimal from the perspective of a Flash Translation Layer
or the SMR drive's management layer.  So we apply the Smith and Dale
technique[4]:

        Patient: Doctor, it hurts when I do _this_.
        Doctor Kronkheit: Don't _do_ that.

[4] Doctor Kronkheit and His Only Living Patient, Joe Smith and
Charlie Dale, 1920's American vaudeville comedy team.

The simplest implementation of this design does not require making any
on-disk format changes.  We simply suppress the writeback of the dirty
metadata block to the file system.  Instead we keep a journal map in
memory, which maps metadata block numbers (or data block numbers if data
journaling is enabled) to a block number in the journal.

The journal is not truncated when the file system is unmounted, and so
there is no difference between mounting a file system which has been
cleanly unmounted or after a system crash.  In both cases, the ext4 file
system will scan the journal, and create an in-memory data structure
which maps metadata block locations to their location in the journal.
When a metadata block (or a data block, if data journaling is enabled)
needs to be read, if the block number is found in the journal map, the
block is read from the journal instead of from its "real" location on
disk.
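
A minimal sketch of the journal map and the redirected read path,
assuming a toy journal represented as a Python list in which the list
index stands in for the journal block number.  The function names
(build_journal_map, read_block) and the record layout are hypothetical;
the sketch also assumes that blocks from a transaction without a commit
record are not entered into the map.

```python
def build_journal_map(journal):
    """Scan the journal at mount time and map fs block -> journal block.

    journal: list of records, each ("block", fs_block, data) or ("commit",).
    Later committed copies of a block override earlier ones.
    """
    journal_map = {}
    pending = {}
    for jblk, record in enumerate(journal):
        if record[0] == "block":
            pending[record[1]] = jblk
        elif record[0] == "commit":
            journal_map.update(pending)
            pending = {}
    return journal_map


def read_block(fs_block, journal_map, journal, disk):
    """Read from the journal if the block is mapped, else from its home."""
    jblk = journal_map.get(fs_block)
    if jblk is not None:
        return journal[jblk][2]
    return disk[fs_block]
```

Because the map is rebuilt from the journal on every mount, nothing
about it needs to be stored on disk, which is consistent with the claim
that the simplest implementation needs no format changes.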

Eventually, we will run out of room in the journal, and so we will need
to retire commits from the head of the journal.  For each block
referenced in the commit at the head of the journal, if it has since
been updated in a newer commit, then no action will be needed.  For a
block that has not been updated in a newer commit, there are two
choices.   The checkpoint operation could either copy the block to the
tail of the journal, or write the block back to its final / "permanent"
location on disk.   The latter is preferable if it is unlikely that the
block will be needed again, or if space is needed in the journal for other
metadata blocks.   On the other hand, writing the block to the final
location on disk will entail a random write, which will be especially
expensive on SMR disks.  Some experimentation may be needed to determine
the best heuristics to use.
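
The checkpoint decision described above might look roughly like this.
It is only a sketch: commits are modeled as (commit_id, {fs_block: data})
pairs, the journal map records the ID of the newest commit holding each
block, and keep_in_journal is a hypothetical stand-in for whatever
heuristic experimentation ends up favoring.

```python
from collections import deque


def retire_oldest(commits, journal_map, disk, next_id, keep_in_journal):
    """Retire the commit at the head of the journal.

    commits:         deque of (commit_id, {fs_block: data}), oldest first.
    journal_map:     dict fs_block -> commit_id holding the newest copy.
    disk:            dict fs_block -> data at the permanent location.
    next_id:         commit ID to use if blocks are copied to the tail.
    keep_in_journal: callable(fs_block) -> bool, the placement heuristic.
    Returns the blocks that were copied to the tail of the journal.
    """
    commit_id, blocks = commits.popleft()
    carried = {}
    for fs_block, data in blocks.items():
        if journal_map.get(fs_block) != commit_id:
            continue                    # superseded by a newer commit
        if keep_in_journal(fs_block):
            carried[fs_block] = data    # sequential write to the tail
        else:
            disk[fs_block] = data       # random write to the final location
            del journal_map[fs_block]
    if carried:
        commits.append((next_id, carried))
        for fs_block in carried:
            journal_map[fs_block] = next_id
    return carried
```

Only the else branch produces a random write, so a heuristic that keeps
frequently updated blocks in the journal trades journal space for fewer
expensive writes on an SMR disk.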

Avoiding Updating the Journal Superblock
----------------------------------------

The basic scheme described above does not require any format
changes.   However, while it eliminates most of the random writes
associated with the file system metadata, the journal superblock must be
updated each time the journal layer performs a "checkpoint" operation to
retire the oldest commits from the head of the journal, so that the
starting point of the journal can be identified.

This can be avoided by modifying the commit block to include the head of
the journal at the time of the commit, and then by requiring that the first
block of each zone must be a jbd2 control block.  Since each control
block contains the sequence number, the mount operation simply needs to
scan the first block in each zone to find the control block with the
highest commit ID, and then parse the journal until the last valid
commit block is found.  Once the tail of the journal has been
identified, the last commit block will contain a pointer to the head of
the journal.
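
The mount-time scan could be sketched as below.  Everything here is an
assumption made for illustration: the journal is a flat Python list, a
zone is zone_size consecutive entries, unwritten blocks are None, and
each block is a dict with a "kind" field ("descriptor", "commit", or
"data"), a "seq" field on control blocks, and a "head" field on commit
blocks.  The sketch also assumes that at least one commit block follows
the chosen zone start; a real implementation would need a fallback when
the newest zone holds only an uncommitted transaction.

```python
def find_journal_head(journal, zone_size):
    """Locate the head of the journal without reading a journal superblock."""
    # Step 1: look only at the first block of each zone and pick the
    # control block with the highest sequence number.
    best_start, best_seq = None, -1
    for start in range(0, len(journal), zone_size):
        blk = journal[start]
        if blk is not None and blk["kind"] != "data" and blk["seq"] > best_seq:
            best_start, best_seq = start, blk["seq"]
    if best_start is None:
        return None                     # nothing written yet

    # Step 2: parse forward (wrapping around) to the last valid commit.
    last_commit = None
    seq = best_seq
    for step in range(len(journal)):
        blk = journal[(best_start + step) % len(journal)]
        if blk is None:
            break                       # reached unwritten space
        if blk["kind"] == "data":
            continue                    # data blocks carry no sequence number
        if blk["seq"] < seq:
            break                       # stale block from an earlier pass
        seq = blk["seq"]
        if blk["kind"] == "commit":
            last_commit = blk

    # Step 3: the last commit block records where the head was.
    return None if last_commit is None else last_commit["head"]
```

The cost of the first step is one read per zone rather than a scan of
the whole journal, which is the point of requiring a control block at
the start of every zone.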

Applicability to other storage technologies
===========================================

This scheme was originally designed to improve ext4's performance on SMR
devices.  However, it may also be helpful for flash-based devices, since
it reduces the write load caused by metadata blocks: very often a
particular metadata block will be updated in multiple commits.
Even on a hard drive, the reduction in writes and seek traffic may be
worthwhile.

Although we will need to benchmark this new scheme, this modified
journaling scheme should be at least as efficient as the current
mechanism used in the ext4/jbd2 implementation.  If this is true, it may
make sense to make this the default.

Other advantages
================

One other advantage of this scheme is that it enables true read-only
mounts and file system consistency checks.  Currently, if the system
had been uncleanly shut down, the journal must be replayed before the
file system can be mounted, even if it is being mounted read-only.

Similarly, it is currently not possible to run a read-only file system
consistency check.  Even if e2fsck is run with the -n option, if
the journal has not been truncated as part of a clean unmount, the
device must be opened in read/write mode to replay the journal before
the consistency check can proceed.  There are situations, such as
after a file system has been marked as containing inconsistencies and
there is some question about whether this was caused by hardware
errors, where avoiding any changes to the disk might help reduce
further data loss caused by continuing hardware problems.

Conclusion
==========

In this proposal we have examined a proposed modification to ext4's
journalling system which should significantly reduce the number of
random writes to the storage device, thus making ext4 much more SMR
friendly.  Essentially, we transform the ext4 journal into a construct
which functions much like a log-structured file system, at least for
metadata updates.  Since the changes required to the ext4 file system
are minimal, this should allow rapid deployment of a modified file
system which can much more efficiently support SMR drives.

Acknowledgements
================

Thanks to Andreas Dilger and Lukas Czerner who have reviewed and made
valuable suggestions and comments on previous versions of this design
idea, on the weekly ext4 conference call, as well as on the ext4 mailing
list.
