Re: [PATCH v6 01/20] ext4: update docs for fast commit feature

Andreas Dilger <adilger@xxxxxxxxx> · Thu, 9 Apr 2020 16:01:39 -0600

On Apr 8, 2020, at 3:55 PM, Harshad Shirwadkar <harshadshirwadkar@xxxxxxxxx> wrote:
> 
> From: Harshad Shirwadkar <harshadshirwadkar@xxxxxxxxx>
> 
> This patch series adds support for fast commits which is a simplified
> version of the scheme proposed by Park and Shin, in their paper,
> "iJournaling: Fine-Grained Journaling for Improving the Latency of
> Fsync System Call"[1]. The basic idea of fast commits is to make JBD2
> give the client file system an opportunity to perform a faster
> commit. Only if the file system cannot perform such a commit
> operation, then JBD2 should fall back to traditional commits.
> 
> 
> Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@xxxxxxxxx>

It's not clear if all of this was intended to be in a 00/20 email,
but having it (probably minus full diffstat) in Git is OK as well.

Reviewed-by: Andreas Dilger <adilger@xxxxxxxxx>

> ---
> Documentation/filesystems/ext4/journal.rst | 127 ++++++++++++++++++++-
> Documentation/filesystems/journalling.rst  |  18 +++
> 2 files changed, 139 insertions(+), 6 deletions(-)
> 
> diff --git a/Documentation/filesystems/ext4/journal.rst b/Documentation/filesystems/ext4/journal.rst
> index ea613ee701f5..f94e66f2f8c4 100644
> --- a/Documentation/filesystems/ext4/journal.rst
> +++ b/Documentation/filesystems/ext4/journal.rst
> @@ -29,10 +29,10 @@ safest. If ``data=writeback``, dirty data blocks are not flushed to the
> disk before the metadata are written to disk through the journal.
> 
> The journal inode is typically inode 8. The first 68 bytes of the
> -journal inode are replicated in the ext4 superblock. The journal itself
> -is normal (but hidden) file within the filesystem. The file usually
> -consumes an entire block group, though mke2fs tries to put it in the
> -middle of the disk.
> +journal inode are replicated in the ext4 superblock. The journal
> +itself is normal (but hidden) file within the filesystem. The file
> +usually consumes an entire block group, though mke2fs tries to put it
> +in the middle of the disk.
> 
> All fields in jbd2 are written to disk in big-endian order. This is the
> opposite of ext4.
> @@ -42,22 +42,74 @@ NOTE: Both ext4 and ocfs2 use jbd2.
> The maximum size of a journal embedded in an ext4 filesystem is 2^32
> blocks. jbd2 itself does not seem to care.
> 
> +Fast Commits
> +~~~~~~~~~~~~
> +
> +Ext4 also implements fast commits and integrates it with JBD2 journalling.
> +Fast commits store metadata changes made to the file system as inode level
> +diff. In other words, each fast commit block identifies updates made to
> +a particular inode and collectively they represent total changes made to
> +the file system.
> +
> +A fast commit is valid only if there is no full commit after that particular
> +fast commit. Because of this feature, fast commit blocks can be reused by
> +the following transactions.
> +
> +Each fast commit block stores updates to 1 particular inode. Updates in each
> +fast commit block are one of the 2 types:
> +- Data updates (add range / delete range)
> +- Directory entry updates (Add / remove links)
> +
> +Fast commit blocks must be replayed in the order in which they appear on disk.
> +That's because directory entry updates are written in fast commit blocks
> +in the order in which they are applied on the file system before crash.
> +Changing the order of replaying for directory entry updates may result
> +in inconsistent file system. Note that only directory entry updates need
> +ordering, data updates, since they apply to only one inode, do not require
> +ordered replay. Also, fast commits guarantee that file system is in consistent
> +state after replay of each fast commit block as long as order of replay has
> +been followed.
> +
> +Note that directory inode updates are never directly recorded in fast commits.
> +Just like other file system level metaata, updates to directories are always
> +implied based on directory entry updates stored in fast commit blocks.
> +
> +Based on which directory entry updates are committed with an inode, fast
> +commits have two modes of operation:
> +
> +- Hard Consistency (default)
> +- Soft Consistency (can be enabled by setting mount flag "fc_soft_consistency")
> +
> +When hard consistency is enabled, fast commit guarantees that all the updates
> +will be committed. After a successful replay of fast commits blocks
> +in hard consistency mode, the entire file system would be in the same state as
> +that when fsync() returned before crash. This guarantee is similar to what
> +jbd2 gives.
> +
> +With soft consistency, file system only guarantees consistency for the
> +inode in question. In this mode, file system will try to write as less data
> +to the backed as possible during the commit time. To be precise, file system
> +records all the data updates for the inode in question and directory updates
> +that are required for guaranteeing consistency of the inode in question.
> +
> Layout
> ~~~~~~
> 
> Generally speaking, the journal has this format:
> 
> .. list-table::
> -   :widths: 16 48 16
> +   :widths: 16 48 16 18
>    :header-rows: 1
> 
>    * - Superblock
>      - descriptor\_block (data\_blocks or revocation\_block) [more data or
>        revocations] commmit\_block
>      - [more transactions...]
> +     - [Fast commits...]
>    * -
>      - One transaction
>      -
> +     -
> 
> Notice that a transaction begins with either a descriptor and some data,
> or a block revocation list. A finished transaction always ends with a
> @@ -76,7 +128,7 @@ The journal superblock will be in the next full block after the
> superblock.
> 
> .. list-table::
> -   :widths: 12 12 12 32 12
> +   :widths: 12 12 12 32 12 12
>    :header-rows: 1
> 
>    * - 1024 bytes of padding
> @@ -85,11 +137,13 @@ superblock.
>      - descriptor\_block (data\_blocks or revocation\_block) [more data or
>        revocations] commmit\_block
>      - [more transactions...]
> +     - [Fast commits...]
>    * -
>      -
>      -
>      - One transaction
>      -
> +     -
> 
> Block Header
> ~~~~~~~~~~~~
> @@ -609,3 +663,64 @@ bytes long (but uses a full block):
>      - h\_commit\_nsec
>      - Nanoseconds component of the above timestamp.
> 
> +Fast Commit Block
> +~~~~~~~~~~~~~~~~~
> +
> +The fast commit block indicates an append to the last commit block
> +that was written to the journal. One fast commit block records updates
> +to one inode. So, typically you would find as many fast commit blocks
> +as the number of inodes that got changed since the last commit. A fast
> +commit block is valid only if there is no commit block present with
> +transaction ID greater than that of the fast commit block. If such a
> +block a present, then there is no need to replay the fast commit
> +block.
> +
> +.. list-table::
> +   :widths: 8 8 24 40
> +   :header-rows: 1
> +
> +   * - Offset
> +     - Type
> +     - Name
> +     - Descriptor
> +   * - 0x0
> +     - journal\_header\_s
> +     - (open coded)
> +     - Common block header.
> +   * - 0xC
> +     - \_\_le32
> +     - fc\_magic
> +     - Magic value which should be set to 0xE2540090. This identifies
> +       that this block is a fast commit block.
> +   * - 0x10
> +     - \_\_u8
> +     - fc\_features
> +     - Features used by this fast commit block.
> +   * - 0x11
> +     - \_\_le16
> +     - fc_num_tlvs
> +     - Number of TLVs contained in this fast commit block
> +   * - 0x13
> +     - \_\_le32
> +     - \_\_fc\_len
> +     - Length of the fast commit block in terms of number of blocks
> +   * - 0x17
> +     - \_\_le32
> +     - fc\_ino
> +     - Inode number of the inode that will be recovered using this fast commit
> +   * - 0x2B
> +     - struct ext4\_inode
> +     - inode
> +     - On-disk copy of the inode at the commit time
> +   * - <Variable based on inode size>
> +     - struct ext4\_fc\_tl
> +     - Array of struct ext4\_fc\_tl
> +     - The actual delta with the last commit. Starting at this offset,
> +       there is an array of TLVs that indicates which all extents
> +       should be present in the corresponding inode. Currently,
> +       following tags are supported: EXT4\_FC\_TAG\_EXT (extent that
> +       should be present in the inode), EXT4\_FC\_TAG\_HOLE (extent
> +       that should be removed from the inode), EXT4\_FC\_TAG\_ADD\_DENTRY
> +       (dentry that should be linked), EXT4\_FC\_TAG\_DEL\_DENTRY
> +       (dentry that should be unlinked), EXT4\_FC\_TAG\_CREATE\_DENTRY
> +       (dentry that for the file that should be created for the first time).
> diff --git a/Documentation/filesystems/journalling.rst b/Documentation/filesystems/journalling.rst
> index 58ce6b395206..1cb116ab27ab 100644
> --- a/Documentation/filesystems/journalling.rst
> +++ b/Documentation/filesystems/journalling.rst
> @@ -115,6 +115,24 @@ called after each transaction commit. You can also use
> ``transaction->t_private_list`` for attaching entries to a transaction
> that need processing when the transaction commits.
> 
> +JBD2 also allows client file systems to implement file system specific
> +commits which are called as ``fast commits``. Fast commits are
> +asynchronous in nature i.e. file systems can call their own commit
> +functions at any time. In order to avoid the race with kjournald
> +thread and other possible fast commits that may be happening in
> +parallel, file systems should first call
> +:c:func:`jbd2_start_async_fc()`. File system can call
> +:c:func:`jbd2_map_fc_buf()` to get buffers reserved for fast
> +commits. Once a fast commit is completed, file system should call
> +:c:func:`jbd2_stop_async_fc()` to indicate and unblock other
> +committers and the kjournald thread.  After performing either a fast
> +or a full commit, JBD2 calls ``journal->j_fc_cleanup_cb`` to allow
> +file systems to perform cleanups for their internal fast commit
> +related data structures. At the replay time, JBD2 passes each and
> +every fast commit block to the file system via
> +``journal->j_fc_replay_cb``. Ext4 effectively uses this fast commit
> +mechanism to improve journal commit performance.
> +
> JBD2 also provides a way to block all transaction updates via
> :c:func:`jbd2_journal_lock_updates()` /
> :c:func:`jbd2_journal_unlock_updates()`. Ext4 uses this when it wants a
> --
> 2.26.0.110.g2183baf09c-goog
> 

Cheers, Andreas

Attachment:
signature.asc

Description: Message signed with OpenPGP