Re: [PATCH v4 01/20] ext4: update docs for fast commit feature

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Xiaohui,

No worries. I think the reason we have so many patches is that this
feature requires on-disk format change and it's better to get
everything in one go, so that the size of testing matrix doesn't
become too big :)

- Harshad

On Fri, Jan 10, 2020 at 1:50 AM xiaohui li
<lixiaohui1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
>
> hi Harshad:
>
> I apologize for my some direct and improper speaking in my last email.
>
> what i want to say in my last email is that maybe an iterative
> software development method can be better for patches application.
> for the first release version, we can not do everything. it is good
> enough if we can have finished just one major function in ext4 fast
> commit field.
>
> I have known that develop work of this fast commit function is more
> difficult and more complex.
> so i am very grateful for your work on field of ext4 fast commit development. :)
>
> On Thu, Jan 9, 2020 at 12:29 PM xiaohui li
> <lixiaohui1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
> >
> > hi Harshad
> > cc ted
> >
> > sorry, but i have some idea about this fast commit which i want to
> > share with you.
> >
> > there are nearly 20 patches about this v4 fast commit , so many patches.
> > I wonder if necessary to make this fast commit function so complexly.
> >
> > maybe i have not understand the difficulty of the fast commit coding work.
> > so I appreciate it very much if you give some more detailed
> > descriptions about the patches correlationship of v4 fast commit,
> > especially the reason why need have so many patches.
> >
> > from my viewpoint, the purpose of doing this fast commit function is
> > to resolve the ext4 fsync time-cost-so-much problem.
> > firstly we need to resolve some actual customer problems which exist
> > in ext4 filesystems when doing this fast commit function.
> >
> > so the first release version of fast commit is just only to accomplish
> > the goal of reducing the time cost of fsync because of jbd2 order
> > shortcoming described in ijournal paper from my opinion.
> > it need not do so many other unnecessary things.
> >
> > if i have free time , I will review these patches continually.
> >  thank you for your reply.
> >
> >
> >
> >
> >
> > On Tue, Dec 24, 2019 at 4:14 PM Harshad Shirwadkar
> > <harshadshirwadkar@xxxxxxxxx> wrote:
> > >
> > > This patch series adds support for fast commits which is a simplified
> > > version of the scheme proposed by Park and Shin, in their paper,
> > > "iJournaling: Fine-Grained Journaling for Improving the Latency of
> > > Fsync System Call"[1]. The basic idea of fast commits is to make JBD2
> > > give the client file system an opportunity to perform a faster
> > > commit. Only if the file system cannot perform such a commit
> > > operation, then JBD2 should fall back to traditional commits.
> > >
> > > Because JBD2 operates at block granularity, for every file system
> > > metadata update it commits all the changed blocks are written to the
> > > journal at commit time. This is inefficient because updates to some
> > > blocks that JBD2 commits are derivable from some other blocks. For
> > > example, if a new extent is added to an inode, then corresponding
> > > updates to the inode table, the block bitmap, the group descriptor and
> > > the superblock can be derived based on just the extent information and
> > > the corresponding inode information. So, if we take this relationship
> > > between blocks into account and replay the journalled blocks smartly,
> > > we could increase performance of file system commits significantly.
> > >
> > > Fast commits introduced in this patch have two main contributions:
> > >
> > > (1) Making JBD2 fast commit aware, so that clients of JBD2 can
> > >     implement fast commits
> > >
> > > (2) Add support in ext4 to use JBD2's new interfaces and implement
> > >     fast commits.
> > >
> > > Ext4 supports two modes of fast commits: 1) fast commits with hard
> > > consistency guarantees 2) fast commits with soft consistency guarantees
> > >
> > > When hard consistency is enabled, fast commit guarantees that all the
> > > updates will be committed. After a successful replay of fast commits
> > > blocks in hard consistency mode, the entire file system would be in
> > > the same state as that when fsync() returned before crash. This
> > > guarantee is similar to what jbd2 gives with full commits.
> > >
> > > With soft consistency, file system only guarantees consistency for the
> > > inode in question. In this mode, file system will try to write as less
> > > data to the backend as possible during the commit time. To be precise,
> > > file system records all the data updates for the inode in question and
> > > directory updates that are required for guaranteeing consistency of the
> > > inode in question.
> > >
> > > In our evaluations, fast commits with hard consistency performed
> > > better than fast commits with soft consistency. That's because with
> > > hard consistency, a fast commit often ends up committing other inodes
> > > together, while with soft consistency commits get serialized. Future
> > > work can look at creating hybrid approach between the two extremes
> > > that are there in this patchset.
> > >
> > > Testing
> > > -------
> > >
> > > e2fsprogs was updated to set fast commit feature flag and to ignore
> > > fast commit blocks during e2fsck.
> > >
> > > https://github.com/harshadjs/e2fsprogs.git
> > >
> > > After applying all the patches in this series, following runs of
> > > xfstests were performed:
> > >
> > > - kvm-xfstest.sh -g log -c 4k
> > > - kvm-xfstests.sh smoke
> > >
> > > All the log tests were successful and smoke tests didn't introduce any
> > > additional failures.
> > >
> > > Performance Evaluation
> > > ----------------------
> > >
> > > Ext4 file system performance was tested with full commits, with fast
> > > commits with soft consistency and with fast commits with hard
> > > consistency. fs_mark benchmark showed that depending on the file size,
> > > performance improvement was seen up to 50%. Soft fast commits performed
> > > slightly worse than hard fast commits. But soft fast commits ended up
> > > writing slightly lesser number of blocks on disk.
> > >
> > > Changes since V3:
> > >
> > > - Removed invocation of fast commits from the jbd2 thread.
> > >
> > > - Removed sub transaction ID from journal_t.
> > >
> > > - Added rename, truncate, punch hole support.
> > >
> > > - Added soft consistency mode and hard consistency mode.
> > >
> > > - More bug fixes and refactoring.
> > >
> > > - Added better debugging support: more tracepoints and debug mount
> > >   options.
> > >
> > > Harshad Shirwadkar(20):
> > >  ext4: add debug mount option to test fast commit replay
> > >  ext4: add fast commit replay path
> > >  ext4: disable certain features in replay path
> > >  ext4: add idempotent helpers to manipulate bitmaps
> > >  ext4: fast commit recovery path preparation
> > >  jbd2: add fast commit recovery path support
> > >  ext4: main commit routine for fast commits
> > >  jbd2: add new APIs for commit path of fast commits
> > >  ext4: add fast commit on-disk format structs and helpers
> > >  ext4: add fast commit track points
> > >  ext4: break ext4_unlink() and ext4_link()
> > >  ext4: add inode tracking and ineligible marking routines
> > >  ext4: add directory entry tracking routines
> > >  ext4: add generic diff tracking routines and range tracking
> > >  jbd2: fast commit main commit path changes
> > >  jbd2: disable fast commits if journal is empty
> > >  jbd2: add fast commit block tracker variables
> > >  ext4, jbd2: add fast commit initialization routines
> > >  ext4: add handling for extended mount options
> > >  ext4: update docs for fast commit feature
> > >
> > >  Documentation/filesystems/ext4/journal.rst |  127 ++-
> > >  Documentation/filesystems/journalling.rst  |   18 +
> > >  fs/ext4/acl.c                              |    1 +
> > >  fs/ext4/balloc.c                           |   10 +-
> > >  fs/ext4/ext4.h                             |  127 +++
> > >  fs/ext4/ext4_jbd2.c                        | 1484 +++++++++++++++++++++++++++-
> > >  fs/ext4/ext4_jbd2.h                        |   71 ++
> > >  fs/ext4/extents.c                          |    5 +
> > >  fs/ext4/extents_status.c                   |   24 +
> > >  fs/ext4/fsync.c                            |    2 +-
> > >  fs/ext4/ialloc.c                           |  165 +++-
> > >  fs/ext4/inline.c                           |    3 +
> > >  fs/ext4/inode.c                            |   77 +-
> > >  fs/ext4/ioctl.c                            |    9 +-
> > >  fs/ext4/mballoc.c                          |  157 ++-
> > >  fs/ext4/mballoc.h                          |    2 +
> > >  fs/ext4/migrate.c                          |    1 +
> > >  fs/ext4/namei.c                            |  172 ++--
> > >  fs/ext4/super.c                            |   72 +-
> > >  fs/ext4/xattr.c                            |    6 +
> > >  fs/jbd2/commit.c                           |   61 ++
> > >  fs/jbd2/journal.c                          |  217 +++-
> > >  fs/jbd2/recovery.c                         |   67 +-
> > >  include/linux/jbd2.h                       |   83 +-
> > >  include/trace/events/ext4.h                |  208 +++-
> > >  25 files changed, 3037 insertions(+), 132 deletions(-)
> > > ---
> > >  Documentation/filesystems/ext4/journal.rst | 127 ++++++++++++++++++++-
> > >  Documentation/filesystems/journalling.rst  |  18 +++
> > >  2 files changed, 139 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/Documentation/filesystems/ext4/journal.rst b/Documentation/filesystems/ext4/journal.rst
> > > index ea613ee701f5..f94e66f2f8c4 100644
> > > --- a/Documentation/filesystems/ext4/journal.rst
> > > +++ b/Documentation/filesystems/ext4/journal.rst
> > > @@ -29,10 +29,10 @@ safest. If ``data=writeback``, dirty data blocks are not flushed to the
> > >  disk before the metadata are written to disk through the journal.
> > >
> > >  The journal inode is typically inode 8. The first 68 bytes of the
> > > -journal inode are replicated in the ext4 superblock. The journal itself
> > > -is normal (but hidden) file within the filesystem. The file usually
> > > -consumes an entire block group, though mke2fs tries to put it in the
> > > -middle of the disk.
> > > +journal inode are replicated in the ext4 superblock. The journal
> > > +itself is normal (but hidden) file within the filesystem. The file
> > > +usually consumes an entire block group, though mke2fs tries to put it
> > > +in the middle of the disk.
> > >
> > >  All fields in jbd2 are written to disk in big-endian order. This is the
> > >  opposite of ext4.
> > > @@ -42,22 +42,74 @@ NOTE: Both ext4 and ocfs2 use jbd2.
> > >  The maximum size of a journal embedded in an ext4 filesystem is 2^32
> > >  blocks. jbd2 itself does not seem to care.
> > >
> > > +Fast Commits
> > > +~~~~~~~~~~~~
> > > +
> > > +Ext4 also implements fast commits and integrates it with JBD2 journalling.
> > > +Fast commits store metadata changes made to the file system as inode level
> > > +diff. In other words, each fast commit block identifies updates made to
> > > +a particular inode and collectively they represent total changes made to
> > > +the file system.
> > > +
> > > +A fast commit is valid only if there is no full commit after that particular
> > > +fast commit. Because of this feature, fast commit blocks can be reused by
> > > +the following transactions.
> > > +
> > > +Each fast commit block stores updates to 1 particular inode. Updates in each
> > > +fast commit block are one of the 2 types:
> > > +- Data updates (add range / delete range)
> > > +- Directory entry updates (Add / remove links)
> > > +
> > > +Fast commit blocks must be replayed in the order in which they appear on disk.
> > > +That's because directory entry updates are written in fast commit blocks
> > > +in the order in which they are applied on the file system before crash.
> > > +Changing the order of replaying for directory entry updates may result
> > > +in inconsistent file system. Note that only directory entry updates need
> > > +ordering, data updates, since they apply to only one inode, do not require
> > > +ordered replay. Also, fast commits guarantee that file system is in consistent
> > > +state after replay of each fast commit block as long as order of replay has
> > > +been followed.
> > > +
> > > +Note that directory inode updates are never directly recorded in fast commits.
> > > +Just like other file system level metaata, updates to directories are always
> > > +implied based on directory entry updates stored in fast commit blocks.
> > > +
> > > +Based on which directory entry updates are committed with an inode, fast
> > > +commits have two modes of operation:
> > > +
> > > +- Hard Consistency (default)
> > > +- Soft Consistency (can be enabled by setting mount flag "fc_soft_consistency")
> > > +
> > > +When hard consistency is enabled, fast commit guarantees that all the updates
> > > +will be committed. After a successful replay of fast commits blocks
> > > +in hard consistency mode, the entire file system would be in the same state as
> > > +that when fsync() returned before crash. This guarantee is similar to what
> > > +jbd2 gives.
> > > +
> > > +With soft consistency, file system only guarantees consistency for the
> > > +inode in question. In this mode, file system will try to write as less data
> > > +to the backed as possible during the commit time. To be precise, file system
> > > +records all the data updates for the inode in question and directory updates
> > > +that are required for guaranteeing consistency of the inode in question.
> > > +
> > >  Layout
> > >  ~~~~~~
> > >
> > >  Generally speaking, the journal has this format:
> > >
> > >  .. list-table::
> > > -   :widths: 16 48 16
> > > +   :widths: 16 48 16 18
> > >     :header-rows: 1
> > >
> > >     * - Superblock
> > >       - descriptor\_block (data\_blocks or revocation\_block) [more data or
> > >         revocations] commmit\_block
> > >       - [more transactions...]
> > > +     - [Fast commits...]
> > >     * -
> > >       - One transaction
> > >       -
> > > +     -
> > >
> > >  Notice that a transaction begins with either a descriptor and some data,
> > >  or a block revocation list. A finished transaction always ends with a
> > > @@ -76,7 +128,7 @@ The journal superblock will be in the next full block after the
> > >  superblock.
> > >
> > >  .. list-table::
> > > -   :widths: 12 12 12 32 12
> > > +   :widths: 12 12 12 32 12 12
> > >     :header-rows: 1
> > >
> > >     * - 1024 bytes of padding
> > > @@ -85,11 +137,13 @@ superblock.
> > >       - descriptor\_block (data\_blocks or revocation\_block) [more data or
> > >         revocations] commmit\_block
> > >       - [more transactions...]
> > > +     - [Fast commits...]
> > >     * -
> > >       -
> > >       -
> > >       - One transaction
> > >       -
> > > +     -
> > >
> > >  Block Header
> > >  ~~~~~~~~~~~~
> > > @@ -609,3 +663,64 @@ bytes long (but uses a full block):
> > >       - h\_commit\_nsec
> > >       - Nanoseconds component of the above timestamp.
> > >
> > > +Fast Commit Block
> > > +~~~~~~~~~~~~~~~~~
> > > +
> > > +The fast commit block indicates an append to the last commit block
> > > +that was written to the journal. One fast commit block records updates
> > > +to one inode. So, typically you would find as many fast commit blocks
> > > +as the number of inodes that got changed since the last commit. A fast
> > > +commit block is valid only if there is no commit block present with
> > > +transaction ID greater than that of the fast commit block. If such a
> > > +block a present, then there is no need to replay the fast commit
> > > +block.
> > > +
> > > +.. list-table::
> > > +   :widths: 8 8 24 40
> > > +   :header-rows: 1
> > > +
> > > +   * - Offset
> > > +     - Type
> > > +     - Name
> > > +     - Descriptor
> > > +   * - 0x0
> > > +     - journal\_header\_s
> > > +     - (open coded)
> > > +     - Common block header.
> > > +   * - 0xC
> > > +     - \_\_le32
> > > +     - fc\_magic
> > > +     - Magic value which should be set to 0xE2540090. This identifies
> > > +       that this block is a fast commit block.
> > > +   * - 0x10
> > > +     - \_\_u8
> > > +     - fc\_features
> > > +     - Features used by this fast commit block.
> > > +   * - 0x11
> > > +     - \_\_le16
> > > +     - fc_num_tlvs
> > > +     - Number of TLVs contained in this fast commit block
> > > +   * - 0x13
> > > +     - \_\_le32
> > > +     - \_\_fc\_len
> > > +     - Length of the fast commit block in terms of number of blocks
> > > +   * - 0x17
> > > +     - \_\_le32
> > > +     - fc\_ino
> > > +     - Inode number of the inode that will be recovered using this fast commit
> > > +   * - 0x2B
> > > +     - struct ext4\_inode
> > > +     - inode
> > > +     - On-disk copy of the inode at the commit time
> > > +   * - <Variable based on inode size>
> > > +     - struct ext4\_fc\_tl
> > > +     - Array of struct ext4\_fc\_tl
> > > +     - The actual delta with the last commit. Starting at this offset,
> > > +       there is an array of TLVs that indicates which all extents
> > > +       should be present in the corresponding inode. Currently,
> > > +       following tags are supported: EXT4\_FC\_TAG\_EXT (extent that
> > > +       should be present in the inode), EXT4\_FC\_TAG\_HOLE (extent
> > > +       that should be removed from the inode), EXT4\_FC\_TAG\_ADD\_DENTRY
> > > +       (dentry that should be linked), EXT4\_FC\_TAG\_DEL\_DENTRY
> > > +       (dentry that should be unlinked), EXT4\_FC\_TAG\_CREATE\_DENTRY
> > > +       (dentry that for the file that should be created for the first time).
> > > diff --git a/Documentation/filesystems/journalling.rst b/Documentation/filesystems/journalling.rst
> > > index 58ce6b395206..1cb116ab27ab 100644
> > > --- a/Documentation/filesystems/journalling.rst
> > > +++ b/Documentation/filesystems/journalling.rst
> > > @@ -115,6 +115,24 @@ called after each transaction commit. You can also use
> > >  ``transaction->t_private_list`` for attaching entries to a transaction
> > >  that need processing when the transaction commits.
> > >
> > > +JBD2 also allows client file systems to implement file system specific
> > > +commits which are called as ``fast commits``. Fast commits are
> > > +asynchronous in nature i.e. file systems can call their own commit
> > > +functions at any time. In order to avoid the race with kjournald
> > > +thread and other possible fast commits that may be happening in
> > > +parallel, file systems should first call
> > > +:c:func:`jbd2_start_async_fc()`. File system can call
> > > +:c:func:`jbd2_map_fc_buf()` to get buffers reserved for fast
> > > +commits. Once a fast commit is completed, file system should call
> > > +:c:func:`jbd2_stop_async_fc()` to indicate and unblock other
> > > +committers and the kjournald thread.  After performing either a fast
> > > +or a full commit, JBD2 calls ``journal->j_fc_cleanup_cb`` to allow
> > > +file systems to perform cleanups for their internal fast commit
> > > +related data structures. At the replay time, JBD2 passes each and
> > > +every fast commit block to the file system via
> > > +``journal->j_fc_replay_cb``. Ext4 effectively uses this fast commit
> > > +mechanism to improve journal commit performance.
> > > +
> > >  JBD2 also provides a way to block all transaction updates via
> > >  :c:func:`jbd2_journal_lock_updates()` /
> > >  :c:func:`jbd2_journal_unlock_updates()`. Ext4 uses this when it wants a
> > > --
> > > 2.24.1.735.g03f4e72817-goog
> > >




[Index of Archives]     [Reiser Filesystem Development]     [Ceph FS]     [Kernel Newbies]     [Security]     [Netfilter]     [Bugtraq]     [Linux FS]     [Yosemite National Park]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Samba]     [Device Mapper]     [Linux Media]

  Powered by Linux