Re: [PATCH] Documentation: document the design of iomap and how to port

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Tue, Jun 11, 2024 at 12:13:22PM +0530, Ritesh Harjani wrote:
> "Darrick J. Wong" <djwong@xxxxxxxxxx> writes:
> 
> > On Mon, Jun 10, 2024 at 02:27:28PM +0530, Ritesh Harjani wrote:
> >> 
> >> Hello Darrick,
> >> 
> >> "Darrick J. Wong" <djwong@xxxxxxxxxx> writes:
> >> 
> >> > From: Darrick J. Wong <djwong@xxxxxxxxxx>
> >> >
> >> > This is the fourth attempt at documenting the design of iomap and how to
> >> 
> >> I agree that this isn't needed in the commit msg ("fourth attempt").
> >
> > Ok.  "Coapture the design of iomap and how to port..."
> >
> >> > port filesystems to use it.  Apologies for all the rst formatting, but
> >> > it's necessary to distinguish code from regular text.
> >> >
> >> > A lot of this has been collected from various email conversations, code
> >> > comments, commit messages, my own understanding of iomap, and
> >> > Ritesh/Luis' previous efforts to create a document.  Please note a large
> >> > part of this has been taken from Dave's reply to last iomap doc
> >> > patchset. Thanks to Ritesh, Luis, Dave, Darrick, Matthew, Christoph and
> >> > other iomap developers who have taken time to explain the iomap design
> >> > in various emails, commits, comments etc.
> >> >
> >> > Cc: Dave Chinner <david@xxxxxxxxxxxxx>
> >> > Cc: Matthew Wilcox <willy@xxxxxxxxxxxxx>
> >> > Cc: Christoph Hellwig <hch@xxxxxxxxxxxxx>
> >> > Cc: Christian Brauner <brauner@xxxxxxxxxx>
> >> > Cc: Ojaswin Mujoo <ojaswin@xxxxxxxxxxxxx>
> >> > Cc: Jan Kara <jack@xxxxxxx>
> >> > Cc: Luis Chamberlain <mcgrof@xxxxxxxxxx>
> >> > Inspired-by: Ritesh Harjani (IBM) <ritesh.list@xxxxxxxxx>
> >> 
> >> I am not sure if this is even a valid or accepted tag.
> >> But sure thanks! :)
> >
> > They're freeform tags, so they can be everything everyone wants them to
> > be!  Drum circle kumbaya etc. :P
> >
> >> > Signed-off-by: Darrick J. Wong <djwong@xxxxxxxxxx>
> >> > ---
> >> >  Documentation/filesystems/index.rst |    1 
> >> >  Documentation/filesystems/iomap.rst | 1060 +++++++++++++++++++++++++++++++++++
> >> >  MAINTAINERS                         |    1 
> >> >  3 files changed, 1062 insertions(+)
> >> >  create mode 100644 Documentation/filesystems/iomap.rst
> >> >
> >> > diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst
> >> > index 8f5c1ee02e2f..b010cc8df32d 100644
> >> > --- a/Documentation/filesystems/index.rst
> >> > +++ b/Documentation/filesystems/index.rst
> >> > @@ -34,6 +34,7 @@ algorithms work.
> >> >     seq_file
> >> >     sharedsubtree
> >> >     idmappings
> >> > +   iomap
> >> >  
> >> >     automount-support
> >> >  
> >> > diff --git a/Documentation/filesystems/iomap.rst b/Documentation/filesystems/iomap.rst
> >> > new file mode 100644
> >> > index 000000000000..a478b55e4135
> >> > --- /dev/null
> >> > +++ b/Documentation/filesystems/iomap.rst
> >> > @@ -0,0 +1,1060 @@
> >> > +.. SPDX-License-Identifier: GPL-2.0
> >> > +.. _iomap:
> >> > +
> >> > +..
> >> > +        Dumb style notes to maintain the author's sanity:
> >> > +        Please try to start sentences on separate lines so that
> >> > +        sentence changes don't bleed colors in diff.
> >> > +        Heading decorations are documented in sphinx.rst.
> >> > +
> >> > +============================
> >> > +VFS iomap Design and Porting
> >> > +============================
> >> > +
> >> > +.. toctree::
> >> > +
> >> > +Introduction
> >> > +============
> >> > +
> >> > +iomap is a filesystem library for handling various filesystem operations
> >> > +that involves mapping of file's logical offset ranges to physical
> >> > +extents.
> >> > +This origins of this library is the file I/O path that XFS once used; it
> >> > +has now been extended to cover several other operations.
> >> > +The library provides various APIs for implementing various file and
> >>                         ^^^^ redundant "various"
> >> 
> >> > +pagecache operations, such as:
> >> > +
> >> > + * Pagecache reads and writes
> >> > + * Folio write faults to the pagecache
> >> > + * Writeback of dirty folios
> >> > + * Direct I/O reads and writes
> >> 
> >> Dax I/O reads and writes.
> >> ... as well please?
> >
> > It's really fsdax I/O reads, writes, loads, and stores, isn't it?
> >
> 
> It felt like dax_iomap_rw() belongs to fs/iomap. 
> But nevertheless, we could skip it if we are targetting fs/iomap/
> lib.

Logically, it does.  However, there's a fair bit of code in fs/dax.c
that is used by the iomap iterators; all of those would have to become
non-static symbols to make that separation happen.

Maybe it still should; the pagecache is already like that.

<snip>

> >> I think it is better if we describe ->iomap_begin and ->iomap_end
> >> in proper sub-sections. Otherwise this looks like we have clobbered
> >> all the information together :)
> >> 
> >> ->iomap_begin 
> >> ^^^^^^^^^^^^^^^^^
> >
> > Yes, I like the explicit section headings better.
> >
> 
> yup.
> 
> >> This either returns an existing mapping or reserve/allocates a new
> >> mapping.
> >
> > That's a filesystem specific detail -- all that iomap cares about is
> > that the fs communicates a mapping.  Maybe the fs actually had to do a
> > bunch of work to get that mapping, or maybe it's already laid out
> > statically, ala zonefs.  Either way, it's not a concern of the iomap
> > library.
> >
> >> logical file pos and length are in bytes which gets passed
> >> as function arguments. Filesystem returns the new mapping information
> >> within ``struct iomap`` which also gets passed as a function argument.
> >> Filesystems should provide the details of this mapping by filling
> >> various fields within ``struct iomap``.
> >
> > "iomap operations call ->iomap_begin to obtain one file mapping for the
> > range of bytes specified by pos and length for the file inode.  This
> > mapping should be returned through the iomap pointer.  The mapping must
> > cover at least the first byte of the supplied file range, but it does
> > not need to cover the entire requested range."
> >
> 
> I like it. Thanks for adding that detail in the last line.
> 
> >>   @srcmap agument:
> >>     Note that ->iomap_begin call has srcmap passed as another argument. This is
> >>     mainly used only during the begin phase for COW mappings to identify where
> >>     the reads are to be performed from. Filesystems needs to fill that mapping
> >>     information if iomap should read data for partially written blocks from a
> >>     different location than the write target [4].
> >>   @flags argument:
> >>     These are the operation types which iomap supports. 
> >>     IOMAP_WRITE: For doing write I/O.
> >>     <...>
> >>     IOMAP_ZERO:
> >>     IOMAP_REPORT:		
> >>     IOMAP_FAULT:		
> >>     IOMAP_DIRECT:		
> >>     IOMAP_NOWAIT:		
> >>     IOMAP_OVERWRITE_ONLY:
> >>     IOMAP_UNSHARE:
> >>     IOMAP_DAX:	
> >
> > I think it's /much/ more valuable to document the exact combinations
> > that will be passed to ->iomap_begin further down where we talk about
> > specific operations that iomap performs.
> >
> > Otherwise, someone is going to look at this list and wonder if they
> > really need to figure out what IOMAP_ZERO|IOMAP_FAULT|IOMAP_DAX means,
> > and if it's actually possible (it's not).
> >
> 
> Sure.
> 
> >> 
> >> ->iomap_end
> >> ^^^^^^^^^^^^^^^^^
> >> 
> >> Commit and/or unreserve space which was previously allocated/reserved
> >> in ``->iomap_begin``. For e.g. during buffered-io, when a short writes
> >> occurs, filesystem may need to remove the reserved space that was
> >> allocated during ->iomap_begin.
> >> For filesystems that use delalloc allocation, we may need to punch out
> >> delalloc extents from the range that are not dirty in
> >> the page cache. See comments in
> >> iomap_file_buffered_write_punch_delalloc() for more info [5][6].
> >> 
> >> (IMHO) I find above definitions more descriptive.
> >
> > I don't want to merge the general description with pagecache specific
> > areas.  They already cover punch_delalloc.
> >
> 
> sure.
> 
> >> > +
> >> > +Each iomap operation describes the requested operation through the
> >> > +``flags`` argument.
> >> > +The exact value of ``flags`` will be documented in the
> >> > +operation-specific sections below, but these principles apply generally:
> >> > +
> >> > + * For a write operation, ``IOMAP_WRITE`` will be set.
> >> > +   Filesystems must not return ``IOMAP_HOLE`` mappings.
> >> > +
> >> > + * For any other operation, ``IOMAP_WRITE`` will not be set.
> >> > +
> >> 
> >> Direct-io related operation which bypasses pagecache use IOMAP_DIRECT.
> >
> > That's covered in the pagecache/directio/dax subsection because I wanted
> > to document specific combinations that filesystem authors should expect.
> >
> 
> The points mentioned above were targetting buffered-io, dax, so I
> thought we could add direct-io related flag as well here.

Given all the confusion with the later sections I'll just remove it.

<snip>

> >> > +
> >> > +   * **IOMAP_INLINE**: The file range maps to the memory buffer
> >> > +     specified by ``inline_data``.
> >> > +     For write operation, the ``->iomap_end`` function presumably
> >> > +     handles persisting the data.
> >> 
> >> Is it? Or do we just mark the inode as dirty?
> >
> > gfs2 actually starts a transaction in ->iomap_begin and commits or
> > cancels it in ->iomap_end.
> >
> 
> ok.
> 
> >> > +     The ``addr`` field must be set to ``IOMAP_NULL_ADDR``.
> >> > +
> >> > + * ``flags`` describe the status of the space mapping.
> >> > +   These flags should be set by the filesystem in ``->iomap_begin``:
> >> > +
> >> > +   * **IOMAP_F_NEW**: The space under the mapping is newly allocated.
> >> > +     Areas that will not be written to must be zeroed.
> >> 
> >> In case of DAX, we have to invalidate those existing mappings which
> >> might have a "hole" page mapped.
> >
> > Isn't that an internal detail of the fs/dax.c code?  The filesystem
> > doesn't have to do the invalidation or even know about hole pages.
> >
> 
> Right. Sorry about that. I assumed dax_iomap_rw() implementation
> is a part of iomap :)
> 
> >> > +     If a write fails and the mapping is a space reservation, the
> >> > +     reservation must be deleted.
> >> > +
> >> > +   * **IOMAP_F_DIRTY**: The inode will have uncommitted metadata needed
> >> > +     to access any data written.
> >> > +     fdatasync is required to commit these changes to persistent
> >> > +     storage.
> >> > +     This needs to take into account metadata changes that *may* be made
> >> > +     at I/O completion, such as file size updates from direct I/O.
> >> > +
> >> > +   * **IOMAP_F_SHARED**: The space under the mapping is shared.
> >> > +     Copy on write is necessary to avoid corrupting other file data.
> >> > +
> >> > +   * **IOMAP_F_BUFFER_HEAD**: This mapping requires the use of buffer
> >> > +     heads for pagecache operations.
> >> > +     Do not add more uses of this.
> >> > +
> >> > +   * **IOMAP_F_MERGED**: Multiple contiguous block mappings were
> >> > +     coalesced into this single mapping.
> >> > +     This is only useful for FIEMAP.
> >> > +
> >> > +   * **IOMAP_F_XATTR**: The mapping is for extended attribute data, not
> >> > +     regular file data.
> >> > +     This is only useful for FIEMAP.
> >> > +
> >> > +   * **IOMAP_F_PRIVATE**: Starting with this value, the upper bits can
> >> > +     be set by the filesystem for its own purposes.
> >> > +
> >> > +   These flags can be set by iomap itself during file operations.
> >> > +   The filesystem should supply an ``->iomap_end`` function to observe
> >> > +   these flags:
> >> > +
> >> > +   * **IOMAP_F_SIZE_CHANGED**: The file size has changed as a result of
> >> > +     using this mapping.
> >> > +
> >> > +   * **IOMAP_F_STALE**: The mapping was found to be stale.
> >> > +     iomap will call ``->iomap_end`` on this mapping and then
> >> > +     ``->iomap_begin`` to obtain a new mapping.
> >> > +
> >> > +   Currently, these flags are only set by pagecache operations.
> >> > +
> >> > + * ``addr`` describes the device address, in bytes.
> >> > +
> >> > + * ``bdev`` describes the block device for this mapping.
> >> > +   This only needs to be set for mapped or unwritten operations.
> >> > +
> >> > + * ``dax_dev`` describes the DAX device for this mapping.
> >> > +   This only needs to be set for mapped or unwritten operations, and
> >> > +   only for a fsdax operation.
> >> 
> >> Looks like we can make this union (bdev and dax_dev). Since depending
> >> upon IOMAP_DAX or not we only set either dax_dev or bdev.
> >
> > Separate patch. ;)
> >
> 
> Yes, in a way I was trying to get an opinion from you and others on
> whether it make sense to make bdev and dax_dev as union :)
> 
> Looks like this series [1] could be the reason for that. 
> 
> [1]: https://lore.kernel.org/all/20211129102203.2243509-1-hch@xxxxxx/#t
> 
> I also don't see any reference to dax code from fs/iomap/buffered-io.c
> So maybe we don't need this dax.h header in this file.
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index c5802a459334..e1a6cca3cec2 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -10,7 +10,6 @@
>  #include <linux/pagemap.h>
>  #include <linux/uio.h>
>  #include <linux/buffer_head.h>
> -#include <linux/dax.h>
>  #include <linux/writeback.h>
>  #include <linux/list_sort.h>
>  #include <linux/swap.h>

Yes, given that both you and hch have mentioned it, could one of you
send a cleanup series for that?

> 
> >> Sorry Darrick. I will stop here for now.
> >> I will continue it from here later.
> >
> > Ok.  The rest of the doc will hopefully make it more obvious why there's
> > the generic discussion up here.
> >
> 
> Sure. I am going through it.

<nod>

--D

> -ritesh
> 




[Index of Archives]     [Linux Ext4 Filesystem]     [Union Filesystem]     [Filesystem Testing]     [Ceph Users]     [Ecryptfs]     [NTFS 3]     [AutoFS]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux Cachefs]     [Reiser Filesystem]     [Linux RAID]     [NTFS 3]     [Samba]     [Device Mapper]     [CEPH Development]

  Powered by Linux