On Fri 03-05-24 19:40:19, Ritesh Harjani (IBM) wrote: > This adds an initial first draft of iomap documentation. Hopefully this > will come useful to those who are looking for converting their > filesystems to iomap. Currently this is in text format since this is the > first draft. I would prefer to work on it's conversion to .rst once we > receive the feedback/review comments on the overall content of the document. > But feel free to let me know if we prefer it otherwise. > > A lot of this has been collected from various email conversations, code > comments, commit messages and/or my own understanding of iomap. Please > note a large part of this has been taken from Dave's reply to last iomap > doc patchset. Thanks to Dave, Darrick, Matthew, Christoph and other iomap > developers who have taken time to explain the iomap design in various emails, > commits, comments etc. > > Please note that this is not the complete iomap design doc. but a brief > overview of iomap. > > Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@xxxxxxxxx> > --- > Documentation/filesystems/index.rst | 1 + > Documentation/filesystems/iomap.txt | 289 ++++++++++++++++++++++++++++ > MAINTAINERS | 1 + > 3 files changed, 291 insertions(+) > create mode 100644 Documentation/filesystems/iomap.txt > > diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst > index 1f9b4c905a6a..c17b5a2ec29b 100644 > --- a/Documentation/filesystems/index.rst > +++ b/Documentation/filesystems/index.rst > @@ -34,6 +34,7 @@ algorithms work. > seq_file > sharedsubtree > idmappings > + iomap > > automount-support > > diff --git a/Documentation/filesystems/iomap.txt b/Documentation/filesystems/iomap.txt > new file mode 100644 > index 000000000000..4f766b129975 > --- /dev/null > +++ b/Documentation/filesystems/iomap.txt > @@ -0,0 +1,289 @@ > +Introduction > +============ > +iomap is a filesystem centric mapping layer that maps file's logical offset > +ranges to physical extents. 
It provides several iterator APIs which filesystems
> +can use for doing various file_operations, address_space_operations,
> +vm_operations, inode_operations etc. It supports APIs for doing direct-io,
> +buffered-io, lseek, dax-io, page-mkwrite, swap_activate and extent reporting
> +via fiemap.
> +
> +iomap is termed above as filesystem centric because it first calls
> +->iomap_begin() phase supplied by the filesystem to get a mapped extent and
                   ^^^^
'phase' sounds strange to me here. 'callback' or 'method'?

> +then loops over each folio within that mapped extent.
> +This is useful for filesystems because now they can allocate/reserve a much
> +larger extent at begin phase v/s the older approach of doing block allocation
> +of one block at a time by calling filesystem's provided ->get_blocks() routine.
> +
> +i.e. at a high level how iomap does write iter is [1]::
> +    user IO
> +    loop for file IO range
> +        loop for each mapped extent
> +            if (buffered) {
> +                loop for each page/folio {
> +                    instantiate page cache
> +                    copy data to/from page cache
> +                    update page cache state
> +                }
> +            } else { /* direct IO */
> +                loop for each bio {
> +                    pack user pages into bio
> +                    submit bio
> +                }
> +            }
> +        }
> +    }

I agree with Christoph that this would be better split into buffered and direct
IO parts, in particular because a combined handler like the one shown above
does not exist in iomap...

> +Motivation for filesystems to convert to iomap
> +===============================================
> +1. iomap is a modern filesystem mapping layer VFS abstraction.
> +2. It also supports large folios for buffered-writes. Large folios can help
> +improve filesystem buffered-write performance and can also improve overall
> +system performance.
> +3. Less maintenance overhead for individual filesystem maintainers.
> +iomap is able to abstract away common folio-cache related operations from the
> +filesystem to within the iomap layer itself. e.g.
allocating, instantiating, > +locking and unlocking of the folios for buffered-write operations are now taken > +care within iomap. No ->write_begin(), ->write_end() or direct_IO ^^^ of ^^ ->direct_IO > +address_space_operations are required to be implemented by filesystem using > +iomap. > + > + > +blocksize < pagesize path/large folios > +====================================== > +Large folio support or systems with large pagesize e.g 64K on Power/ARM64 and ^^^ e.g. > +4k blocksize, needs filesystems to support bs < ps paths. iomap embeds > +struct iomap_folio_state (ifs) within folio->private. ifs maintains uptodate > +and dirty bits for each subblock within the folio. Using ifs iomap can track > +update and dirty status of each block within the folio. This helps in supporting > +bs < ps path for such systems with large pagesize or with large folios [2]. > + > + > +struct iomap > +============= > +This structure defines a file mapping information of logical file offset range > +to a physical mapped extent on which an IO operation could be performed. > +An iomap reflects a single contiguous range of filesystem address space that > +either exists in memory or on a block device. > +1. The type field within iomap determines what type the range maps to e.g. > +IOMAP_HOLE, IOMAP_DELALLOC, IOMAP_UNWRITTEN etc. > + > +2. The flags field represent the state flags (e.g. IOMAP_F_*), most of which are > +set the by the filesystem during mapping time that indicates how iomap > +infrastructure should modify it's behaviour to do the right thing. > + > +3. private void pointer within iomap allows the filesystems to pass filesystem's > +private data from ->iomap_begin() to ->iomap_end() [3]. > +(see include/linux/iomap.h for more details) > + > + > +iomap operations > +================ > +iomap provides different iterator APIs for direct-io, buffered-io, lseek, > +dax-io, page-mkwrite, swap_activate and extent reporting via fiemap. 
It requires
> +various struct operations to be prepared by filesystem and to be supplied to
> +iomap iterator APIs either at the beginning of iomap api call or attaching it
> +during the mapping callback time e.g iomap_folio_ops is attached to
                                    ^^^ e.g.
> +iomap->folio_ops during ->iomap_begin() call.
> +
> +Following provides various ops to be supplied by filesystems to iomap layer for
> +doing different I/O types as discussed above.
   ^^^^^ performing
> +iomap_ops: IO interface specific operations
> +==========
> +The methods are designed to be used as pairs. The begin method creates the iomap
                                                     ^^^^^
I'd use the full name here as it's the first occurrence in the section. Hence
'iomap_begin'.

> +and attaches all the necessary state and information which subsequent iomap
> +methods & their callbacks might need. Once the iomap infrastructure has finished
> +working on the iomap it will call the end method to allow the filesystem to tear
                                          ^^^ iomap_end
> +down any unused space and/or structures it created for the specific iomap
> +context.
> +
> +Almost all iomap iterator APIs require filesystems to define iomap_ops so that
> +filesystems can be called into for providing logical to physical extent mapping,
> +wherever required. This is required by the iomap iter apis used for the
> +operations which are listed in the beginning of "iomap operations" section.
> + - iomap_begin: This either returns an existing mapping or reserve/allocates a
                                                             ^^^^^^^ reserves
> +   new mapping when called by iomap. pos and length are passed as function
> +   arguments. Filesystem returns the new mapping information within struct
> +   iomap which also gets passed as a function argument. Filesystems should
> +   provide the type of this extent in iomap->type for e.g. IOMAP_HOLE,
> +   IOMAP_UNWRITTEN and it should set the iomap->flags e.g. IOMAP_F_*
> +   (see details in include/linux/iomap.h)
> +
> +   Note that iomap_begin() call has srcmap passed as another argument.
This is > + mainly used only during the begin phase for COW mappings to identify where > + the reads are to be performed from. Filesystems needs to fill that mapping ^^ need > + information if iomap should read data for partially written blocks from a > + different location than the write target [4]. > + > + - iomap_end: Commit and/or unreserve space which was previously allocated > + using iomap_begin. During buffered-io, when a short writes occurs, > + filesystem may need to remove the reserved space that was allocated > + during ->iomap_begin. For filesystems that use delalloc allocation, we need ^^^ delayed > + to punch out delalloc extents from the range that are not dirty in the page > + cache. See comments in iomap_file_buffered_write_punch_delalloc() for more > + info [5][6]. > + > +iomap_dio_ops: Direct I/O operations structure for iomap. > +============= > +This gets passed with iomap_dio_rw(), so that iomap can call certain operations > +before submission or on completion of DIRECT_IO. ^^^ I'd use just 'direct IO' or 'DIO' > + - end_io: Required after bio completion for e.g. for conversion of unwritten ^^^ Well, the callback isn't really required AFAICT. So I'd write: 'Method called after bio completion. Can be used for example for conversion of unwritten extents." > + extents. > + > + - submit_io: This hook is required for e.g. by filesystems like btrfs who > + would like to do things like data replication for fs-handled RAID. Again somewhat hard to understand for me. How about: "Optional method to be called by iomap instead of simply calling submit_bio(). Useful for example for filesystems wanting to do data replication on submission of IO." > + > + - bio_set: This allows the filesystem to provide custom bio_set for allocating > + direct I/O bios. This will allow the filesystem who uses ->submit_io hook to ^^ which > + stash away additional information for filesystem use. 
Filesystems will
^^^
"If the filesystem provides its custom ->bi_end_io function, it needs to call
iomap_dio_bio_end_io() from that handler to handle dio completion [11]."
Also I'd move this note to the submit_io part above as it mostly relates to
that AFAIU.

> +   provide their custom ->bi_end_io function completion which should then call
> +   into iomap_dio_bio_end_io() for dio completion [11].
> +
> +iomap_writeback_ops: Writeback operations structure for iomap
> +====================
> +Writeback address space operations e.g. iomap_writepages(), requires the
> +filesystem to pass this ops field.
> + - map_blocks: map the blocks at the writeback time. This is called once per
> +   folio. Filesystems can return an existing mapping from a previous call if
> +   that mapping is still valid. This can race with paths which can invalidate
> +   previous mappings such as fallocate/truncate. Hence filesystems must have
> +   a mechanism by which it can validate if the previous mapping provided is
                         ^^ they
> +   still valid. Filesystems might need a per inode seq counter which can be
> +   used to verify if the underlying mapping of logical to physical blocks
> +   has changed since the last ->map_blocks call or not.
> +   They can then use wpc->iomap->validity_cookie to cache their seq count in
> +   ->map_blocks call [6].
> +
> + - prepare_ioend: Allows filesystems to process the extents before submission
> +   for e.g. convert COW extents to regular. This also allows filesystem to
> +   hook in a custom completion handler for processing bio completion e.g.
> +   conversion of unwritten extents.
> +   Note that ioends might need to be processed as an atomic completion unit
> +   (using transactions) when all the chained bios in the ioend have completed
> +   (e.g. for conversion of unwritten extents). iomap provides some helper
> +   methods for ioend merging and completion [12]. Look at comments in
> +   xfs_end_io() routine for more info.
> +
> + - discard_folio: In case if the filesystem has any delalloc blocks on it,
> +   then those needs to be punched out in this call. Otherwise, it may leave a
> +   stale delalloc mapping covered by a clean page that needs to be dirtied
> +   again before the delalloc mapping can be converted. This stale delalloc
> +   mapping can trip the direct I/O reads when done on the same region [7].

Here I miss an explanation of when iomap calls this callback. Apparently it is
called when mapping a folio for writeback fails for some reason.

> +iomap_folio_ops: Folio related operations structure for iomap.
> +================
> +When filesystem sets folio_ops in an iomap mapping it returns, ->get_folio()
> +and ->put_folio() will be called for each folio written to during write iter
> +time of buffered writes.
> + - get_folio: iomap will call ->get_folio() for every folio of the returned
> +   iomap mapping. Currently gfs2 uses this to start the transaction before
> +   taking the folio lock [8].
> +
> + - put_folio: iomap will call ->put_folio() once the data has been written to
                                                                      ^^^
copied into the folio for each folio of the returned iomap mapping.

> +   for each folio of the returned iomap mapping. GFS2 uses this to add data
> +   bufs to the transaction before unlocking the folio and then ending the
> +   transaction [9].
> +
> + - iomap_valid: Filesystem internal extent map can change while iomap is
                   ^^^
In case filesystem internal extent map ...

> +   iterating each folio of a cached iomap, so this hook allows iomap to detect
> +   that the iomap needs to be refreshed during a long running write operation.
> +   Filesystems can store an internal state (e.g. a sequence no.) in
> +   iomap->validity_cookie when the iomap is first mapped, to be able to detect
> +   changes between the mapping time and whenever iomap calls ->iomap_valid().
> +   This gets called with the locked folio. See iomap_write_begin() for more
> +   comments around ->iomap_valid() [10].
> + > + > +Locking > +======== > +iomap assumes two layers of locking. It requires locking above the iomap layer > +for IO serialisation (i_rwsem, invalidation lock) which is generally taken > +before calling into iomap iter functions. There is also locking below iomap for > +mapping/allocation serialisation on an inode (e.g. XFS_ILOCK or i_data_sem in > +ext4 etc) that is usually taken inside the mapping methods which filesystems > +supplied to the iomap infrastructure. This layer of locking needs to be > +independent of the IO path serialisation locking as it nests inside in the IO > +path but is also used without the filesystem IO path locking protecting it > +(e.g. in the iomap writeback path). > + > +General Locking order in iomap is: > +inode->i_rwsem (shared or exclusive) > + inode->i_mapping->invalidate_lock (exclusive) > + folio_lock() > + internal filesystem allocation lock (e.g. XFS_ILOCK or i_data_sem) > + > + > +Zeroing/Truncate Operations > +=========================== > +Filesystems can use iomap provided helper functions e.g. iomap_zero_range(), > +iomap_truncate_page() & iomap_file_unshare() for various truncate/fallocate or > +any other similar operations that requires zeroing/truncate. ^^^ require > +See above functions for more details on how these can be used by individual > +filesystems. > + > + > +Guideline for filesystem conversion to iomap > +============================================= > +The right approach is to first implement ->iomap_begin and (if necessary) > +->iomap_end to allow iomap to obtain a read-only mapping of a file range. In > +most cases, this is a relatively trivial conversion of the existing get_block() > +callback for read-only mappings. > + > +i.e. rewrite the filesystem's get_block(create = false) implementation to use ^^^ I.e. > +the new ->iomap_begin() implementation. i.e. get_block wraps around the outside ^^^ I don't understand this sentence... 
> +and converts the information from bufferhead-based map to what iomap expects. > +This will convert all the existing read-only mapping users to use the new iomap > +mapping function internally. This way the iomap mapping function can be further > +tested without needing to implement any other iomap APIs. > + > +FIEMAP operation is a really good first target because it is trivial to > +implement support for it and then to determine that the extent map iteration is > +correct from userspace. i.e. if FIEMAP is returning the correct information, ^^ I.e., > +it's a good sign that other read-only mapping operations will also do the right > +thing. > + > +Once everything is working like this, then convert all the other read-only > +mapping operations to use iomap. Done one at a time, regressions should be self > +evident. The only likely complexity at this point will be the buffered read IO > +path because of bufferheads. The buffered read IO paths doesn't need to be > +converted yet, though the direct IO read path should be converted in this phase. > + > +The next thing to do is implement get_blocks(create = true) functionality in the > +->iomap_begin/end() methods. Then convert the direct IO write path to iomap, and > +start running fsx w/ DIO enabled in earnest on filesystem. This will flush out > +lots of data integrity corner case bug that the new write mapping implementation ^^^ bugs > +introduces. > + > +(TODO - get more info on this from Dave): At this point, converting the entire > +get_blocks() path to call the iomap functions and convert the iomaps to > +bufferhead maps is possible. This will get the entire filesystem using the new ^^^ make ^^ use > +mapping functions, and they should largely be debugged and working correctly > +after this step. > + > +This now largely leaves the buffered read and write paths to be converted. 
The
> +mapping functions should all work correctly, so all that needs to be done is
> +rewriting all the code that interfaces with bufferheads to interface with iomap
> +and folios. It is rather easier first to get regular file I/O (without any
> +fancy feature like fscrypt, fsverity, data=journaling) converted to use iomap
> +and then work on directory handling conversion to iomap.

Well, I believe this comment about directories is pretty much specific to ext2.
Iomap never attempted to provide any infrastructure for metadata handling -
each filesystem is on its own here. ext2 is kind of special as it was
experimenting with handling directories more like data. In retrospect I don't
think it was a particularly successful experiment but you never know until you
try :).

> +
> +The rest is left as an exercise for the reader, as it will be different for
> +every filesystem.

								Honza
--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR