"Darrick J. Wong" <djwong@xxxxxxxxxx> writes: > On Fri, May 03, 2024 at 07:40:19PM +0530, Ritesh Harjani (IBM) wrote: >> This adds an initial first draft of iomap documentation. Hopefully this >> will come useful to those who are looking for converting their >> filesystems to iomap. Currently this is in text format since this is the >> first draft. I would prefer to work on it's conversion to .rst once we >> receive the feedback/review comments on the overall content of the document. >> But feel free to let me know if we prefer it otherwise. >> >> A lot of this has been collected from various email conversations, code >> comments, commit messages and/or my own understanding of iomap. Please >> note a large part of this has been taken from Dave's reply to last iomap >> doc patchset. Thanks to Dave, Darrick, Matthew, Christoph and other iomap >> developers who have taken time to explain the iomap design in various emails, >> commits, comments etc. >> >> Please note that this is not the complete iomap design doc. but a brief >> overview of iomap. >> >> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@xxxxxxxxx> >> --- >> Documentation/filesystems/index.rst | 1 + >> Documentation/filesystems/iomap.txt | 289 ++++++++++++++++++++++++++++ >> MAINTAINERS | 1 + >> 3 files changed, 291 insertions(+) >> create mode 100644 Documentation/filesystems/iomap.txt >> >> diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst >> index 1f9b4c905a6a..c17b5a2ec29b 100644 >> --- a/Documentation/filesystems/index.rst >> +++ b/Documentation/filesystems/index.rst >> @@ -34,6 +34,7 @@ algorithms work. >> seq_file >> sharedsubtree >> idmappings >> + iomap >> >> automount-support >> >> diff --git a/Documentation/filesystems/iomap.txt b/Documentation/filesystems/iomap.txt >> new file mode 100644 >> index 000000000000..4f766b129975 >> --- /dev/null >> +++ b/Documentation/filesystems/iomap.txt >> @@ -0,0 +1,289 @@ >> +Introduction >> +============ >> +iomap is a filesystem centric mapping layer that maps file's logical offset > > It's really more of a library for filesystem implementations that need > to handle file operations that involve physical storage devices. > Maybe something like this... iomap is a filesystem library for handling various filesystem operations that involves mapping of file's logical offset ranges to physical extents. >> +ranges to physical extents. It provides several iterator APIs which filesystems > > Minor style nit: Start each sentence on a new line, so that future > revisions to a sentence don't produce diff for adjoining sentences. > I will try to keep that in mind wherever possible. >> +can use for doing various file_operations, address_space_operations, > > "...for doing various file and pagecache operations, such as: > > * Pagecache reads and writes > * Page faults > * Dirty pagecache writeback > * FIEMAP > * directio > * lseek > * swapfile activation > <etc> > Thanks this is more clear. >> +vm_operations, inode_operations etc. It supports APIs for doing direct-io, >> +buffered-io, lseek, dax-io, page-mkwrite, swap_activate and extent reporting >> +via fiemap. >> + >> +iomap is termed above as filesystem centric because it first calls >> +->iomap_begin() phase supplied by the filesystem to get a mapped extent and >> +then loops over each folio within that mapped extent. 
>> +vm_operations, inode_operations etc. It supports APIs for doing direct-io,
>> +buffered-io, lseek, dax-io, page-mkwrite, swap_activate and extent reporting
>> +via fiemap.
>> +
>> +iomap is termed above as filesystem centric because it first calls
>> +->iomap_begin() phase supplied by the filesystem to get a mapped extent and
>> +then loops over each folio within that mapped extent.
>> +This is useful for filesystems because now they can allocate/reserve a much
>> +larger extent at begin phase v/s the older approach of doing block allocation
>> +of one block at a time by calling filesystem's provided ->get_blocks() routine.
>
> I think this should be more general because all the iomap operations
> work this way, not just the ones that involve folios:
>
> "Unlike the classic Linux IO model which breaks file io into small units
> (generally memory pages or blocks) and looks up space mappings on the
> basis of that unit, the iomap model asks the filesystem for the largest
> space mappings that it can create for a given file operation and
> initiates operations on that basis.
> This strategy improves the filesystem's visibility into the size of the
> operation being performed, which enables it to combat fragmentation with
> larger space allocations when possible.
> Larger mappings improve runtime performance by amortizing the cost of a
> mapping function call into the filesystem across a larger amount of
> data."
>
> "At a high level, an iomap operation looks like this:
>
> for each byte in the operation range,
>     obtain space mapping via ->iomap_begin
>     for each sub-unit of work,
>         revalidate the mapping
>         do the work
>         increment file range cursor
>     if needed, release the mapping via ->iomap_end
>

Thanks for making it more general. I will use your above version.

> "Each iomap operation will be covered in more detail below."
>
>> +
>> +i.e. at a high level how iomap does write iter is [1]::
>> +	user IO
>> +	  loop for file IO range
>> +	    loop for each mapped extent
>> +	      if (buffered) {
>> +		loop for each page/folio {
>> +		  instantiate page cache
>> +		  copy data to/from page cache
>> +		  update page cache state
>> +		}
>> +	      } else { /* direct IO */
>> +		loop for each bio {
>> +		  pack user pages into bio
>> +		  submit bio
>> +		}
>> +	      }
>> +	    }
>> +	  }
>
> I agree with the others that each io operation (pagecache, directio,
> etc) should be a separate section.
>

Actually, the idea for adding this loop was mainly to show that the
->iomap_begin operation first maps the extent and then we loop over each
folio. This in some sense is already covered in your paragraph above,
so I am not sure how much value adding these loops for individual
operations would add. But I will re-assess that while doing v2.
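Alternatively, instead of repeating the loop per operation, v2 could show
the common iterator pattern once, in C. A rough sketch of how a caller
drives iomap_iter() (simplified from how fs/iomap/ callers use it;
do_write_work() is a made-up placeholder for whatever per-mapping work
the operation performs):

	struct iomap_iter iter = {
		.inode	= inode,
		.pos	= pos,
		.len	= length,
		.flags	= IOMAP_WRITE,
	};
	int ret;

	/*
	 * Each pass calls the filesystem's ->iomap_begin() to fill
	 * iter.iomap with the largest mapping it can create for the
	 * remaining range; the caller then works on that mapping and
	 * records its progress in iter.processed. The next iomap_iter()
	 * call invokes ->iomap_end() (if supplied) and advances the
	 * cursor until the whole range is done.
	 */
	while ((ret = iomap_iter(&iter, ops)) > 0)
		iter.processed = do_write_work(&iter);	/* hypothetical */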
>> +
>> +
>> +Motivation for filesystems to convert to iomap
>> +===============================================
>
> Maybe this is the section titled buffered IO?
>

No, the idea here was to list the modern filesystem features which the
iomap library supports, e.g. large folios, upcoming atomic writes for
buffered-io, abstracting page cache operations completely...
But I got your point. Let me find a better title for this section :)

>> +1. iomap is a modern filesystem mapping layer VFS abstraction.
>
> I don't think this is much of a justification -- buffer heads were
> modern once.
>

Sure. Depending upon what ends up being the section name, I will see
what needs to come here, or will drop it.

>> +2. It also supports large folios for buffered-writes. Large folios can help
>> +improve filesystem buffered-write performance and can also improve overall
>> +system performance.
>
> How does it improve pagecache performance?  Is this a result of needing
> to take fewer locks?  Is it because page table walks can stop early?  Is
> it because we can reduce the number of memcpy calls?
>

You mean to ask how large folios in iomap improve buffered-write
performance & system performance?
- fewer struct folio entries to iterate over while doing buffered-I/O.
- large folios in the pagecache help in reducing MM fragmentation.
- smaller LRU list sizes mean less work overhead for the MM subsystem,
  which means better system performance.
I will go ahead and add these subpoints for point 2 then.

>> +3. Less maintenance overhead for individual filesystem maintainers.
>> +iomap is able to abstract away common folio-cache related operations from the
>> +filesystem to within the iomap layer itself. e.g. allocating, instantiating,
>> +locking and unlocking of the folios for buffered-write operations are now taken
>> +care of within iomap. No ->write_begin(), ->write_end() or direct_IO
>> +address_space_operations are required to be implemented by a filesystem using
>> +iomap.
>
> I think it's stronger to say that iomap abstracts away /all/ the
> pagecache operations, which gets each fs implementation out of the
> business of doing that itself.
>

Sure, makes sense. Will do that.

>> +
>> +
>> +blocksize < pagesize path/large folios
>> +======================================
>
> Is this a subsection?  Should the annotation be "----" instead of "====" ?
>

Ok.

>> +Large folio support or systems with large pagesize e.g 64K on Power/ARM64 and
>> +4k blocksize, needs filesystems to support bs < ps paths. iomap embeds
>> +struct iomap_folio_state (ifs) within folio->private. ifs maintains uptodate
>> +and dirty bits for each subblock within the folio. Using ifs iomap can track
>> +uptodate and dirty status of each block within the folio. This helps in
>> +supporting the bs < ps path for such systems with large pagesize or with
>> +large folios [2].
>
> TBH I think it's sufficient to mention that the iomap pagecache
> implementation tracks uptodate and dirty status for each of the fsblocks
> cached by a folio to reduce read and write amplification.  No need to go
> into the details of exactly how that happens.
>
> Perhaps also mention that variable sized folios are required to handle
> fsblock > pagesize filesystems.
>

Ok.

>> +
>> +
>> +struct iomap
>> +=============
>> +This structure defines a file mapping information of logical file offset range
>> +to a physical mapped extent on which an IO operation could be performed.
>> +An iomap reflects a single contiguous range of filesystem address space that
>> +either exists in memory or on a block device.
>
> "This structure contains information to map a range of file data to
> physical space on a storage device.
> The information is useful for translating file operations into action.
> The actions taken are specific to the target of the operation, such as
> disk cache, physical storage devices, or another part of the kernel."
>
> "Each iomap operation will pass operation flags as needed to the
> ->iomap_begin function.
> These flags will be documented in the operation-specific sections below.
> For a write operation, IOMAP_WRITE will be set.
> For any other operation, IOMAP_WRITE will not be set."
>
> ...and then down in the section on pagecache operations, you can do
> something like:
>
> "For pagecache operations, the @flags argument to ->iomap_begin can be
> any of the following:"
>
> * IOMAP_WRITE: write to a file
>
> * IOMAP_WRITE | IOMAP_UNSHARE: if any of the space backing a file is
>   shared, mark that part of the pagecache dirty so that it will be
>   unshared during writeback
>
> * IOMAP_WRITE | IOMAP_FAULT: this is a write fault for mmapped
>   pagecache
>
> * IOMAP_ZERO: zero parts of a file
>
> * 0: read from the pagecache
>

Ok. Thanks for this.
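For reference, maybe v2 should also quote the (abridged) structure
itself, so that these field descriptions have something concrete to hang
off of. Roughly, from include/linux/iomap.h (the exact fields vary
between kernel versions):

	struct iomap {
		u64		addr;	/* disk offset of mapping, bytes */
		loff_t		offset;	/* file offset of mapping, bytes */
		u64		length;	/* length of mapping, bytes */
		u16		type;	/* type of mapping (IOMAP_HOLE, ...) */
		u16		flags;	/* flags for mapping (IOMAP_F_*) */
		struct block_device	*bdev;	/* block device for I/O */
		struct dax_device	*dax_dev; /* dax_dev for dax operations */
		void		*inline_data;	/* buffer for IOMAP_INLINE */
		void		*private;	/* filesystem private */
		const struct iomap_folio_ops *folio_ops;
		u64		validity_cookie; /* checked by ->iomap_valid() */
	};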
> Ugh, this namespace collides with IOMAP_{HOLE..INLINE}.  Sigh.
>
>> +1. The type field within iomap determines what type the range maps to e.g.
>> +IOMAP_HOLE, IOMAP_DELALLOC, IOMAP_UNWRITTEN etc.
>
> Maybe have a list here stating what a "DELALLOC" is?  Not all readers
> are going to know what these types are.
>
> * IOMAP_HOLE: No storage has been allocated.  This must never be
>   returned in response to an IOMAP_WRITE operation because writes must
>   allocate and map space, and return the mapping.
>
> * IOMAP_DELALLOC: A promise to allocate space at a later time ("delayed
>   allocation").  If the filesystem returns IOMAP_F_NEW here and the
>   write fails, the ->iomap_end function must delete the reservation.
>
> * IOMAP_MAPPED: The file range maps to this space on the storage
>   device.  The device is returned in @bdev or @dax_dev, and the device
>   address of the space is returned in @addr.
>
> * IOMAP_UNWRITTEN: The file range maps to this space on the storage
>   device, but the space has not yet been initialized.  The device is
>   returned in @bdev or @dax_dev, and the device address of the space is
>   returned in @addr.  Reads will return zeroes, and the write ioend
>   must update the mapping to MAPPED.
>
> * IOMAP_INLINE: The file range maps to the in-memory buffer
>   @inline_data.  For IOMAP_WRITE operations, the ->iomap_end function
>   presumably handles persisting the data.
>

I was thinking that the corresponding header file would already have
that, which is why we need not cover it here. I also wanted to keep the
initial RFC document as a brief overview. But sure, thanks for doing
this. I will add this info in v2.

>> +2. The flags field represents the state flags (e.g. IOMAP_F_*), most of which
>> +are set by the filesystem during mapping time and indicate how the iomap
>> +infrastructure should modify its behaviour to do the right thing.
>
> I think you should summarize the IOMAP_F_ flags here.
>

Sure. Thanks!

>> +3. private void pointer within iomap allows the filesystems to pass filesystem's
>> +private data from ->iomap_begin() to ->iomap_end() [3].
>> +(see include/linux/iomap.h for more details)
>
> 4. Maybe we should mention what bdev/dax_dev/inline_data do here?
>

Let me read about this.

> 5. What does the validity cookie do?
>

Sure, I will add that.

>> +
>> +
>> +iomap operations
>> +================
>> +iomap provides different iterator APIs for direct-io, buffered-io, lseek,
>> +dax-io, page-mkwrite, swap_activate and extent reporting via fiemap. It requires
>> +various struct operations to be prepared by the filesystem and to be supplied to
>> +iomap iterator APIs either at the beginning of the iomap api call or by attaching
>> +them at mapping callback time, e.g. iomap_folio_ops is attached to
>> +iomap->folio_ops during the ->iomap_begin() call.
>> +
>> +The following provides various ops to be supplied by filesystems to the iomap
>> +layer for doing the different I/O types discussed above.
>> +
>> +iomap_ops: IO interface specific operations
>> +==========
>> +The methods are designed to be used as pairs. The begin method creates the iomap
>> +and attaches all the necessary state and information which subsequent iomap
>> +methods & their callbacks might need. Once the iomap infrastructure has finished
>> +working on the iomap it will call the end method to allow the filesystem to tear
>> +down any unused space and/or structures it created for the specific iomap
>> +context.
>> +
>> +Almost all iomap iterator APIs require filesystems to define iomap_ops so that
>> +filesystems can be called into for providing logical to physical extent mapping,
>> +wherever required. This is required by the iomap iter apis used for the
>> +operations which are listed in the beginning of the "iomap operations" section.
>> + - iomap_begin: This either returns an existing mapping or reserves/allocates a
>> +   new mapping when called by iomap. pos and length are passed as function
>> +   arguments. The filesystem returns the new mapping information within struct
>> +   iomap which also gets passed as a function argument. Filesystems should
>> +   provide the type of this extent in iomap->type, e.g. IOMAP_HOLE,
>> +   IOMAP_UNWRITTEN, and should set the iomap->flags, e.g. IOMAP_F_*
>> +   (see details in include/linux/iomap.h)
>> +
>> +   Note that the iomap_begin() call has srcmap passed as another argument. This
>> +   is mainly used only during the begin phase for COW mappings to identify where
>> +   the reads are to be performed from. Filesystems need to fill in that mapping
>> +   information if iomap should read data for partially written blocks from a
>> +   different location than the write target [4].
>> +
>> + - iomap_end: Commit and/or unreserve space which was previously allocated
>> +   using iomap_begin. During buffered-io, when a short write occurs, the
>> +   filesystem may need to remove the reserved space that was allocated
>> +   during ->iomap_begin. For filesystems that use delalloc allocation, we need
>> +   to punch out delalloc extents from the range that are not dirty in the page
>> +   cache. See comments in iomap_file_buffered_write_punch_delalloc() for more
>> +   info [5][6].
>
> I think these three sections "struct iomap", "iomap operations", and
> "iomap_ops" should come first before you talk about specific things like
> buffered io operations.
>

Ok.
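To make the begin/end pairing concrete, maybe v2 can carry a small
example. Below is a rough sketch for an imaginary filesystem -- the
myfs_* names and the myfs_get_extent() helper are made up purely for
illustration, and a real implementation must also handle locking, COW,
delalloc, srcmap etc.:

	static int myfs_iomap_begin(struct inode *inode, loff_t pos,
			loff_t length, unsigned flags, struct iomap *iomap,
			struct iomap *srcmap)
	{
		u64 addr;	/* disk address in bytes, or IOMAP_NULL_ADDR */
		u64 len;	/* length of the mapping in bytes */
		bool new;
		int error;

		/*
		 * Hypothetical helper: look up (and, if IOMAP_WRITE is
		 * set, allocate) the largest contiguous extent covering
		 * @pos, instead of mapping one block at a time like the
		 * old ->get_block() interface did.
		 */
		error = myfs_get_extent(inode, pos, length,
					flags & IOMAP_WRITE,
					&addr, &len, &new);
		if (error)
			return error;

		iomap->offset = pos;	/* real code aligns this to block boundaries */
		iomap->length = len;
		iomap->bdev = inode->i_sb->s_bdev;
		if (addr == IOMAP_NULL_ADDR) {
			iomap->type = IOMAP_HOLE;
			iomap->addr = IOMAP_NULL_ADDR;
		} else {
			iomap->type = IOMAP_MAPPED;
			iomap->addr = addr;
			if (new)
				iomap->flags |= IOMAP_F_NEW; /* freshly allocated */
		}
		return 0;
	}

	static int myfs_iomap_end(struct inode *inode, loff_t pos,
			loff_t length, ssize_t written, unsigned flags,
			struct iomap *iomap)
	{
		/*
		 * On a short write (written < length), a real filesystem
		 * would trim back whatever ->iomap_begin() reserved
		 * beyond the bytes actually written.
		 */
		return 0;
	}

	const struct iomap_ops myfs_iomap_ops = {
		.iomap_begin	= myfs_iomap_begin,
		.iomap_end	= myfs_iomap_end,
	};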
> I'm going to stop here for the day.
>
> --D
>

Thanks, Darrick, for taking the time to do this extensive review. I will
take yours and others' review comments into account and will work on v2.

-ritesh