On Fri 03-05-24 19:40:19, Ritesh Harjani (IBM) wrote: > This adds an initial first draft of iomap documentation. Hopefully this > will come useful to those who are looking for converting their > filesystems to iomap. Currently this is in text format since this is the > first draft. I would prefer to work on it's conversion to .rst once we > receive the feedback/review comments on the overall content of the document. > But feel free to let me know if we prefer it otherwise. > > A lot of this has been collected from various email conversations, code > comments, commit messages and/or my own understanding of iomap. Please > note a large part of this has been taken from Dave's reply to last iomap > doc patchset. Thanks to Dave, Darrick, Matthew, Christoph and other iomap > developers who have taken time to explain the iomap design in various emails, > commits, comments etc. > > Please note that this is not the complete iomap design doc. but a brief > overview of iomap. > > Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@xxxxxxxxx> > --- > Documentation/filesystems/index.rst | 1 + > Documentation/filesystems/iomap.txt | 289 ++++++++++++++++++++++++++++ > MAINTAINERS | 1 + > 3 files changed, 291 insertions(+) > create mode 100644 Documentation/filesystems/iomap.txt > > diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst > index 1f9b4c905a6a..c17b5a2ec29b 100644 > --- a/Documentation/filesystems/index.rst > +++ b/Documentation/filesystems/index.rst > @@ -34,6 +34,7 @@ algorithms work. > seq_file > sharedsubtree > idmappings > + iomap > > automount-support > > diff --git a/Documentation/filesystems/iomap.txt b/Documentation/filesystems/iomap.txt > new file mode 100644 > index 000000000000..4f766b129975 > --- /dev/null > +++ b/Documentation/filesystems/iomap.txt > @@ -0,0 +1,289 @@ > +Introduction > +============ > +iomap is a filesystem centric mapping layer that maps file's logical offset > +ranges to physical extents. 
It provides several iterator APIs which filesystems
> +can use for doing various file_operations, address_space_operations,
> +vm_operations, inode_operations etc. It supports APIs for doing direct-io,
> +buffered-io, lseek, dax-io, page-mkwrite, swap_activate and extent reporting
> +via fiemap.
> +
> +iomap is termed above as filesystem centric because it first calls
> +->iomap_begin() phase supplied by the filesystem to get a mapped extent and
                   ^^^^
'phase' sounds strange to me here. 'callback' or 'method'?

> +then loops over each folio within that mapped extent.
> +This is useful for filesystems because now they can allocate/reserve a much
> +larger extent at begin phase v/s the older approach of doing block allocation
> +of one block at a time by calling filesystem's provided ->get_blocks() routine.
> +
> +i.e. at a high level how iomap does write iter is [1]::
> +    user IO
> +    loop for file IO range
> +        loop for each mapped extent
> +            if (buffered) {
> +                loop for each page/folio {
> +                    instantiate page cache
> +                    copy data to/from page cache
> +                    update page cache state
> +                }
> +            } else { /* direct IO */
> +                loop for each bio {
> +                    pack user pages into bio
> +                    submit bio
> +                }
> +            }
> +        }
> +    }

I agree with Christoph that this would be better split into buffered and direct
IO parts, in particular because a combined handler like the one shown above
does not exist in iomap...

> +Motivation for filesystems to convert to iomap
> +===============================================
> +1. iomap is a modern filesystem mapping layer VFS abstraction.
> +2. It also supports large folios for buffered-writes. Large folios can help
> +improve filesystem buffered-write performance and can also improve overall
> +system performance.
> +3. Less maintenance overhead for individual filesystem maintainers.
> +iomap is able to abstract away common folio-cache related operations from the
> +filesystem to within the iomap layer itself. e.g.
allocating, instantiating, > +locking and unlocking of the folios for buffered-write operations are now taken > +care within iomap. No ->write_begin(), ->write_end() or direct_IO ^^^ of ^^ ->direct_IO > +address_space_operations are required to be implemented by filesystem using > +iomap. > + > + > +blocksize < pagesize path/large folios > +====================================== > +Large folio support or systems with large pagesize e.g 64K on Power/ARM64 and ^^^ e.g. > +4k blocksize, needs filesystems to support bs < ps paths. iomap embeds > +struct iomap_folio_state (ifs) within folio->private. ifs maintains uptodate > +and dirty bits for each subblock within the folio. Using ifs iomap can track > +update and dirty status of each block within the folio. This helps in supporting > +bs < ps path for such systems with large pagesize or with large folios [2]. > + > + > +struct iomap > +============= > +This structure defines a file mapping information of logical file offset range > +to a physical mapped extent on which an IO operation could be performed. > +An iomap reflects a single contiguous range of filesystem address space that > +either exists in memory or on a block device. > +1. The type field within iomap determines what type the range maps to e.g. > +IOMAP_HOLE, IOMAP_DELALLOC, IOMAP_UNWRITTEN etc. > + > +2. The flags field represent the state flags (e.g. IOMAP_F_*), most of which are > +set the by the filesystem during mapping time that indicates how iomap > +infrastructure should modify it's behaviour to do the right thing. > + > +3. private void pointer within iomap allows the filesystems to pass filesystem's > +private data from ->iomap_begin() to ->iomap_end() [3]. > +(see include/linux/iomap.h for more details) > + > + > +iomap operations > +================ > +iomap provides different iterator APIs for direct-io, buffered-io, lseek, > +dax-io, page-mkwrite, swap_activate and extent reporting via fiemap. 
It requires
> +various struct operations to be prepared by filesystem and to be supplied to
> +iomap iterator APIs either at the beginning of iomap api call or attaching it
> +during the mapping callback time e.g iomap_folio_ops is attached to
                                    ^^^ e.g.
> +iomap->folio_ops during ->iomap_begin() call.
> +
> +Following provides various ops to be supplied by filesystems to iomap layer for
> +doing different I/O types as discussed above.
   ^^^^^ performing
> +iomap_ops: IO interface specific operations
> +==========
> +The methods are designed to be used as pairs. The begin method creates the iomap
                                                     ^^^^^
I'd use the full name here as it's the first occurrence in the section. Hence
'iomap_begin'.

> +and attaches all the necessary state and information which subsequent iomap
> +methods & their callbacks might need. Once the iomap infrastructure has finished
> +working on the iomap it will call the end method to allow the filesystem to tear
                                          ^^^ iomap_end
> +down any unused space and/or structures it created for the specific iomap
> +context.
> +
> +Almost all iomap iterator APIs require filesystems to define iomap_ops so that
> +filesystems can be called into for providing logical to physical extent mapping,
> +wherever required. This is required by the iomap iter apis used for the
> +operations which are listed in the beginning of "iomap operations" section.
> + - iomap_begin: This either returns an existing mapping or reserve/allocates a
                                                             ^^^^^^^ reserves
> +   new mapping when called by iomap. pos and length are passed as function
> +   arguments. Filesystem returns the new mapping information within struct
> +   iomap which also gets passed as a function argument. Filesystems should
> +   provide the type of this extent in iomap->type for e.g. IOMAP_HOLE,
> +   IOMAP_UNWRITTEN and it should set the iomap->flags e.g. IOMAP_F_*
> +   (see details in include/linux/iomap.h)
> +
> +   Note that iomap_begin() call has srcmap passed as another argument.
This is > + mainly used only during the begin phase for COW mappings to identify where > + the reads are to be performed from. Filesystems needs to fill that mapping ^^ need > + information if iomap should read data for partially written blocks from a > + different location than the write target [4]. > + > + - iomap_end: Commit and/or unreserve space which was previously allocated > + using iomap_begin. During buffered-io, when a short writes occurs, > + filesystem may need to remove the reserved space that was allocated > + during ->iomap_begin. For filesystems that use delalloc allocation, we need ^^^ delayed > + to punch out delalloc extents from the range that are not dirty in the page > + cache. See comments in iomap_file_buffered_write_punch_delalloc() for more > + info [5][6]. > + > +iomap_dio_ops: Direct I/O operations structure for iomap. > +============= > +This gets passed with iomap_dio_rw(), so that iomap can call certain operations > +before submission or on completion of DIRECT_IO. ^^^ I'd use just 'direct IO' or 'DIO' > + - end_io: Required after bio completion for e.g. for conversion of unwritten ^^^ Well, the callback isn't really required AFAICT. So I'd write: 'Method called after bio completion. Can be used for example for conversion of unwritten extents." > + extents. > + > + - submit_io: This hook is required for e.g. by filesystems like btrfs who > + would like to do things like data replication for fs-handled RAID. Again somewhat hard to understand for me. How about: "Optional method to be called by iomap instead of simply calling submit_bio(). Useful for example for filesystems wanting to do data replication on submission of IO." > + > + - bio_set: This allows the filesystem to provide custom bio_set for allocating > + direct I/O bios. This will allow the filesystem who uses ->submit_io hook to ^^ which > + stash away additional information for filesystem use. 
Filesystems will
^^^
"If the filesystem provides its custom ->bi_end_io function, it needs to call
iomap_dio_bio_end_io() from that handler to handle dio completion [11]."
Also I'd move this note to the submit_io part above as it mostly relates to
that AFAIU.

> +   provide their custom ->bi_end_io function completion which should then call
> +   into iomap_dio_bio_end_io() for dio completion [11].
> +
> +iomap_writeback_ops: Writeback operations structure for iomap
> +====================
> +Writeback address space operations e.g. iomap_writepages(), requires the
> +filesystem to pass this ops field.
> + - map_blocks: map the blocks at the writeback time. This is called once per
> +   folio. Filesystems can return an existing mapping from a previous call if
> +   that mapping is still valid. This can race with paths which can invalidate
> +   previous mappings such as fallocate/truncate. Hence filesystems must have
> +   a mechanism by which it can validate if the previous mapping provided is
                         ^^ they
> +   still valid. Filesystems might need a per inode seq counter which can be
> +   used to verify if the underlying mapping of logical to physical blocks
> +   has changed since the last ->map_blocks call or not.
> +   They can then use wpc->iomap->validity_cookie to cache their seq count in
> +   ->map_blocks call [6].
> +
> + - prepare_ioend: Allows filesystems to process the extents before submission
> +   for e.g. convert COW extents to regular. This also allows filesystem to
> +   hook in a custom completion handler for processing bio completion e.g.
> +   conversion of unwritten extents.
> +   Note that ioends might need to be processed as an atomic completion unit
> +   (using transactions) when all the chained bios in the ioend have completed
> +   (e.g. for conversion of unwritten extents). iomap provides some helper
> +   methods for ioend merging and completion [12]. Look at comments in
> +   xfs_end_io() routine for more info.
> +
> + - discard_folio: In case if the filesystem has any delalloc blocks on it,
> +   then those needs to be punched out in this call. Otherwise, it may leave a
> +   stale delalloc mapping covered by a clean page that needs to be dirtied
> +   again before the delalloc mapping can be converted. This stale delalloc
> +   mapping can trip the direct I/O reads when done on the same region [7].

Here I miss an explanation of when iomap calls this callback. Apparently it is
called when mapping a folio for writeback fails for some reason.

> +iomap_folio_ops: Folio related operations structure for iomap.
> +================
> +When filesystem sets folio_ops in an iomap mapping it returns, ->get_folio()
> +and ->put_folio() will be called for each folio written to during write iter
> +time of buffered writes.
> + - get_folio: iomap will call ->get_folio() for every folio of the returned
> +   iomap mapping. Currently gfs2 uses this to start the transaction before
> +   taking the folio lock [8].
> +
> + - put_folio: iomap will call ->put_folio() once the data has been written to
                                                                      ^^^
copied into the folio for each folio of the returned iomap mapping.

> +   for each folio of the returned iomap mapping. GFS2 uses this to add data
> +   bufs to the transaction before unlocking the folio and then ending the
> +   transaction [9].
> +
> + - iomap_valid: Filesystem internal extent map can change while iomap is
                   ^^^
In case filesystem internal extent map ...

> +   iterating each folio of a cached iomap, so this hook allows iomap to detect
> +   that the iomap needs to be refreshed during a long running write operation.
> +   Filesystems can store an internal state (e.g. a sequence no.) in
> +   iomap->validity_cookie when the iomap is first mapped, to be able to detect
> +   changes between the mapping time and whenever iomap calls ->iomap_valid().
> +   This gets called with the locked folio. See iomap_write_begin() for more
> +   comments around ->iomap_valid() [10].
> + > + > +Locking > +======== > +iomap assumes two layers of locking. It requires locking above the iomap layer > +for IO serialisation (i_rwsem, invalidation lock) which is generally taken > +before calling into iomap iter functions. There is also locking below iomap for > +mapping/allocation serialisation on an inode (e.g. XFS_ILOCK or i_data_sem in > +ext4 etc) that is usually taken inside the mapping methods which filesystems > +supplied to the iomap infrastructure. This layer of locking needs to be > +independent of the IO path serialisation locking as it nests inside in the IO > +path but is also used without the filesystem IO path locking protecting it > +(e.g. in the iomap writeback path). > + > +General Locking order in iomap is: > +inode->i_rwsem (shared or exclusive) > + inode->i_mapping->invalidate_lock (exclusive) > + folio_lock() > + internal filesystem allocation lock (e.g. XFS_ILOCK or i_data_sem) > + > + > +Zeroing/Truncate Operations > +=========================== > +Filesystems can use iomap provided helper functions e.g. iomap_zero_range(), > +iomap_truncate_page() & iomap_file_unshare() for various truncate/fallocate or > +any other similar operations that requires zeroing/truncate. ^^^ require > +See above functions for more details on how these can be used by individual > +filesystems. > + > + > +Guideline for filesystem conversion to iomap > +============================================= > +The right approach is to first implement ->iomap_begin and (if necessary) > +->iomap_end to allow iomap to obtain a read-only mapping of a file range. In > +most cases, this is a relatively trivial conversion of the existing get_block() > +callback for read-only mappings. > + > +i.e. rewrite the filesystem's get_block(create = false) implementation to use ^^^ I.e. > +the new ->iomap_begin() implementation. i.e. get_block wraps around the outside ^^^ I don't understand this sentence... 
> +and converts the information from bufferhead-based map to what iomap expects. > +This will convert all the existing read-only mapping users to use the new iomap > +mapping function internally. This way the iomap mapping function can be further > +tested without needing to implement any other iomap APIs. > + > +FIEMAP operation is a really good first target because it is trivial to > +implement support for it and then to determine that the extent map iteration is > +correct from userspace. i.e. if FIEMAP is returning the correct information, ^^ I.e., > +it's a good sign that other read-only mapping operations will also do the right > +thing. > + > +Once everything is working like this, then convert all the other read-only > +mapping operations to use iomap. Done one at a time, regressions should be self > +evident. The only likely complexity at this point will be the buffered read IO > +path because of bufferheads. The buffered read IO paths doesn't need to be > +converted yet, though the direct IO read path should be converted in this phase. > + > +The next thing to do is implement get_blocks(create = true) functionality in the > +->iomap_begin/end() methods. Then convert the direct IO write path to iomap, and > +start running fsx w/ DIO enabled in earnest on filesystem. This will flush out > +lots of data integrity corner case bug that the new write mapping implementation ^^^ bugs > +introduces. > + > +(TODO - get more info on this from Dave): At this point, converting the entire > +get_blocks() path to call the iomap functions and convert the iomaps to > +bufferhead maps is possible. This will get the entire filesystem using the new ^^^ make ^^ use > +mapping functions, and they should largely be debugged and working correctly > +after this step. > + > +This now largely leaves the buffered read and write paths to be converted. 
The
> +mapping functions should all work correctly, so all that needs to be done is
> +rewriting all the code that interfaces with bufferheads to interface with iomap
> +and folios. It is rather easier first to get regular file I/O (without any
> +fancy feature like fscrypt, fsverity, data=journaling) converted to use iomap
> +and then work on directory handling conversion to iomap.

Well, I believe this comment about directories is pretty much specific to ext2.
Iomap never attempted to provide any infrastructure for metadata handling -
each filesystem is on its own here. ext2 is kind of special as it was
experimenting with handling directories more like data. In retrospect I don't
think it was a particularly successful experiment but you never know until you
try :).

> +
> +The rest is left as an exercise for the reader, as it will be different for
> +every filesystem.

								Honza
--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR