<snip> > +Direct I/O > +---------- > + > +In Linux, direct I/O is defined as file I/O that is issued directly to > +storage, bypassing the pagecache. > + > +The ``iomap_dio_rw`` function implements O_DIRECT (direct I/O) reads and > +writes for files. > +An optional ``ops`` parameter can be passed to change the behavior of > +direct I/O. Did you mean "dops" iomap_dio_ops (instead of ops)? ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter, const struct iomap_ops *ops, const struct iomap_dio_ops *dops, unsigned int dio_flags, void *private, size_t done_before); 1. Also can you please explain what you meant by "change the behavior of direct-io"? 2. Do you think we should add the function declaration of iomap_dio_rw() here, given it has so many arguments? > +The ``done_before`` parameter should be set if writes have been > +initiated prior to the call. I don't think this is specific to "writes" alone. Maybe this? The ``done_before`` parameter tells the how much of the request has already been transferred. It gets used for finishing a request asynchronously when part of the request has already been complete synchronously. Maybe please also add a the link to this (for easy reference). [1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c03098d4b9ad76bca2966a8769dcfe59f7f85103 > +The direction of the I/O is determined from the ``iocb`` passed in. > + > +The ``flags`` argument can be any of the following values: Callers of iomap_dio_rw() can set the flags argument which can be any of the following values: Just a bit more descriptive ^^^ > + > + * ``IOMAP_DIO_FORCE_WAIT``: Wait for the I/O to complete even if the > + kiocb is not synchronous. Adding an example would be nice. e.g. callers might want to consider setting this flag for extending writes. > + > + * ``IOMAP_DIO_OVERWRITE_ONLY``: Allocating blocks, zeroing partial > + blocks, and extensions of the file size are not allowed. > + The entire file range must to map to a single written or unwritten ^^^ an extra to > + extent. > + This flag exists to enable issuing concurrent direct IOs with only > + the shared ``i_rwsem`` held when the file I/O range is not aligned to > + the filesystem block size. > + ``-EAGAIN`` will be returned if the operation cannot proceed. Can we please add these below details too. I would rather avoid wasting my time in searching the history about, why EXT4 does not use this flag :) Currently XFS uses this flag. EXT4 does not use it since it checks for overwrites or unaligned overwrites and uses appropriate locking up front rather than on a retry response to -EAGAIN [1] [2]. [1]: https://lore.kernel.org/linux-ext4/20230810165559.946222-1-bfoster@xxxxxxxxxx/ [2]: https://lore.kernel.org/linux-ext4/20230314130759.642710-1-bfoster@xxxxxxxxxx/ > + > + * ``IOMAP_DIO_PARTIAL``: If a page fault occurs, return whatever > + progress has already been made. > + The caller may deal with the page fault and retry the operation. Callers use ``dio_before`` argument along with ``IOMAP_DIO_PARTIAL`` to tell the iomap subsystem about how much of the requested I/O was already done. > + > +These ``struct kiocb`` flags are significant for direct I/O with iomap: > + > + * ``IOCB_NOWAIT``: Only proceed with the I/O if mapping data are > + already in memory, we do not have to initiate other I/O, and we > + acquire all filesystem locks without blocking. > + Maybe explicitly mentioning about "no block allocation"? * ``IOCB_NOWAIT``: Only proceed with the I/O if mapping data are already in memory, we do not have to initiate other I/O or do any block allocations, and we acquire all filesystem locks without blocking. > + * ``IOCB_SYNC``: Ensure that the device has persisted data to disk > + BEFORE completing the call. > + In the case of pure overwrites, the I/O may be issued with FUA > + enabled. > + > + * ``IOCB_HIPRI``: Poll for I/O completion instead of waiting for an > + interrupt. > + Only meaningful for asynchronous I/O, and only if the entire I/O can > + be issued as a single ``struct bio``. > + > + * ``IOCB_DIO_CALLER_COMP``: Try to run I/O completion from the caller's > + process context. > + See ``linux/fs.h`` for more details. > + > +Filesystems should call ``iomap_dio_rw`` from ``->read_iter`` and > +``->write_iter``, and set ``FMODE_CAN_ODIRECT`` in the ``->open`` > +function for the file. > +They should not set ``->direct_IO``, which is deprecated. > + Return value: ~~~~~~~~~~~~~ On success it will return the number of bytes transferred. On failure it will return a negative error value. Note: -ENOTBLK is a magic return value which callers may use for falling back to buffered-io. ->iomap_end()/->iomap_begin() can decide to return this magic return value if it decides to fallback to buffered-io. iomap subsystem return this value in case if it fails to invalidate the pagecache pages belonging to the direct-io range before initiating the direct-io. -EIOCBQUEUED is returned when an async direct-io request is queued for I/O. > +If a filesystem wishes to perform its own work before direct I/O > +completion, it should call ``__iomap_dio_rw``. > +If its return value is not an error pointer or a NULL pointer, the > +filesystem should pass the return value to ``iomap_dio_complete`` after > +finishing its internal work. > + > +Direct Reads > +~~~~~~~~~~~~ > + > +A direct I/O read initiates a read I/O from the storage device to the > +caller's buffer. > +Dirty parts of the pagecache are flushed to storage before initiating > +the read io. > +The ``flags`` value for ``->iomap_begin`` will be ``IOMAP_DIRECT`` with > +any combination of the following enhancements: > + > + * ``IOMAP_NOWAIT``: Read if mapping data are already in memory. > + Does not initiate other I/O or block on filesystem locks. > + > +Callers commonly hold ``i_rwsem`` in shared mode. > + > +Direct Writes > +~~~~~~~~~~~~~ > + > +A direct I/O write initiates a write I/O to the storage device to the > +caller's buffer. to the storage device "from" the caller's buffer. > +Dirty parts of the pagecache are flushed to storage before initiating > +the write io. > +The pagecache is invalidated both before and after the write io. > +The ``flags`` value for ``->iomap_begin`` will be ``IOMAP_DIRECT | > +IOMAP_WRITE`` with any combination of the following enhancements: > + > + * ``IOMAP_NOWAIT``: Write if mapping data are already in memory. > + Does not initiate other I/O or block on filesystem locks. > + * ``IOMAP_OVERWRITE_ONLY``: Allocating blocks and zeroing partial > + blocks is not allowed. > + The entire file range must to map to a single written or unwritten > + extent. > + The file I/O range must be aligned to the filesystem block size. > + > +Callers commonly hold ``i_rwsem`` in shared or exclusive mode. > + > +struct iomap_dio_ops: > +~~~~~~~~~~~~~~~~~~~~~ > +.. code-block:: c > + > + struct iomap_dio_ops { > + void (*submit_io)(const struct iomap_iter *iter, struct bio *bio, > + loff_t file_offset); > + int (*end_io)(struct kiocb *iocb, ssize_t size, int error, > + unsigned flags); > + struct bio_set *bio_set; > + }; > + > +The fields of this structure are as follows: > + > + - ``submit_io``: iomap calls this function when it has constructed a > + ``struct bio`` object for the I/O requested, and wishes to submit it > + to the block device. > + If no function is provided, ``submit_bio`` will be called directly. > + Filesystems that would like to perform additional work before (e.g. > + data replication for btrfs) should implement this function. > + > + - ``end_io``: This is called after the ``struct bio`` completes. > + This function should perform post-write conversions of unwritten > + extent mappings, handle write failures, etc. > + The ``flags`` argument may be set to a combination of the following: > + > + * ``IOMAP_DIO_UNWRITTEN``: The mapping was unwritten, so the ioend > + should mark the extent as written. > + > + * ``IOMAP_DIO_COW``: Writing to the space in the mapping required a > + copy on write operation, so the ioend should switch mappings. > + > + - ``bio_set``: This allows the filesystem to provide a custom bio_set > + for allocating direct I/O bios. > + This enables filesystems to `stash additional per-bio information > + <https://lore.kernel.org/all/20220505201115.937837-3-hch@xxxxxx/>`_ > + for private use. > + If this field is NULL, generic ``struct bio`` objects will be used. > + > +Filesystems that want to perform extra work after an I/O completion > +should set a custom ``->bi_end_io`` function via ``->submit_io``. > +Afterwards, the custom endio function must call > +``iomap_dio_bio_end_io`` to finish the direct I/O. > + > +DAX I/O > +------- > + > +Storage devices that can be directly mapped as memory support a new > +access mode known as "fsdax". Added a comma before "support" for better readability. Storage devices that can be directly mapped as memory, support a new access mode known as "fsdax". > + > +fsdax Reads > +~~~~~~~~~~~ > + > +A fsdax read performs a memcpy from storage device to the caller's > +buffer. > +The ``flags`` value for ``->iomap_begin`` will be ``IOMAP_DAX`` with any > +combination of the following enhancements: > + > + * ``IOMAP_NOWAIT``: Read if mapping data are already in memory. > + Does not initiate other I/O or block on filesystem locks. > + > +Callers commonly hold ``i_rwsem`` in shared mode. > + > +fsdax Writes > +~~~~~~~~~~~~ > + > +A fsdax write initiates a memcpy to the storage device to the caller's "from" the storage device > +buffer. > +The ``flags`` value for ``->iomap_begin`` will be ``IOMAP_DAX | > +IOMAP_WRITE`` with any combination of the following enhancements: > + > + * ``IOMAP_NOWAIT``: Write if mapping data are already in memory. > + Does not initiate other I/O or block on filesystem locks. > + > + * ``IOMAP_OVERWRITE_ONLY``: Allocating blocks and zeroing partial > + blocks is not allowed. > + The entire file range must to map to a single written or unwritten > + extent. > + The file I/O range must be aligned to the filesystem block size. > + > +Callers commonly hold ``i_rwsem`` in exclusive mode. > + > +mmap Faults > +~~~~~~~~~~~ > + > +The ``dax_iomap_fault`` function handles read and write faults to fsdax > +storage. > +For a read fault, ``IOMAP_DAX | IOMAP_FAULT`` will be passed as the > +``flags`` argument to ``->iomap_begin``. > +For a write fault, ``IOMAP_DAX | IOMAP_FAULT | IOMAP_WRITE`` will be > +passed as the ``flags`` argument to ``->iomap_begin``. > + > +Callers commonly hold the same locks as they do to call their iomap > +pagecache counterparts. > + > +Truncation, fallocate, and Unsharing > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > + > +For fsdax files, the following functions are provided to replace their > +iomap pagecache I/O counterparts. > +The ``flags`` argument to ``->iomap_begin`` are the same as the > +pagecache counterparts, with ``IOMAP_DIO`` added. with "IOMAP_DAX" > + > + * ``dax_file_unshare`` > + * ``dax_zero_range`` > + * ``dax_truncate_page`` Shall we mention "dax_remap_file_range_prep()/dax_dedupe_file_range_compare()" ? > + > +Callers commonly hold the same locks as they do to call their iomap > +pagecache counterparts. > + Stopping here for now. Will resume the rest of the document review later. -ritesh