Re: [PATCH] Documentation: document the design of iomap and how to port

Dave Chinner <david@xxxxxxxxxxxxx> · Tue, 11 Jun 2024 11:32:31 +1000

On Mon, Jun 10, 2024 at 02:59:28PM -0700, Darrick J. Wong wrote:
> On Mon, Jun 10, 2024 at 04:18:08PM +0200, Jan Kara wrote:
> > On Sun 09-06-24 08:55:06, Darrick J. Wong wrote:
> > >        * validity_cookie is a magic freshness value set by the
> > >          filesystem that should be used to detect stale mappings. For
> > >          pagecache operations this is critical for correct operation
> > >          because page faults can occur, which implies that filesystem
> > >          locks should not be held between ->iomap_begin and
> > >          ->iomap_end. Filesystems with completely static mappings
> > >          need not set this value. Only pagecache operations
> > >          revalidate mappings.
> > > 
> > >          XXX: Should fsdax revalidate as well?
> > 
> > AFAICT no. DAX is more like using direct IO for everything. So no writeback
> > changing mapping state behind your back (and that's the only thing that is
> > not serialized with i_rwsem or invalidate_lock). Maybe this fact can be
> > mentioned somewhere around the discussion of iomap_valid() as a way how
> > locking usually works out?
> 
> <nod> I'll put that in the section about iomap_valid, which documents
> the whole mechanism more thoroughly:
> 
> "iomap_valid: The filesystem may not hold locks between ->iomap_begin
> and ->iomap_end because pagecache operations can take folio locks, fault
> on userspace pages, initiate writeback for memory reclamation, or engage
> in other time-consuming actions.  If a file's space mapping data are
> mutable, it is possible that the mapping for a particular pagecache
> folio can change in the time it takes to allocate, install, and lock
> that folio.
> 
> "For the pagecache, races can happen if writeback doesn't take i_rwsem
> or invalidate_lock and updates mapping information.  Races can also
> happen if the filesytem allows concurrent writes.  For such files, the
> mapping *must* be revalidated after the folio lock has been taken so
> that iomap can manage the folio correctly.
> 
> "fsdax does not need this revalidation because there's no writeback and
> no support for unwritten extents.
> 
> "The filesystem's ->iomap_begin function must sample a sequence counter
> into struct iomap::validity_cookie at the same time that it populates
> the mapping fields.  It must then provide a ->iomap_valid function to
> compare the validity cookie against the source counter and return
> whether or not the mapping is still valid.  If the mapping is not valid,
> the mapping will be sampled again."

Everything is fine except for the last paragraph. Filesystems do not
need to use a sequence counter for the validity cookie - they can
use any mechanism they want to determine that the extent state has
changed. If they can track extent status on a fine grained basis
(e.g. ext4 has individual extent state cache entries) then that
validity cookie could be a pointer to the filesystem's internal
extent state cache entry.

There's also nothing that says the cookie needs to remain a u64;
I've been toying with replacing it with a pointer to an xfs iext
btree cursor and checking that the cursor still points to the same
extent that iomap was made from. Hence we get fine grained extent
checks and don't have to worry about changes outside the range of
the iomap causing spurious revalidation failures.

IOWs, the intention of the cookie is simply an opaque blob that the
filesystem can keep whatever validity information in it that it
wants - a sequence counter is just one of many possible
implementations. hence I think this should be rephrased to reflect
this.

"The filesystem's ->iomap_begin function must write the data it
needs to validate the iomap into struct iomap::validity_cookie at
the same time that it populates the mapping fields.  It must then
provide a ->iomap_valid function to compare the validity cookie
against the source data and return whether or not the mapping is
still valid.  If the mapping is not valid, the mapping will be
sampled again.

A simple validation cookie implementation is a sequence counter. If
the filesystem bumps the sequence counter every time it modifies the
inode's extent map, it can be placed in the struct
iomap::validity_cookie during ->iomap_begin. If the value in the
cookie is found to be different to the value the filesystem holds
when it is passed back to ->iomap_valid, then the iomap should
considered stale and the validation failed."

> 
> > >    These struct kiocb flags are significant for buffered I/O with
> > >    iomap:
> > > 
> > >        * IOCB_NOWAIT: Only proceed with the I/O if mapping data are
> > >          already in memory, we do not have to initiate other I/O, and
> > >          we acquire all filesystem locks without blocking. Neither
> > >          this flag nor its definition RWF_NOWAIT actually define what
> > >          this flag means, so this is the best the author could come
> > >          up with.
> > 
> > RWF_NOWAIT is a performance feature, not a correctness one, hence the
> > meaning is somewhat vague. It is meant to mean "do the IO only if it
> > doesn't involve waiting for other IO or other time expensive operations".
> > Generally we translate it to "don't wait for i_rwsem, page locks, don't do
> > block allocation, etc." OTOH we don't bother to specialcase internal
> > filesystem locks (such as EXT4_I(inode)->i_data_sem)

But we have support for this - the IOMAP_NOWAIT flag tells the
filesystem iomap methods that it should:

	a) not block on internal mapping operations; and
	b) fail with -EAGAIN if it can't map the entire range in a
	   single iomap.

XFS implements these semantics for internal locks and operations
needed for mapping operations, and any filesystem that uses iomap
-should- be doing the same thing.

> > and we get away with
> > it because blocking on it under constraints we generally perform RWF_NOWAIT
> > IO is exceedingly rare.
> 
> I hate this flag's undocumented nature.  It now makes *documenting*
> things around it hard.  How about:

But it is documented.

RWF_NOWAIT (since Linux 4.14)

      Do not wait for data which is not immediately available.  If
      this flag is specified, the preadv2() system call will return
      instantly if  it  would have  to  read data from the backing
      storage or wait for a lock. [...]

Yes, not well documented. But the *intent* is clear. It's basically
the same thing as O_NONBLOCK:

O_NONBLOCK or O_NDELAY
      When  possible,  the file is opened in nonblocking mode.
      Neither the open() nor any subsequent I/O operations on the
      file descriptor which is returned will cause the calling
      process to wait.

That's what we are supposed to be implementing with IOCB_NOWAIT -
don't block the submitting task if at all possible.

> "IOCB_NOWAIT: Neither this flag nor its associated definition
> RWF_NOWAIT actually specify what this flag means.  Community
> members seem to think that it means only proceed with the I/O if
> it doesn't involve waiting for expensive operations.  XFS and ext4
> appear to reject the IO unless the mapping data are already in
> memory, the filesystem does not have to initiate other I/O, and
> the kernel can acquire all filesystem locks without blocking."

I think the passive-aggressive nature of this statement doesn't read
very well. Blaming "community members" for organic code development
that solved poorly defined, undocumented bleeding edge issues isn't
a winning strategy. And other developers don't care about this; they
just want to know what they should be implementing.

So let's just make a clear statement about the intent of
IOCB_NOWAIT, because that is *always* been very clear right from the
start, even though there was no way we could implement the intent
right from the start.

"IOCB_NOWAIT: Perform a best effort attempt to avoid any operation
that would result in blocking the submitting task. This is similar
in intent to O_NONBLOCK for network APIs - it is intended for
asynchronous applications to keep doing other work instead of
waiting for the specific unavailable filesystem resource to become
available.

This means the filesystem implementing IOCB_NOWAIT semantics need to
use trylock algorithms.  They need to be able to satisfy the entire
IO request range in a single iomap mapping. They need to avoid
reading or writing metadata synchronously. They need to avoid
blocking memory allocations. They need to avoid waiting on
transaction reservations to allow modifications to take place. And
so on.

If there is any doubt in the filesystem developer's mind as to
whether any specific IOCB_NOWAIT operation may end up blocking, then
they should return -EAGAIN as early as possible rather than start
the operation and force the submitting task to block."

This clearly documents the intent and what the submitters are
expecting from filesystems when this flag is set. IT also tells
filesystem implementers the way to handle IOCB_NOWAIT if they
haven't actually coded any of this - abort at the top of the IO
stack with -EAGAIN - and applications using this will work
correctly.

> > >     Direct Writes
> > > 
> > >    A direct I/O write initiates a write I/O to the storage device to
> > >    the caller's buffer. Dirty parts of the pagecache are flushed to
> > >    storage before initiating the write io. The pagecache is
> > >    invalidated both before and after the write io. The flags value
> > >    for ->iomap_begin will be IOMAP_DIRECT | IOMAP_WRITE with any
> > >    combination of the following enhancements:
> > > 
> > >        * IOMAP_NOWAIT: Write if mapping data are already in memory.
> > >          Does not initiate other I/O or block on filesystem locks.

IOMAP_NOWAIT is not specific to direct IO. It is supported for both
buffered reads and writes in iomap and XFS.

It also tends to imply "don't allocate new blocks" for journalling
filesysetms because that requires journal space reservation (which
blocks), memory allocation, metadata creation and metadata IO.

> > >        * IOMAP_OVERWRITE_ONLY: Allocating blocks and zeroing partial
> > >          blocks is not allowed. The entire file range must to map to
> > 							     ^^ extra "to"
> > 
> > >          a single written or unwritten extent. The file I/O range
> > >          must be aligned to the filesystem block size.

> > This seems to be XFS specific thing? At least I don't see anything in
> > generic iomap code depending on this?
> 
> Hmm.  XFS bails out if the mapping is unwritten and the directio write
> range isn't aligned to the fsblock size.

I think that is wrong. IOMAP_OVERWRITE_ONLY is specifically for
allowing unaligned IO to be performed without triggering allocation
or sub-block zeroing. It requests a mapping to allow a pure
overwrite to be performed, otherwise it fails.

XFS first it checks if allocation is required. If so, it bails.

Second it checks if the extent spans the range requested. If not, it
bails.

Finally, it checks if the extent is IOMAP_WRITTEN. If not, it bails.

> I think the reason for that is
> because we'd have to zero the unaligned regions outside of the write
> range, and xfs can't do that without synchronizing.  (Or we didn't think

Right, it enables an optimistic fast path for unaligned direct
IOs. We take the i_rwsem shared, attempt the fast path, if it is
rejected with -EAGAIN we drop the shared lock, take it exclusive
and run the IO again without IOMAP_DIO_OVERWRITE_ONLY being set
to allow allocation and/or sub-block zeroing to be performed.

I think this needs to be written in terms of what a "pure overwrite"
is. We use that term repeatedly in the iomap code (e.g. for FUA
optimisations), and IOMAP_OVERWRITE_ONLY is a mechanism for the
caller to ask "give me a pure overwrite mapping for this range or
fail with -EAGAIN". This is the mechanism that provides the required
IOMAP_DIO_OVERWRITE_ONLY semantics.

"Pure Overwrite: A write operation that does not require any
metadata or zeroing operations to perform during either submission
or completion. This implies that the fileystem must have already
allocated space on disk as IOMAP_WRITTEN and the filesystem must not
place any constaints on IO alignment or size - the only constraints
on IO alignment are device level (minimum IO size and alignment,
typically sector size).

IOMAP_DIO_OVERWRITE_ONLY: Perform a pure overwrite for this range or
fail with -EAGAIN.

This can be used by filesystems with complex unaligned IO
write paths to provide an optimised fast path for unaligned writes.
If a pure overwrite can be performed, then serialisation against
other IOs to the same filesystem block(s) is unnecessary as there is
no risk of stale data exposure or data loss.

If a pure overwrite cannot be performed, then the filesystem can
perform the serialisation steps needed to provide exclusive access
to the unaligned IO range so that it can perform allocation and
sub-block zeroing safely.

IOMAP_OVERWRITE_ONLY: The caller requires a pure overwrite to be
performed from this mapping. This requires the filesystem extent
mapping to already exist as an IOMAP_MAPPED type and span the entire
range of the write IO request. If the filesystem cannot map this
request in a way that allows the iomap infrastructure to perform
a pure overwrite, it must fail the mapping operation with -EAGAIN."

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx