On Wed, Jun 26, 2019 at 11:10:17AM -0500, Goldwyn Rodrigues wrote:
> On 8:39 26/06, Christoph Hellwig wrote:
> > On Tue, Jun 25, 2019 at 02:14:42PM -0500, Goldwyn Rodrigues wrote:
> > > > I can't say I'm a huge fan of this two iomaps in one method call
> > > > approach.  I always thought two separate iomap iterations would be
> > > > nicer, but compared to that even the older hack with just the
> > > > additional src_addr seems a little better.
> > > 
> > > I am just expanding on your idea of using multiple iterations for the
> > > COW case in the hope we can come out with a good design:
> > > 
> > > 1. iomap_file_buffered_write calls iomap_apply with the IOMAP_WRITE
> > >    flag, which calls iomap_begin() for the respective filesystem.
> > > 2. btrfs_iomap_begin() sets up iomap->type as IOMAP_COW and fills the
> > >    iomap struct with read addr information.
> > > 3. iomap_apply() conditionally for IOMAP_COW calls do_cow() (a new
> > >    function) and calls ops->iomap_begin() with the flag
> > >    IOMAP_COW_READ_DONE (a new flag).
> > > 4. btrfs_iomap_begin() fills up the iomap structure with write
> > >    information.
> > > 
> > > Step 3 seems out of place because iomap_apply should be iomap.type
> > > agnostic, right?  Should we be adding another flag IOMAP_COW_DONE,
> > > just to figure out that this is the "real" write for iomap_begin to
> > > fill the iomap?
> > > 
> > > If this is not how you imagined it, could you elaborate on the dual
> > > iteration sequence?
> > 
> > Here are my thoughts from dealing with this a while ago, all XFS based
> > of course.
> > 
> > If iomap_file_buffered_write is called on a page that is inside a COW
> > extent we have the following options:
> > 
> >  a) the page is uptodate or entirely overwritten.  We can just allocate
> >     new COW blocks and return them, and we are done.
> >  b) the page is not/partially uptodate and not entirely overwritten.
> > 
> > The latter case is the interesting one.  My thought was that iff the
> > IOMAP_F_SHARED flag is set, __iomap_write_begin / iomap_read_page_sync
> > will then have to retrieve the source information in some form.
> > 
> > My original plan was to just do a nested iomap_apply call, which would
> > need a special nested flag to not duplicate any locking the file
> > system might be holding between ->iomap_begin and ->iomap_end.
> > 
> > The upside here is that there is no additional overhead for the non-COW
> > path and the architecture looks relatively clean.  The downside is that
> > at least for XFS we usually have to look up the source information
> > anyway before allocating the COW destination extent, so we'd have to
> > cache that information somewhere or redo the lookup, which would be
> > rather pointless.  At that point the idea of a srcaddr in the iomap
> > becomes interesting again - while it looks a little ugly from the
> > architectural POV it actually ends up having very practical benefits.

I think it's less complicated to pass both mappings out in a single
->iomap_begin call rather than have this dance where the fs tells iomap
to call back for the read mapping and iomap then does so with a special
"don't take locks" flag.  For XFS specifically this means we can serve
both mappings with a single ILOCK cycle.

> So, do we move back to the design of adding an extra field of srcaddr?

TLDR: Please no.

> Honestly, I find the design of using an extra field srcaddr in iomap
> better and simpler versus passing an additional iomap srcmap or doing
> multiple iterations.
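For reference, here is roughly what the "pass a second iomap" variant
could look like at the ops level.  This is only a sketch to make the
comparison concrete; the srcmap parameter name and the convention that a
fs doing a plain overwrite simply leaves it untouched are my assumptions
for illustration, not anything settled in this thread:

	/*
	 * Sketch only: ->iomap_begin fills out @iomap with the write
	 * (destination) mapping as it does today, and additionally fills
	 * out @srcmap with the mapping to read from when an unaligned
	 * write has to RMW blocks that live somewhere else.  A fs doing a
	 * straight overwrite leaves @srcmap alone, so the common path
	 * pays nothing extra.
	 */
	struct iomap_ops {
		int (*iomap_begin)(struct inode *inode, loff_t pos,
				loff_t length, unsigned flags,
				struct iomap *iomap, struct iomap *srcmap);
		int (*iomap_end)(struct inode *inode, loff_t pos,
				loff_t length, ssize_t written,
				unsigned flags, struct iomap *iomap);
	};

The buffered write path could then hand srcmap straight to
__iomap_write_begin for the read side instead of making a second
->iomap_begin round trip.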
Putting my long-range planning hat on, the usage story (from the fs'
perspective) here is: "iomap wants to know how a file write should map
to a disk write.  If we're doing a straight overwrite of disk blocks
then I should send back the relevant mapping.  Sometimes I might need
the write to go to a totally different location than where the data is
currently stored, so I need to send back a second mapping."

Because iomap is now a general-purpose API, we need to think about the
read mapping for a moment:

- For all disk-based filesystems we'll want the source address for the
  read mapping.

- For filesystems that support "inline data" (which really just means
  the fs maintains its own buffers to file data) we'll also need the
  inline_data pointer.

- For filesystems that support multiple devices (like btrfs) we'll also
  need a pointer to a block_device because we could be writing to a
  different device than the one that stores the data.  The prime example
  I can think of is reading data from disk A in some RAID stripe and
  writing to disk B in a different RAID stripe to solve the RAID5
  hole... but you could just be lazy-migrating file data to less full or
  newer drives or whatever.

- If we ever encounter a filesystem that supports multiple dax devices
  then we'll need a pointer to the dax_device too.  (That would be
  btrfs, since I thought your larger goal was to enable dax there...)

- We're probably going to need the ability to pass flags for the read
  mapping at some point or another, so we need that too.

From this, you can see that we need half the fields in the existing
struct iomap, and that's how I arrived at the idea of passing to
iomap_begin pointers to two iomaps instead of bolting these five fields
into struct iomap.

In XFS parlance (where the data fork stores mappings for on-disk data
and the cow fork stores mappings for copy-on-write staging extents),
this means that our iomap_begin data paths remain fairly
straightforward:

	xfs_bmapi_read(ip, offset, XFS_DATA_FORK, &imap...);
	xfs_bmapi_read(ip, offset, XFS_COW_FORK, &cmap...);
	xfs_bmbt_to_iomap(ip, iomap, &imap...);
	iomap->type = IOMAP_COW;
	xfs_bmbt_to_iomap(ip, srcmap, &cmap...);

(It's more complicated than that, but that's approximately what we do
now.)

> Also, should we add another iomap type IOMAP_COW, or (re)use the flag
> IOMAP_F_SHARED during writes?  IOW iomap type vs iomap flag.

I think they need to remain distinct because the IOMAP_F_SHARED flag
means that the storage is shared amongst multiple owners (which
iomap_fiemap reports to userspace), whereas the IOMAP_COW type means
that RMW is required for unaligned writes.  btrfs has a strong tendency
to require copy writes, but that doesn't mean the extent is necessarily
shared.

> Dave/Darrick, what are your thoughts?

I liked where the first 3 patches of this series were heading. :)

--D

> 
> -- 
> Goldwyn