Re: stop using ioends for direct write completions

"Darrick J. Wong" <darrick.wong@xxxxxxxxxx> · Mon, 1 Feb 2016 13:54:07 -0800

On Fri, Jan 29, 2016 at 09:12:33AM -0500, Brian Foster wrote:
> On Thu, Jan 28, 2016 at 05:16:56AM -0800, Christoph Hellwig wrote:
> > Any chance to get a review for this?  It should really help
> > with sorting out the buffered I/O COW code.
> > 
> 
> I haven't taken a closer look at the patch yet. I was kind of waiting
> for Dave to chime in because I'm a little confused over the back and
> forth nature of dio/ioend completion lately.
> 
> My understanding is that the original requirement for ioends here was to
> track the state necessary in order to defer (to wq) completions that had
> to allocate transactions. Eventually, the deferred buffer state was
> implemented and we no longer required an ioend for that, so we removed
> the ioends here:
> 
>   2ba6623 xfs: don't allocate an ioend for direct I/O completions
> 
> Then just a couple months later, we merged the following to re-add the
> ioend into the dio path:
> 
>   d5cc2e3f xfs: DIO needs an ioend for writes
> 
> I recall reviewing that one, but looking back afaict the ioend was used
> simply to enable reuse of the ioend completion code. Now we have this
> patch which presumably removes much of that code to eliminate the ioend
> allocation overhead.
> 
> Neither this patch nor the previous has any technical reasoning for one
> approach over the other in the commit logs, so afaics this appears to be
> a matter of preference. Can we get some agreement (or discussion) on
> what the right interface is to transfer information to dio completion?
> E.g., is this allocation overhead noticeable? Is ioend usage problematic
> for other reasons (such as copy-on-write)? Going back to the previous
> patch, were there explicit reasons for using ioends aside from code
> reuse? Note that the subsequent commit 6dfa1b67 ("xfs: handle DIO
> overwrite EOF update completion correctly") does refer to some problems
> not running dio completions in the right context... Dave?

I don't know the reason /for/ using ioends, but I'll share an argument in favor
of Christoph's patch to remove them:

Each ioend has a type code which determines what we do at the end of the IO.
For buffered writes, we can create as many ioends as necessary to handle all
the different dispositions of all the blocks in the range of dirty pages.

For directio, however, the requirements are a little different -- we want to be
able to say "change all the extents in this part of the file from unwritten to
written", and to be able to change i_size after a write.  We can reuse the
ioend structure to encode these desires until CoW came along.  Now we also want
to be able to say "remap all extents in this part of a file if the write
succeeds".  Since we can only have one type code per ioend, either we have to
hide extra offset/size fields in the ioend for this purpose, add a flags field,
revise the code to create a bunch of ioends for each disposition type,
unconditionally try to remap blocks for any reflink'd inode, or do what
Christoph proposes, which is to stop (ab)using ioends and simply encode status
flags into dio->private.

Christoph's approach seems like the easiest, since we're already provided
with the offset and size, and both the unwritten extent conversion and the
CoW remap code know how to iterate the data fork to look for the extents
they want to alter.

The overhead probably goes down once we get rid of the memory allocation,
but the primary purpose is to make life easier for CoW to avoid unnecessary
overhead.

Also: I've noticed that if I write 8MB file to a file backed by a failing
device, xfs_end_io_direct_write() gets called with size == 8MB even if the
writes failed.  If the write was to an unwritten extent, the extent will be
converted to a regular extent even though the blocks contain old file contents
and other garbage, which can subsequently be read off the disk.

For CoW I'd only want to remap the blocks if the write totally succeeds.  If a
disk error happens, the EIO code gets passed back to userspace, and it seems
logical to me that the file contents should remain unchanged.

To that end, I'm changing dio_iodone_t to pass an error code, and I think I
want to change XFS to avoid doing unwritten extent conversions when the write
fails.  Does that make sense?

--D

> 
> Brian
> 
> > _______________________________________________
> > xfs mailing list
> > xfs@xxxxxxxxxxx
> > http://oss.sgi.com/mailman/listinfo/xfs
> 
> _______________________________________________
> xfs mailing list
> xfs@xxxxxxxxxxx
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs