Re: [PATCH v10 02/41] iomap: support REQ_OP_ZONE_APPEND

Christoph Hellwig <hch@xxxxxxxxxxxxx> · Tue, 24 Nov 2020 11:29:53 +0000

On Tue, Nov 10, 2020 at 10:55:06AM -0800, Darrick J. Wong wrote:
> When we're wanting to use a ZONE_APPEND command, the @iomap structure
> has to have IOMAP_F_ZONE_APPEND set in iomap->flags, iomap->type is set
> to IOMAP_MAPPED, but what should iomap->addr be set to?
> 
> I gather from what I see in zonefs and the relevant NVME proposal that
> iomap->addr should be set to the (byte) address of the zone we want to
> append to?  And if we do that, then bio->bi_iter.bi_sector will be set
> to sector address of iomap->addr, right?

Yes.

> Then when the IO completes, the block layer sets bio->bi_iter.bi_sector
> to wherever the drive told it that it actually wrote the bio, right?

Yes.
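
To make the two answers above concrete, here is a minimal user-space
model of the sector handling, not kernel code.  struct model_bio and
the model_* helpers are hypothetical stand-ins; only the idea that
bi_sector holds the zone start on submission and the written location
on completion comes from the discussion above.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SHIFT 9	/* 512-byte sectors, matching the kernel's value */

/* Hypothetical stand-in for the one bio field discussed here. */
struct model_bio {
	uint64_t bi_sector;	/* models bio->bi_iter.bi_sector */
};

/* Submission: point the bio at the start of the zone to append to. */
static void model_prepare_append(struct model_bio *bio,
				 uint64_t zone_start_bytes)
{
	/* The zone start must be sector aligned. */
	assert((zone_start_bytes & ((1ULL << SECTOR_SHIFT) - 1)) == 0);
	bio->bi_sector = zone_start_bytes >> SECTOR_SHIFT;
}

/* Completion: the device reports where the data actually landed. */
static void model_complete_append(struct model_bio *bio,
				  uint64_t written_sector)
{
	bio->bi_sector = written_sector;
}

/* Byte address the file system would record for the written data. */
static uint64_t model_written_bytes(const struct model_bio *bio)
{
	return bio->bi_sector << SECTOR_SHIFT;
}
```

The point of the model is that the sector set at submission only names
the zone; the value read back after completion is the one to record.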

> If that's true, then that implies that need_zeroout must always be false
> for an append operation, right?  Does that also mean that the directio
> request has to be aligned to an fs block and not just the sector size?

I think so, yes.

> Can userspace send a directio append that crosses a zone boundary?  If
> so, what happens if a direct append to a lower address fails but a
> direct append to a higher address succeeds?

Userspace doesn't know about zone boundaries.  It can send I/O larger
than a zone, but the file system has to split it into multiple I/Os
just like when it has to cross an AG boundary in XFS.
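
As a rough illustration of that split, here is a user-space sketch, not
taken from the patch.  It assumes a fixed zone size and a linear
mapping from file offset to zone, which a real file system need not
have; struct model_chunk and model_split_at_zones() are hypothetical
names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One piece of the original request that stays inside a single zone. */
struct model_chunk {
	uint64_t off;
	uint64_t len;
};

/*
 * Split [off, off + len) at multiples of zone_size.  Writes at most
 * max chunks to out and returns how many were written.
 */
static size_t model_split_at_zones(uint64_t off, uint64_t len,
				   uint64_t zone_size,
				   struct model_chunk *out, size_t max)
{
	size_t n = 0;

	assert(zone_size > 0);
	while (len > 0 && n < max) {
		/* Bytes left before the next zone boundary. */
		uint64_t room = zone_size - (off % zone_size);
		uint64_t this_len = len < room ? len : room;

		out[n].off = off;
		out[n].len = this_len;
		n++;

		off += this_len;
		len -= this_len;
	}
	return n;
}
```

Each chunk would then be issued as its own append, so a failure in one
chunk is reported separately from the others.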

> I'm also vaguely wondering how to communicate the write location back to
> the filesystem when the bio completes?  btrfs handles the bio completion
> completely so it doesn't have a problem, but for other filesystems
> (cough future xfs cough) either we'd have to add a new callback for
> append operations; or I guess everyone could hook the bio endio.
> 
> Admittedly that's not really your problem, and for all I know hch is
> already working on this.

I think any non-trivial file system needs to override the bio completion
handler for writes anyway, so this seems reasonable.  It might be worth
documenting, though.
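
For the documentation, something along these lines might help.  This
is a user-space model of overriding the completion handler, not kernel
code: the field names mirror bi_end_io and bi_private from struct bio,
but struct model_bio, struct model_extent and both functions are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct model_bio;
typedef void (*model_end_io_t)(struct model_bio *bio);

/* Hypothetical stand-in for the bio fields involved in completion. */
struct model_bio {
	uint64_t	bi_sector;	/* models bio->bi_iter.bi_sector */
	model_end_io_t	bi_end_io;	/* completion handler set by the fs */
	void		*bi_private;	/* fs-owned context */
};

/* Hypothetical per-write record kept by the file system. */
struct model_extent {
	uint64_t	start_sector;
	int		valid;
};

/* File system handler: pick up the location the device chose. */
static void model_fs_append_end_io(struct model_bio *bio)
{
	struct model_extent *ext = bio->bi_private;

	ext->start_sector = bio->bi_sector;
	ext->valid = 1;
}

/* Models the block layer: store the written sector, then call back. */
static void model_device_complete(struct model_bio *bio,
				  uint64_t written_sector)
{
	assert(bio->bi_end_io != NULL);
	bio->bi_sector = written_sector;
	bio->bi_end_io(bio);
}
```

With a handler like this no new callback is needed; the file system
reads the sector out of the bio it already owns.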