Re: [PATCH 2/2] zonefs: use zone-append for AIO as well

Damien Le Moal <Damien.LeMoal@xxxxxxx> · Mon, 27 Jul 2020 03:12:47 +0000

On 2020/07/24 22:58, Kanchan Joshi wrote:
> On Wed, Jul 22, 2020 at 8:22 PM Christoph Hellwig <hch@xxxxxx> wrote:
>>
>> On Wed, Jul 22, 2020 at 12:43:21PM +0000, Johannes Thumshirn wrote:
>>> On 21/07/2020 07:54, Christoph Hellwig wrote:
>>>> On Mon, Jul 20, 2020 at 04:48:50PM +0000, Johannes Thumshirn wrote:
>>>>> On 20/07/2020 15:45, Christoph Hellwig wrote:
>>>>>> On Mon, Jul 20, 2020 at 10:21:18PM +0900, Johannes Thumshirn wrote:
>>>>>>> On a successful completion, the position the data is written to is
>>>>>>> returned via AIO's res2 field to the calling application.
>>>>>>
>>>>>> That is a major, and except for this changelog, undocumented ABI
>>>>>> change.  We had the whole discussion about reporting append results
>>>>>> in a few threads and the issues with that in io_uring.  So let's
>>>>>> have that discussion there and don't mix it up with how zonefs
>>>>>> writes data.  Without that a lot of the boilerplate code should
>>>>>> also go away.
>>>>>>
>>>>>
>>>>> OK maybe I didn't remember correctly, but wasn't this all around
>>>>> io_uring and how we'd report the location back for raw block device
>>>>> access?
>>>>
>>>> Report the write offset.  The author seems to be hell bent on making
>>>> it block device specific, but that is a horrible idea as it is just
>>>> as useful for normal file systems (or zonefs).
> 
> Patchset only made the feature opt-in, due to the constraints that we
> had. ZoneFS was always considered and it fits as fine as block-IO.
> You already know that  we did not have enough room in io-uring, which
> did not really allow to think of other FS (any-size cached-writes).
> After working on multiple schemes in io_uring, now we have 64bits, and
> we will return absolute offset in bytes now (in V4).
> 
> But still, it comes at the cost of sacrificing the ability to do
> short-write, which is fine for zone-append but may trigger
> behavior-change for regular file-append.
> Write may become short if
> - spanning beyond end-of-file

For an O_APPEND/RWF_APPEND write, the file offset written is exactly *at* EOF.
There is no "write spanning EOF": the write always lands completely beyond EOF.
This is not a problem, this is the normal behavior of append writes to regular
files.
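
To make the "exactly at EOF" point concrete, here is a minimal userspace
sketch for the synchronous case only. The helper name append_and_locate() is
made up for illustration. It relies on write() with O_APPEND first moving the
file offset to EOF and then advancing it by the number of bytes written, so
the start of the data can be recovered with lseek() as long as nothing else
uses the same descriptor. It says nothing about how an AIO or io_uring
completion would carry the offset, which is the interface still to be defined.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: append len bytes to path and report the file offset
 * at which they landed. Returns the number of bytes written, or -1 on error.
 */
ssize_t append_and_locate(const char *path, const void *buf, size_t len,
			  off_t *written_off)
{
	ssize_t n;
	off_t end;
	int fd;

	assert(path && buf && written_off);

	fd = open(path, O_WRONLY | O_APPEND | O_CREAT, 0644);
	if (fd < 0)
		return -1;

	/* With O_APPEND, the offset is set to EOF before the data is written. */
	n = write(fd, buf, len);
	if (n >= 0) {
		/* The offset now points just past our data: start = end - n. */
		end = lseek(fd, 0, SEEK_CUR);
		if (end < 0)
			n = -1;
		else
			*written_off = end - n;
	}

	close(fd);
	return n;
}
```

Calling this twice on a new file, first with 3 bytes and then with 2 bytes,
reports offsets 0 and 3. The lseek() step is not available for several AIOs
in flight on one descriptor, hence the need to report the offset on completion.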

> - going beyond RLIMIT_FSIZE limit
> - probably for MAX_NON_LFS as well

These limits apply to all write operations, regardless of zone append being used
or not.

> 
> We need to fail all above cases if we extend the current model for
> regular FS. And that may break existing file-append users.
> Class of applications which just append without caring about the exact
> location - attempt was not to affect these while we try to enable the
> path for zone-append.

It seems like you are confusing the interface semantics with its
implementation... For a regular POSIX compliant file system, the implementation
of asynchronous append IOs to a file has to comply with the POSIX-defined
behavior, regardless of the underlying command used for issuing the writes to
the device. We have btrfs running in the lab using zone append for *all* file
data writes, and that does not change the behavior of any system call. xfstests
still pass. (Note: patches coming soon).

Implemented correctly, the written offset reporting change will also be backward
compatible for regular file systems: applications using O_APPEND/RWF_APPEND AIOs
to regular files today will continue working. The io_uring interface should be
backward compatible too. How to do that must first be fleshed out: we need to
clarify the interface and its semantics first. Then the implementation will
naturally follow on solid ground.

For the interface semantics, 3 cases must be considered:

(1) Regular files: the only change is that the written *file offset* is returned
to the application in the completion path. No other change is needed. From the
application's perspective, the asynchronous append writes will still result in
the same *file* data layout as before, that is, data is written sequentially at
the end of the file, in the same order as the AIOs are issued by the application.

(2) zonefs: This is not a POSIX file system, that is, *file* data placement is
directly dependent on device data placement. This means that for asynchronous
append writes, we need different behaviors:
  (a) Writes at the end of the file without O_APPEND/RWF_APPEND: the data must
be written exactly at the application specified offset, which excludes the use
of zone append writes.
  (b) Append writes with O_APPEND/RWF_APPEND: The plan is to use zone append for
these, with the result that file data may not end up being written in the order
in which the AIOs were issued. The other semantic change is that an AIO that is
too large will be failed. A write spanning the file zone capacity will be short,
and any append write to a file whose zone is already full will be failed (the
file maximum size is already reached when the zone is full).

(3) block device files: O_APPEND/RWF_APPEND is meaningless for these. So the
problems start here: this needs to be enabled in a sensible way for zoned block
devices to mean "the application wants to do a zone append". There should not be
any change for regular block devices. From there, the IO behavior is the same as
for zonefs case (2b) above.

Note: I may be forgetting some points in this list above. We need to complete
this into a coherent specification, including io_uring interface, and get
linux-aio and linux-api ACK to proceed.
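
As a starting point for such a specification, the append cases (1), (2b) and
(3) can be written down as a small model. This is only a sketch of the
semantics described above, not kernel code: every name in it is hypothetical,
"eof" stands for the current file size (or the write position within the zone
for a zoned block device), and "max_append" is an assumed per-request size
limit standing in for "too large". Which error is returned on failure is left
open on purpose. Case (2a), writes at an application-specified offset, is not
modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* All names here are hypothetical; this models semantics only. */
enum target { REGULAR_FILE, ZONEFS_FILE, ZONED_BDEV };
enum outcome { WRITE_OK, WRITE_SHORT, WRITE_FAILED };

struct append_result {
	enum outcome outcome;
	uint64_t offset;	/* reported written offset, unless failed */
	uint64_t written;	/* bytes written */
	bool in_issue_order;	/* data layout follows AIO issuing order */
};

struct append_result model_append(enum target t, uint64_t eof,
				  uint64_t capacity, uint64_t max_append,
				  uint64_t len)
{
	struct append_result r = { WRITE_FAILED, 0, 0, t == REGULAR_FILE };

	assert(len > 0);

	if (t == REGULAR_FILE) {
		/* Case (1): same layout as today, the offset is just reported. */
		r.outcome = WRITE_OK;
		r.offset = eof;
		r.written = len;
		return r;
	}

	/* Cases (2b) and (3): too large, or zone already full, is a failure. */
	if (len > max_append || eof >= capacity)
		return r;

	/* A write spanning the zone capacity is truncated to a short write. */
	r.offset = eof;
	r.written = (len > capacity - eof) ? capacity - eof : len;
	r.outcome = (r.written < len) ? WRITE_SHORT : WRITE_OK;
	return r;
}
```

For example, an 8192 B append to a zonefs file at eof = 4096 with a zone
capacity of 8192 gives a short write of 4096 B at offset 4096 in this model,
while the same append to a regular file completes in full at offset 4096.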

> 
> Patches use O/RWF_APPEND, but try to isolate appending-write
> (IOCB_APPEND) from appending-write-that-returns-location
> (IOCB_ZONE_APPEND - can be renamed when we actually have all that it
> takes to apply the feature in regular FS).

And back to Christoph's point: this isolation is not necessary. For an append
asynchronous write, we can return the written *file offset* location for all cases.

> Enabling block-IO and zoneFS now, and keeping regular-FS as future
> work - hope that does not sound too bad!

Implementing the written offset reporting interface will be done in the generic
VFS upper layer, and that will naturally enable regular file systems too. This
should not be left as future work, especially if you consider zonefs, since that
is a file system (not a regular one, but the interface is the same as that of a
regular file system).

>>> After having looked into io_uring I don't think there is anything that
>>> prevents io_uring from picking up the write offset from ki_complete's
>>> res2 argument. As of now io_uring ignores the field but that can be
>>> changed.
> 
> We use ret2 of ki_complete to collect append-offset in io_uring too.
> It's just that unlike aio it required some work to send it to user-space.
> 
> 
> --
> Kanchan Joshi
> 

-- 
Damien Le Moal
Western Digital Research