Re: [PATCH v7 00/11] Introduce Zone Append for writing to zoned block devices

Damien Le Moal <Damien.LeMoal@xxxxxxx> · Sun, 19 Apr 2020 22:51:14 +0000

On 2020/04/18 10:01, Theodore Y. Ts'o wrote:
> On Fri, Apr 17, 2020 at 05:48:20PM +0000, Johannes Thumshirn wrote:
>> For "userspace's responsibility", I'd re-phrase this as "a consumer's 
>> responsibility", as we don't have an interface which aims at user-space 
>> yet. The only consumer this series implements is zonefs, although we did 
>> have an AIO implementation for early testing and io_uring shouldn't be 
>> too hard to implement.
> 
> Ah, I had assumed that userspace interface exposed would be opening
> the block device with the O_APPEND flag.  (Which raises interesting
> questions if the block device is also opened without O_APPEND and some
> other thread was writing to the same zone, in which case the order in
> which requests are processed would control whether the I/O would
> fail.)

O_APPEND has no effect on raw block device files since the file size is always
0. While we did use this flag initially for quick tests of a user space
interface, it was a hack. Any proper implementation of a user space interface
will probably need a new RWF_ flag that can be passed to AIOs (io_submit() and
io_uring) and to preadv2()/pwritev2() calls.

As for the case of one application doing regular writes and another doing zone
append writes to the same zone, you are correct, there will be errors. But not
for the zone append writes: these will all succeed as long as the zone is not
full since, by definition, they do not need the current zone write pointer and
always append at the zone's current wp, wherever it is. Most of the regular
writes will likely fail since, without synchronization between the applications,
the write pointer of the target zone would constantly change under the issuer
of the regular writes, even if that issuer does a report zones before each
write operation.
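
To make the difference concrete, here is a minimal user space sketch of the two
write semantics. It is a toy model only, not code from this series: struct
zone_model, zone_write() and zone_append() are names made up for illustration,
and the error handling is reduced to a boolean.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model of one sequential-write-required zone; not kernel code. */
struct zone_model {
	uint64_t start;	/* first sector of the zone */
	uint64_t len;	/* zone size in sectors */
	uint64_t wp;	/* write pointer, as an absolute sector */
};

/* Regular write: the caller names the sector, and it must equal the wp. */
static bool zone_write(struct zone_model *z, uint64_t sector, uint64_t nr)
{
	assert(z->wp >= z->start && z->wp <= z->start + z->len);
	if (sector != z->wp || nr > z->start + z->len - z->wp)
		return false;	/* not at the wp, or no room left: I/O error */
	z->wp += nr;
	return true;
}

/*
 * Zone append: the caller names only the zone. The data lands at the
 * current wp and the sector actually written is reported back.
 */
static bool zone_append(struct zone_model *z, uint64_t nr, uint64_t *written)
{
	assert(z->wp >= z->start && z->wp <= z->start + z->len);
	if (nr > z->start + z->len - z->wp)
		return false;	/* not enough room left in the zone */
	*written = z->wp;
	z->wp += nr;
	return true;
}
```

In this model, if writer A samples the wp, writer B then calls zone_append(),
and A finally calls zone_write() at the sampled sector, A's write fails because
the wp has moved, while B's append succeeds regardless of what A does.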

There is no automatic synchronization in the kernel for this and we do not
intend to add any: such a bad use case is similar to two unsynchronized writers
issuing regular writes to the same zone. This cannot work correctly without
mutual exclusion in the I/O issuing path, and that is the responsibility of the
user, be it an application process or an in-kernel component.
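
As a rough illustration of what such mutual exclusion could look like on the
user side, here is a sketch with two threads sharing one zone. Everything in it
is an assumption made for the example (struct locked_zone, device_write() and
run_two_writers() are hypothetical names, and the zone is a toy model rather
than a real device). The point is only that sampling the wp and issuing the
write happen under the same lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy zone shared by several writers; the lock is owned by the users. */
struct locked_zone {
	pthread_mutex_t lock;
	uint64_t len;	/* zone size in sectors */
	uint64_t wp;	/* write pointer, relative to the zone start */
};

struct writer_args {
	struct locked_zone *z;
	unsigned int count;	/* number of 1-sector writes to issue */
	unsigned int failed;	/* number of writes that failed */
};

/* Stand-in for the device: a regular write must land exactly at the wp. */
static bool device_write(struct locked_zone *z, uint64_t sector, uint64_t nr)
{
	assert(z->wp <= z->len);
	if (sector != z->wp || nr > z->len - z->wp)
		return false;
	z->wp += nr;
	return true;
}

static void *writer_thread(void *arg)
{
	struct writer_args *a = arg;

	for (unsigned int i = 0; i < a->count; i++) {
		/* Sample the wp and issue the write as one critical section. */
		pthread_mutex_lock(&a->z->lock);
		if (!device_write(a->z, a->z->wp, 1))
			a->failed++;
		pthread_mutex_unlock(&a->z->lock);
	}
	return NULL;
}

/* Run two writers on the same zone; return the number of failed writes. */
static unsigned int run_two_writers(struct locked_zone *z, unsigned int count)
{
	struct writer_args a = { z, count, 0 }, b = { z, count, 0 };
	pthread_t ta, tb;

	if (pthread_create(&ta, NULL, writer_thread, &a))
		return 2 * count;	/* count everything as failed */
	if (pthread_create(&tb, NULL, writer_thread, &b)) {
		pthread_join(ta, NULL);
		return 2 * count;
	}
	pthread_join(ta, NULL);
	pthread_join(tb, NULL);
	return a.failed + b.failed;
}
```

With the lock held across both steps, no write in this model can see a stale
wp. Dropping the lock between sampling the wp and issuing the write brings back
the failures described above.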

As Johannes pointed out, once BIOs are submitted, the kernel does guarantee
ordered dispatching of writes per zone with zone write locking (mq-deadline).

-- 
Damien Le Moal
Western Digital Research