On 1/28/19 4:07 PM, Bart Van Assche wrote:
> On 1/28/19 4:56 AM, Matias Bjorling wrote:
>> Damien and I would like to propose a couple of topics centering around
>> zoned block devices:
>>
>> 1) Zoned block devices require that writes to a zone are sequential. If
>> the writes are dispatched to the device out of order, the drive rejects
>> the write with a write failure.
>>
>> So far it has been the responsibility of the deadline I/O scheduler to
>> serialize writes to zones to avoid intra-zone write command reordering.
>> This I/O-scheduler-based approach has worked so far for HDDs, but we can
>> do better for multi-queue devices. NVMe has support for multiple queues,
>> and one could dedicate a single queue to writes alone. Furthermore, that
>> queue is processed in order, enabling the host to serialize writes by
>> placing them on the queue instead of issuing them one by one. We would
>> like to gather feedback on this approach (a new HCTX_TYPE_WRITE).
>>
>> 2) Adoption of Zone Append in file-systems and user-space applications.
>>
>> A Zone Append command, together with Zoned Namespaces, is being defined
>> in the NVMe workgroup. The new command automatically directs a write to
>> the zone's write pointer position, similarly to writing to a file opened
>> with O_APPEND. On completion of such an append write, the drive returns
>> where the data was written in the zone. This provides two benefits:
>>
>> (A) It moves fine-grained logical block allocation in file-systems to
>> the device side. A file-system continues to do coarse-grained logical
>> block allocation, but the specific LBAs where data is written are
>> reported back by the device, improving file-system performance. The
>> current target is XFS, but we would like to hear about the feasibility
>> of using it in other file-systems.
>>
>> (B) It lets the host issue multiple outstanding write I/Os to a zone
>> without having to maintain I/O order, improving the performance of the
>> drive and also reducing the need for zone locking on the host side.
>>
>> Are there other use cases for this, and would an interface like this be
>> valuable in the kernel? If the interface is successful, we would expect
>> the interface to move to ATA/SCSI for standardization as well.
>
> Hi Matias,
>
> This topic proposal sounds interesting to me, but I think it is
> incomplete. Shouldn't it also be discussed how user space applications
> are expected to submit "zone append" writes? Which system call should
> e.g. fio use to submit this new type of write request? How will the
> offset at which data has been written be communicated back to user space?
>
> Thanks,
>
> Bart.

Hi Bart,

That's a good point. Originally, we only looked into support for
file-systems due to the complexity of exposing this to user-space
(e.g., we do not have an easy way to support psync/libaio workloads).

I would love for us to be able to combine this with liburing, such that
the LBA can be returned on I/O completion. However, I'm not sure we have
enough bits available in the completion entry.

-Matias
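
P.S. To make the "not enough bits" concern concrete, here is a rough
sketch, not a proposal for the actual ABI: as I read the current
patches, the completion entry carries a 64-bit user_data, a 32-bit res
and a 32-bit flags field, so a full 64-bit LBA from a zone append would
not fit without growing the entry. The append_lba field below is purely
hypothetical and only illustrates the size problem.

    #include <linux/types.h>

    /*
     * Hypothetical layout only. The first three fields mirror the
     * completion entry as currently posted; append_lba is made up
     * for this example and would grow the entry from 16 to 24 bytes.
     */
    struct io_uring_cqe_with_lba {
            __u64   user_data;      /* sqe->user_data, echoed back */
            __s32   res;            /* result / bytes transferred */
            __u32   flags;          /* completion flags */
            __u64   append_lba;     /* hypothetical: LBA where the
                                       zone append landed */
    };

Whether growing the completion entry like this, or overloading the
existing fields, would be acceptable is exactly the kind of feedback
we are after.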