Re: [PATCH 5/5] nvme: support for zoned namespaces

Damien Le Moal <Damien.LeMoal@xxxxxxx> · Wed, 17 Jun 2020 07:29:29 +0000

On 2020/06/17 16:11, Javier González wrote:
> On 17.06.2020 06:54, Damien Le Moal wrote:
>> On 2020/06/17 15:18, Javier González wrote:
>>> On 17.06.2020 00:38, Damien Le Moal wrote:
>>>> On 2020/06/17 1:13, Javier González wrote:
>>>>> On 16.06.2020 09:07, Keith Busch wrote:
>>>>>> On Tue, Jun 16, 2020 at 05:55:26PM +0200, Javier González wrote:
>>>>>>> On 16.06.2020 08:48, Keith Busch wrote:
>>>>>>>> On Tue, Jun 16, 2020 at 05:02:17PM +0200, Javier González wrote:
>>>>>>>>> This depends very much on how the FS / application is managing
>>>>>>>>> striping. At the moment our main use case is enabling user-space
>>>>>>>>> applications to submit I/Os to raw ZNS devices through the kernel.
>>>>>>>>>
>>>>>>>>> Can we enable this use case to start with?
>>>>>>>>
>>>>>>>> I think this already provides that. You can set the nsid value to
>>>>>>>> whatever you want in the passthrough interface, so a namespace block
>>>>>>>> device is not required to issue I/O to a ZNS namespace from user space.
>>>>>>>
>>>>>>> Mmmmm. Problem now is that the check on the nvme driver prevents the ZNS
>>>>>>> namespace from being initialized. Am I missing something?
>>>>>>
>>>>>> Hm, okay, it may not work for you. We need the driver to create at least
>>>>>> one namespace so that we have tags and request_queue. If you have that,
>>>>>> you can issue IO to any other attached namespace through the passthrough
>>>>>> interface, but we can't assume there is an available namespace.
>>>>>
>>>>> That makes sense for now.
>>>>>
>>>>> The next step for us is to enable a passthrough on uring, making sure
>>>>> that I/Os do not split.
>>>>
>>>> Passthrough as in "application directly issues NVMe commands", like SG_IO
>>>> with SCSI? Or do you mean raw block device file accesses by the application,
>>>> meaning that the IO goes through the block IO stack as opposed to going
>>>> directly to the driver?
>>>>
>>>> For the latter case, I do not think it is possible to guarantee that an IO will
>>>> not get split unless we are talking about single page IOs (e.g. 4K on X86). See
>>>> a somewhat similar request here and comments about it.
>>>>
>>>> https://www.spinics.net/lists/linux-block/msg55079.html
>>>
>>> At the moment we are doing the former, but it looks like a hack to me to
>>> go directly to the NVMe driver.
>>
>> That is what the nvme driver ioctl() is for, no? An application can send an NVMe
>> command directly to the driver with it. That is not a hack, but the regular way
>> of doing passthrough for NVMe, isn't it?
> 
> We have enabled it through uring to get async passthru submission.
> It looks like a hack at the moment, but we might just send an RFC to have
> something concrete to base the discussion on.

Yes, that would clarify things.

>>> I was thinking that we could enable the second path by making use of
>>> chunk_sectors and limiting the I/O size just as the append_max_io_size
>>> does. Is this the completely wrong way of looking at it?
>>
>> The block layer cannot limit the size of a passthrough command since the command
>> is protocol specific and the block layer is a protocol independent interface.
> 
> Agree. This work depends on the application being aware of a max I/O size
> at the moment. Down the road, we will remove (or at least limit a lot)
> this constraint for ZNS devices that can eventually cache out-of-order
> I/Os.

I/Os with a data buffer all need mapping for DMA, regardless of the device's
capabilities or the command being executed. With passthrough, I do not think
it is possible to have the block layer limit anything. It will likely always be
pass-or-fail. With passthrough, the application needs to understand what it is
doing.
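
To make the pass-or-fail point concrete, here is a minimal sketch (not taken
from the patch series) of an application preparing an NVMe Read for the
existing ioctl() passthrough and rejecting oversized requests itself. The
helper name build_read_cmd and the lba_size and max_bytes parameters are
hypothetical; max_bytes is assumed to be a limit the application has derived
on its own, for example from the controller's MDTS. Only struct
nvme_passthru_cmd and NVME_IOCTL_IO_CMD come from <linux/nvme_ioctl.h>.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <linux/nvme_ioctl.h>   /* struct nvme_passthru_cmd, NVME_IOCTL_IO_CMD */

#define NVME_CMD_READ 0x02      /* NVM command set Read opcode */

/*
 * Hypothetical helper: fill *cmd with a Read of len bytes starting at slba.
 * Nothing below the application will split the command, so the size check
 * happens here. Returns 0 on success, -1 if the request cannot be issued
 * as a single command.
 */
int build_read_cmd(struct nvme_passthru_cmd *cmd, uint32_t nsid,
                   uint64_t slba, void *buf, uint32_t len,
                   uint32_t lba_size, uint32_t max_bytes)
{
    uint32_t nlb;

    assert(cmd != NULL);

    /* Length must be a non-zero multiple of the LBA size and within the
     * limit the application believes the device accepts (assumption). */
    if (lba_size == 0 || len == 0 || len % lba_size != 0 || len > max_bytes)
        return -1;

    nlb = len / lba_size;
    if (nlb > 65536)            /* NLB is a 16-bit, zero-based field */
        return -1;

    memset(cmd, 0, sizeof(*cmd));
    cmd->opcode   = NVME_CMD_READ;
    cmd->nsid     = nsid;
    cmd->addr     = (uint64_t)(uintptr_t)buf;
    cmd->data_len = len;
    cmd->cdw10    = (uint32_t)(slba & 0xffffffffu);  /* SLBA, low 32 bits  */
    cmd->cdw11    = (uint32_t)(slba >> 32);          /* SLBA, high 32 bits */
    cmd->cdw12    = nlb - 1;                         /* zero-based count   */
    return 0;
}
```

The application would then submit the command with something like
ioctl(fd, NVME_IOCTL_IO_CMD, &cmd) on an open nvme device node. If the buffer
cannot be mapped, the ioctl fails as a whole; it is never split.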

> 
>> SCSI SG does not split passthrough requests, it cannot. For passthrough
>> commands, the command buffer can be dma-mapped or it cannot. If mapping
>> succeeds, the command is issued. If it cannot, the command is failed. At least,
>> that is my understanding of how the stack is working.
> 
> I am not familiar with SCSI SG. This looks like how the ioctl() passthru
> works in NVMe, but as mentioned above, we would like to enable an
> async passthru path.

That is done with bsg for SCSI, I believe. You may want to have a look around
there. The SG driver used to have the write() system call mapped to "issuing a
command" and read() for "getting a command result". That was removed however.
But I think bsg has a replacement for that defunct async passthrough interface.
Not sure. I have not looked at that for a while.

> 
> Thanks,
> Javier
> 

-- 
Damien Le Moal
Western Digital Research