On 2020/06/17 15:18, Javier González wrote:
> On 17.06.2020 00:38, Damien Le Moal wrote:
>> On 2020/06/17 1:13, Javier González wrote:
>>> On 16.06.2020 09:07, Keith Busch wrote:
>>>> On Tue, Jun 16, 2020 at 05:55:26PM +0200, Javier González wrote:
>>>>> On 16.06.2020 08:48, Keith Busch wrote:
>>>>>> On Tue, Jun 16, 2020 at 05:02:17PM +0200, Javier González wrote:
>>>>>>> This depends very much on how the FS / application is managing
>>>>>>> striping. At the moment, our main use case is enabling user-space
>>>>>>> applications to submit I/Os to raw ZNS devices through the kernel.
>>>>>>>
>>>>>>> Can we enable this use case to start with?
>>>>>>
>>>>>> I think this already provides that. You can set the nsid value to
>>>>>> whatever you want in the passthrough interface, so a namespace block
>>>>>> device is not required to issue I/O to a ZNS namespace from user space.
>>>>>
>>>>> Mmmmm. The problem now is that the check in the nvme driver prevents
>>>>> the ZNS namespace from being initialized. Am I missing something?
>>>>
>>>> Hm, okay, it may not work for you. We need the driver to create at least
>>>> one namespace so that we have tags and a request_queue. If you have that,
>>>> you can issue I/O to any other attached namespace through the passthrough
>>>> interface, but we can't assume there is an available namespace.
>>>
>>> That makes sense for now.
>>>
>>> The next step for us is to enable passthrough on io_uring, making sure
>>> that I/Os do not split.
>>
>> Passthrough as in "the application issues NVMe commands directly", like
>> SG_IO for SCSI? Or do you mean raw block device file accesses by the
>> application, meaning that the I/O goes through the block I/O stack as
>> opposed to going directly to the driver?
>>
>> For the latter case, I do not think it is possible to guarantee that an
>> I/O will not get split unless we are talking about single-page I/Os
>> (e.g. 4K on x86). See a somewhat similar request here and the comments
>> about it:
>>
>> https://www.spinics.net/lists/linux-block/msg55079.html
>
> At the moment we are doing the former, but it looks like a hack to me to
> go directly to the NVMe driver.

That is what the nvme driver ioctl() is for, no? An application can send an
NVMe command directly to the driver with it. That is not a hack, but the
regular way of doing passthrough for NVMe, isn't it?

> I was thinking that we could enable the second path by making use of
> chunk_sectors and limiting the I/O size just as append_max_io_size does.
> Is this completely the wrong way of looking at it?

The block layer cannot limit the size of a passthrough command since the
command is protocol specific and the block layer is a protocol-independent
interface. SCSI SG does not split passthrough requests; it cannot. For a
passthrough command, the command buffer either can be DMA-mapped or it
cannot. If mapping succeeds, the command is issued; if it does not, the
command fails. At least, that is my understanding of how the stack works.

> Thanks,
> Javier
>
> _______________________________________________
> linux-nvme mailing list
> linux-nvme@xxxxxxxxxxxxxxxxxxx
> http://lists.infradead.org/mailman/listinfo/linux-nvme

-- 
Damien Le Moal
Western Digital Research