Re: [PATCH 5/5] nvme: support for zoned namespaces

On 17.06.2020 07:29, Damien Le Moal wrote:
On 2020/06/17 16:11, Javier González wrote:
On 17.06.2020 06:54, Damien Le Moal wrote:
On 2020/06/17 15:18, Javier González wrote:
On 17.06.2020 00:38, Damien Le Moal wrote:
On 2020/06/17 1:13, Javier González wrote:
On 16.06.2020 09:07, Keith Busch wrote:
On Tue, Jun 16, 2020 at 05:55:26PM +0200, Javier González wrote:
On 16.06.2020 08:48, Keith Busch wrote:
On Tue, Jun 16, 2020 at 05:02:17PM +0200, Javier González wrote:
This depends very much on how the FS / application is managing
striping. At the moment our main use case is enabling user-space
applications submitting I/Os to raw ZNS devices through the kernel.

Can we enable this use case to start with?

I think this already provides that. You can set the nsid value to
whatever you want in the passthrough interface, so a namespace block
device is not required to issue I/O to a ZNS namespace from user space.
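
As an illustration of the point above, here is a minimal user-space sketch of
an NVMe passthrough read in which the caller picks the nsid. The struct layout
and ioctl number are meant to mirror struct nvme_passthru_cmd and
NVME_IOCTL_IO_CMD from <linux/nvme_ioctl.h>, assuming the asm-generic ioctl
encoding (x86, arm64). The helper names, the 4 KiB block size and the device
path in the usage note are assumptions made for the example, not something
taken from this thread.

```python
import ctypes
import fcntl
import os
import struct

# Field order of struct nvme_passthru_cmd: opcode, flags, rsvd1, nsid, cdw2,
# cdw3, metadata, addr, metadata_len, data_len, cdw10..cdw15, timeout_ms, result.
PASSTHRU_FMT = "<BBHIIIQQII6III"

# _IOWR('N', 0x43, struct nvme_passthru_cmd) with the asm-generic encoding.
NVME_IOCTL_IO_CMD = (3 << 30) | (struct.calcsize(PASSTHRU_FMT) << 16) | (ord("N") << 8) | 0x43

NVME_CMD_READ = 0x02  # NVM command set Read opcode


def build_read_cmd(nsid, slba, nblocks, buf_addr, buf_len):
    """Pack a Read command aimed at namespace `nsid` (chosen by the caller)."""
    return struct.pack(
        PASSTHRU_FMT,
        NVME_CMD_READ, 0, 0, nsid, 0, 0,
        0, buf_addr, 0, buf_len,
        slba & 0xFFFFFFFF,        # cdw10: starting LBA, low 32 bits
        slba >> 32,               # cdw11: starting LBA, high 32 bits
        (nblocks - 1) & 0xFFFF,   # cdw12: number of blocks, zero-based
        0, 0, 0,
        0, 0,
    )


def submit_read(dev_path, nsid, slba, nblocks, block_size=4096):
    """Issue the read through the ioctl; returns (status, data).

    A non-zero status is the NVMe status reported by the driver; errors
    from the kernel itself surface as OSError.
    """
    buf = ctypes.create_string_buffer(nblocks * block_size)
    cmd = bytearray(build_read_cmd(nsid, slba, nblocks, ctypes.addressof(buf), len(buf)))
    fd = os.open(dev_path, os.O_RDONLY)
    try:
        status = fcntl.ioctl(fd, NVME_IOCTL_IO_CMD, cmd)
    finally:
        os.close(fd)
    return status, buf.raw
```

A call such as submit_read("/dev/nvme0n1", nsid=2, slba=0, nblocks=8) would
open one namespace node and address another namespace, which is the situation
described above. It needs sufficient privileges and a device where nsid 2 is
attached.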

Mmmmm. The problem now is that the check in the nvme driver prevents the
ZNS namespace from being initialized. Am I missing something?

Hm, okay, it may not work for you. We need the driver to create at least
one namespace so that we have tags and request_queue. If you have that,
you can issue IO to any other attached namespace through the passthrough
interface, but we can't assume there is an available namespace.

That makes sense for now.

The next step for us is to enable a passthrough on uring, making sure
that I/Os do not split.

Passthrough as in "application issues NVMe commands directly", like SG_IO
with SCSI? Or do you mean raw block device file accesses by the application,
meaning that the IO goes through the block IO stack as opposed to going directly
to the driver?

For the latter case, I do not think it is possible to guarantee that an IO will
not get split unless we are talking about single page IOs (e.g. 4K on X86). See
a somewhat similar request here and comments about it.

https://www.spinics.net/lists/linux-block/msg55079.html

At the moment we are doing the former, but it looks like a hack to me to
go directly to the NVMe driver.

That is what the nvme driver ioctl() is for, no? An application can send an NVMe
command directly to the driver with it. That is not a hack, but the regular way
of doing passthrough for NVMe, isn't it?

We have enabled it through uring to get async() passthru submission.
It looks like a hack at the moment, but we might just send an RFC to have
something concrete to base the discussion on.

Yes, that would clarify things.

I was thinking that we could enable the second path by making use of
chunk_sectors and limiting the I/O size just as the append_max_io_size
does. Is this completely the wrong way of looking at it?
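
To make the chunk_sectors idea concrete, here is a rough model of the boundary
arithmetic: a request that goes through the block layer is cut so that no
piece crosses a chunk_sectors boundary or exceeds the maximum request size.
This is a sketch with hypothetical helper names, assuming a power-of-two
chunk_sectors. It is not the kernel code, and it only describes regular block
I/O, not passthrough commands.

```python
def max_sectors_at(offset, chunk_sectors, max_sectors):
    """Largest request allowed at `offset` (in sectors) without crossing a chunk boundary."""
    if chunk_sectors == 0:
        # No chunk boundary configured: only the size limit applies.
        return max_sectors
    to_boundary = chunk_sectors - (offset & (chunk_sectors - 1))
    return min(to_boundary, max_sectors)


def split_io(offset, nr_sectors, chunk_sectors, max_sectors):
    """Cut one I/O into (offset, length) pieces that respect both limits."""
    pieces = []
    while nr_sectors > 0:
        n = min(nr_sectors, max_sectors_at(offset, chunk_sectors, max_sectors))
        pieces.append((offset, n))
        offset += n
        nr_sectors -= n
    return pieces
```

For example, with chunk_sectors = 8, an 8-sector I/O starting at sector 6 is
cut into a 2-sector piece up to the boundary followed by a 6-sector piece.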

The block layer cannot limit the size of a passthrough command since the command
is protocol specific and the block layer is a protocol independent interface.

Agree. This work depends on the application being aware of a max I/O size
at the moment. Down the road, we will remove (or at least limit a lot)
this constraint for ZNS devices that can eventually cache out-of-order
I/Os.

I/Os with a data buffer all need mapping for DMA, no matter the device
functionalities or the command being executed. With passthrough, I do not think
it is possible to have the block layer limit anything. It will likely always be
pass-or-fail. With passthrough, the application needs to understand what it is
doing.
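
Since nothing below the application will split a passthrough command, one way
an application might size its own commands is sketched here. It assumes the
limit is derived from the controller's MDTS field (a power-of-two multiple of
the minimum memory page size, with 0 meaning no limit) and that the writes
target a single zone at its write pointer. The function names are
hypothetical, and the limit enforced in practice may be lower than MDTS, so
treat this as an illustration rather than a guarantee that a command will map.

```python
def max_transfer_bytes(mdts, mpsmin=0):
    """Maximum data transfer size in bytes, or None when MDTS reports no limit.

    The minimum memory page size is 2^(12 + MPSMIN) bytes.
    """
    if mdts == 0:
        return None
    return (1 << mdts) * (1 << (12 + mpsmin))


def plan_zone_writes(write_pointer, zone_end, nr_lbas, lba_size, max_bytes):
    """Return (start_lba, nr_lbas) commands, each within the size limit.

    The commands are meant to be issued in order, starting at the write
    pointer. `zone_end` is the first LBA past the writable part of the zone.
    """
    if write_pointer + nr_lbas > zone_end:
        raise ValueError("write would cross the end of the zone")
    per_cmd = nr_lbas if max_bytes is None else max_bytes // lba_size
    if nr_lbas > 0 and per_cmd == 0:
        raise ValueError("limit is smaller than one logical block")
    cmds = []
    lba = write_pointer
    remaining = nr_lbas
    while remaining > 0:
        n = min(remaining, per_cmd)
        cmds.append((lba, n))
        lba += n
        remaining -= n
    return cmds
```

With MDTS = 5 and 4 KiB pages the limit is 128 KiB, so an 80-block write of
4 KiB blocks at LBA 100 becomes three commands of 32, 32 and 16 blocks.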

Yes. It is definitely for applications that directly implement
zone-aware logic.



SCSI SG does not split passthrough requests; it cannot. For passthrough
commands, the command buffer can either be DMA-mapped or it cannot. If mapping
succeeds, the command is issued. If it does not, the command fails. At least,
that is my understanding of how the stack works.

I am not familiar with SCSI SG. This looks like how the ioctl() passthru
works in NVMe, but as mentioned above, we would like to enable an
async() passthru path.

That is done with bsg for SCSI I believe. You may want to have a look around
there. The SG driver used to have the write() system call mapped to "issuing a
command" and read() for "getting a command result". That was removed however.
But I think bsg has a replacement for that defunct async passthrough interface.
Not sure. I have not looked at that for a while.


Thanks for the pointer; I was not aware of this. We will look into it.

Thanks again for the help Damien!
Javier


