Re: [PATCH 5/5] nvme: support for zoned namespaces

Javier González <javier@xxxxxxxxxxx> · Wed, 17 Jun 2020 08:18:14 +0200

On 17.06.2020 00:38, Damien Le Moal wrote:
On 2020/06/17 1:13, Javier González wrote:
On 16.06.2020 09:07, Keith Busch wrote:
On Tue, Jun 16, 2020 at 05:55:26PM +0200, Javier González wrote:
On 16.06.2020 08:48, Keith Busch wrote:
On Tue, Jun 16, 2020 at 05:02:17PM +0200, Javier González wrote:
This depends very much on how the FS / application is managing
striping. At the moment our main use case is enabling user-space
applications submitting I/Os to raw ZNS devices through the kernel.

Can we enable this use case to start with?

I think this already provides that. You can set the nsid value to
whatever you want in the passthrough interface, so a namespace block
device is not required to issue I/O to a ZNS namespace from user space.
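
For illustration only, here is a rough sketch of how a user-space tool might
fill in the nsid field of a passthrough command. It assumes the 72-byte
struct nvme_passthru_cmd layout from <linux/nvme_ioctl.h> and the usual
_IOWR encoding found on x86 and arm; the helper name build_read_cmd and the
4096-byte block size are hypothetical. It only builds the command bytes and
does not issue the ioctl.

```python
import struct

# Assumed layout of struct nvme_passthru_cmd (<linux/nvme_ioctl.h>):
# opcode, flags, rsvd1, nsid, cdw2, cdw3, metadata, addr,
# metadata_len, data_len, cdw10..cdw15, timeout_ms, result.
# "=" means native byte order with standard sizes and no padding.
NVME_PASSTHRU_CMD = struct.Struct("=BBHIIIQQII6III")


def _iowr(type_char, nr, size):
    """Encode an _IOWR ioctl number (generic layout used on x86 and arm)."""
    ioc_read_write = 3  # _IOC_READ | _IOC_WRITE
    return (ioc_read_write << 30) | (size << 16) | (ord(type_char) << 8) | nr


# NVME_IOCTL_IO_CMD is _IOWR('N', 0x43, struct nvme_passthru_cmd).
NVME_IOCTL_IO_CMD = _iowr("N", 0x43, NVME_PASSTHRU_CMD.size)

NVME_CMD_READ = 0x02  # NVM command set Read opcode


def build_read_cmd(nsid, slba, nblocks, buf_addr, block_size=4096):
    """Pack a Read command aimed at namespace `nsid` (hypothetical helper)."""
    return NVME_PASSTHRU_CMD.pack(
        NVME_CMD_READ, 0, 0,      # opcode, flags, rsvd1
        nsid, 0, 0,               # nsid selects the target namespace
        0, buf_addr,              # metadata pointer, data pointer
        0, nblocks * block_size,  # metadata_len, data_len
        slba & 0xFFFFFFFF,        # cdw10: starting LBA, low 32 bits
        slba >> 32,               # cdw11: starting LBA, high 32 bits
        (nblocks - 1) & 0xFFFF,   # cdw12: number of blocks, zero-based
        0, 0, 0,                  # cdw13..cdw15
        0, 0,                     # timeout_ms, result
    )
```

The packed bytes would then be handed to ioctl() with NVME_IOCTL_IO_CMD on
an NVMe device node that the driver has already set up.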

Mmmmm. Problem now is that the check on the nvme driver prevents the ZNS
namespace from being initialized. Am I missing something?

Hm, okay, it may not work for you. We need the driver to create at least
one namespace so that we have tags and request_queue. If you have that,
you can issue IO to any other attached namespace through the passthrough
interface, but we can't assume there is an available namespace.

That makes sense for now.

The next step for us is to enable passthrough on io_uring, making sure
that I/Os do not split.

Passthrough as in "the application issues NVMe commands directly", like SG_IO
with SCSI? Or do you mean raw block device file accesses by the application,
meaning that the IO goes through the block IO stack as opposed to going directly
to the driver?

For the latter case, I do not think it is possible to guarantee that an IO will
not get split unless we are talking about single page IOs (e.g. 4K on x86). See
a somewhat similar request here and comments about it.

https://www.spinics.net/lists/linux-block/msg55079.html

At the moment we are doing the former, but it looks like a hack to me to
go directly to the NVMe driver.

I was thinking that we could enable the second path by making use of
chunk_sectors and limit the I/O size just as the append_max_io_size
does. Is this the completely wrong way of looking at it?
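
To make the idea concrete, here is a simplified model of the arithmetic,
not the block layer code itself. It assumes an I/O is cut only at
chunk_sectors boundaries and at a maximum size; the names split_io and
max_sectors are made up for the example, and other split causes such as
segment limits are ignored.

```python
def split_io(start, nr_sectors, chunk_sectors, max_sectors):
    """Return the (start, length) pieces a simplified splitter would produce.

    A piece never crosses a chunk_sectors boundary and never exceeds
    max_sectors. All values are in sectors.
    """
    pieces = []
    while nr_sectors > 0:
        # Sectors left before the next chunk boundary.
        to_boundary = chunk_sectors - (start % chunk_sectors)
        length = min(nr_sectors, to_boundary, max_sectors)
        pieces.append((start, length))
        start += length
        nr_sectors -= length
    return pieces


def fits_without_split(start, nr_sectors, chunk_sectors, max_sectors):
    """True if the I/O stays whole under this simplified model."""
    return len(split_io(start, nr_sectors, chunk_sectors, max_sectors)) == 1
```

Under this model an application would have to keep each I/O within both
limits and inside one chunk for it to reach the device in one piece.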

Thanks,
Javier