On 2020/06/17 16:11, Javier González wrote:
> On 17.06.2020 06:54, Damien Le Moal wrote:
>> On 2020/06/17 15:18, Javier González wrote:
>>> On 17.06.2020 00:38, Damien Le Moal wrote:
>>>> On 2020/06/17 1:13, Javier González wrote:
>>>>> On 16.06.2020 09:07, Keith Busch wrote:
>>>>>> On Tue, Jun 16, 2020 at 05:55:26PM +0200, Javier González wrote:
>>>>>>> On 16.06.2020 08:48, Keith Busch wrote:
>>>>>>>> On Tue, Jun 16, 2020 at 05:02:17PM +0200, Javier González wrote:
>>>>>>>>> This depends very much on how the FS / application is managing
>>>>>>>>> striping. At the moment our main use case is enabling user-space
>>>>>>>>> applications submitting I/Os to raw ZNS devices through the kernel.
>>>>>>>>>
>>>>>>>>> Can we enable this use case to start with?
>>>>>>>>
>>>>>>>> I think this already provides that. You can set the nsid value to
>>>>>>>> whatever you want in the passthrough interface, so a namespace block
>>>>>>>> device is not required to issue I/O to a ZNS namespace from user space.
>>>>>>>
>>>>>>> Mmmmm. Problem now is that the check on the nvme driver prevents the ZNS
>>>>>>> namespace from being initialized. Am I missing something?
>>>>>>
>>>>>> Hm, okay, it may not work for you. We need the driver to create at least
>>>>>> one namespace so that we have tags and request_queue. If you have that,
>>>>>> you can issue IO to any other attached namespace through the passthrough
>>>>>> interface, but we can't assume there is an available namespace.
>>>>>
>>>>> That makes sense for now.
>>>>>
>>>>> The next step for us is to enable a passthrough on uring, making sure
>>>>> that I/Os do not split.
>>>>
>>>> Passthrough as in "application issues directly NVMe commands" like for SG_IO
>>>> with SCSI? Or do you mean raw block device file accesses by the application,
>>>> meaning that the IO goes through the block IO stack as opposed to directly going
>>>> to the driver?
>>>>
>>>> For the latter case, I do not think it is possible to guarantee that an IO will
>>>> not get split unless we are talking about single page IOs (e.g. 4K on X86). See
>>>> a somewhat similar request here and comments about it.
>>>>
>>>> https://www.spinics.net/lists/linux-block/msg55079.html
>>>
>>> At the moment we are doing the former, but it looks like a hack to me to
>>> go directly to the NVMe driver.
>>
>> That is what the nvme driver ioctl() is for, no? An application can send an NVMe
>> command directly to the driver with it. That is not a hack, but the regular way
>> of doing passthrough for NVMe, isn't it?
>
> We have enabled it through uring to get async() passthru submission.
> Looks like a hack at the moment, but we might just send an RFC to have
> something concrete to base the discussion on.

Yes, that would clarify things.

>>> I was thinking that we could enable the second path by making use of
>>> chunk_sectors and limit the I/O size just as the append_max_io_size
>>> does. Is this the completely wrong way of looking at it?
>>
>> The block layer cannot limit the size of a passthrough command since the command
>> is protocol specific and the block layer is a protocol independent interface.
>
> Agree. This work depends on the application being aware of a max I/O size
> at the moment. Down the road, we will remove (or at least limit a lot)
> this constraint for ZNS devices that can eventually cache out-of-order
> I/Os.

I/Os with a data buffer all need mapping for DMA, no matter the device
functionalities or the command being executed. With passthrough, I do not think
it is possible to have the block layer limit anything. It will likely always be
pass-or-fail. With passthrough, the application needs to understand what it is
doing.

>
>> SCSI SG does not split passthrough requests, it cannot. For passthrough
>> commands, the command buffer can be dma-mapped or it cannot. If mapping
>> succeeds, the command is issued.
>> If it cannot, the command is failed. At least,
>> that is my understanding of how the stack is working.
>
> I am not familiar with SCSI SG. This looks like how the ioctl() passthru
> works in NVMe, but as mentioned above, we would like to enable an
> async() passthru path.

That is done with bsg for SCSI I believe. You may want to have a look around
there. The SG driver used to have the write() system call mapped to "issuing a
command" and read() for "getting a command result". That was removed however.
But I think bsg has a replacement for that defunct async passthrough interface.
Not sure. I have not looked at that for a while.

>
> Thanks,
> Javier
>

-- 
Damien Le Moal
Western Digital Research
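
To make the "application needs to understand what it is doing" point concrete, here is a rough user-space sketch of what that awareness could look like for the synchronous ioctl() path. It only packs commands and never touches a device. The field layout is meant to mirror struct nvme_passthru_cmd from <linux/nvme_ioctl.h>. The helper names split_io() and pack_read(), and the max_blocks limit (which a real application would have to derive from the controller, e.g. from its maximum data transfer size), are hypothetical and not taken from this thread. The ioctl number assumes the generic Linux ioctl encoding used on x86 and arm. This says nothing about how the uring path mentioned above would look.

```python
import struct

# Mirrors struct nvme_passthru_cmd in <linux/nvme_ioctl.h>: opcode, flags,
# rsvd1, nsid, cdw2, cdw3, metadata, addr, metadata_len, data_len,
# cdw10..cdw15, timeout_ms, result.
PASSTHRU_FMT = "=BBHIIIQQII6III"
PASSTHRU_SIZE = struct.calcsize(PASSTHRU_FMT)

# _IOWR('N', 0x43, struct nvme_passthru_cmd), computed with the generic
# Linux ioctl encoding (assumption: x86/arm style direction bits).
NVME_IOCTL_IO_CMD = (3 << 30) | (PASSTHRU_SIZE << 16) | (ord("N") << 8) | 0x43

NVME_CMD_READ = 0x02  # NVM command set Read opcode


def split_io(slba, nr_blocks, max_blocks):
    """Split a block range into pieces of at most max_blocks each.

    Hypothetical helper: with passthrough nobody below the application
    splits the command, so the application has to do it.
    """
    if nr_blocks <= 0 or max_blocks <= 0:
        raise ValueError("nr_blocks and max_blocks must be positive")
    pieces = []
    while nr_blocks > 0:
        n = min(nr_blocks, max_blocks)
        pieces.append((slba, n))
        slba += n
        nr_blocks -= n
    return pieces


def pack_read(nsid, slba, nr_blocks, buf_addr, block_size):
    """Pack one Read command for the given nsid; nothing is sent here."""
    return struct.pack(
        PASSTHRU_FMT,
        NVME_CMD_READ, 0, 0,        # opcode, flags, rsvd1
        nsid,                       # target namespace, chosen by the caller
        0, 0,                       # cdw2, cdw3
        0,                          # metadata pointer (unused)
        buf_addr,                   # address of the user data buffer
        0,                          # metadata_len
        nr_blocks * block_size,     # data_len in bytes
        slba & 0xFFFFFFFF,          # cdw10: starting LBA, low 32 bits
        slba >> 32,                 # cdw11: starting LBA, high 32 bits
        nr_blocks - 1,              # cdw12: number of blocks, zero-based
        0, 0, 0,                    # cdw13..cdw15
        0,                          # timeout_ms
        0,                          # result, filled in by the driver
    )
```

Each (slba, nr_blocks) pair returned by split_io() would become one packed command, submitted with fcntl.ioctl(fd, NVME_IOCTL_IO_CMD, ...) on an nvme device node. Each of those commands then either maps and is issued or fails as a whole, which is the pass-or-fail behaviour described above.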