Re: [LSF/MM/BPF TOPIC] block drivers in user space

On 3/13/22 4:15 PM, Sagi Grimberg wrote:
> 
>>>>>> Actually, I'd rather have something like an 'inverse io_uring', where
>>>>>> an application creates a memory region separated into several 'ring'
>>>>>> for submission and completion.
>>>>>> Then the kernel could write/map the incoming data onto the rings, and
>>>>>> application can read from there.
>>>>>> Maybe it'll be worthwhile to look at virtio here.
>>>>>
>>>>> There is lio loopback backed by tcmu... I'm assuming that nvmet can
>>>>> hook into the same/similar interface. nvmet is pretty lean, and we
>>>>> can probably help tcmu/equivalent scale better if that is a concern...
>>>>
>>>> Sagi,
>>>>
>>>> I looked at tcmu prior to starting this work.  Other than the tcmu
>>>> overhead, one concern was the complexity of a scsi device interface
>>>> versus sending block requests to userspace.
>>>
>>> The complexity is understandable, though it can be viewed as a
>>> capability as well. Note I do not have any desire to promote tcmu here,
>>> just trying to understand if we need a brand new interface rather than
>>> making the existing one better.
>>
>> Ccing tcmu maintainer Bodo.
>>
>> We don't want to re-use tcmu's interface.
>>
>> Bodo has been looking into a new interface to avoid issues tcmu has
>> and to improve performance. If it's allowed to add a tcmu-like backend to
>> nvmet then it would be great because lio was not really made with mq and
>> perf in mind so it already starts with issues. I just started doing the
>> basics like removing locks from the main lio IO path but it seems like
>> there is just so much work.
> 
> Good to know...
> 
> So I hear there is a desire to do this. So I think we should list the
> use-cases for this first because that would lead to different design
> choices. For example, one use-case is just to send read/write/flush
> to userspace, another may want to passthru nvme commands to userspace
> and there may be others...

We might want to discuss at OLS or start a new thread.

Based on work we did for tcmu and local nbd, the question is how complex
handling nvme commands in the kernel can get. If you want to run nvmet
on a single node then you can pass just read/write/flush to userspace
and it's not really an issue.
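
To make the single-node case concrete, here is a rough sketch of the kind
of ring discussed above, carrying only read/write/flush descriptors from
the kernel to user space. Everything in it is an assumption for
illustration: the names (blk_req, req_ring, ring_push, ring_pop), the
field layout and the ring size are made up and are not taken from tcmu,
nbd, nvmet or io_uring. It is single-threaded on purpose, so a real
shared-memory ring would also need memory barriers around the head/tail
updates.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical opcodes: only what a single-node setup would need. */
enum blk_op { BLK_OP_READ, BLK_OP_WRITE, BLK_OP_FLUSH };

/* Hypothetical request descriptor the kernel side would place on the ring. */
struct blk_req {
	uint32_t tag;        /* echoed back in the completion */
	uint32_t op;         /* one of enum blk_op */
	uint64_t sector;     /* start offset in 512-byte sectors */
	uint32_t nr_sectors; /* 0 for a flush */
};

#define RING_SIZE 8u /* must be a power of two */

struct req_ring {
	uint32_t head; /* advanced by the consumer (user space) */
	uint32_t tail; /* advanced by the producer (kernel) */
	struct blk_req slots[RING_SIZE];
};

/* Producer side. Returns false when the ring is full. */
static bool ring_push(struct req_ring *r, const struct blk_req *req)
{
	/* Unsigned subtraction stays correct when the counters wrap. */
	if (r->tail - r->head == RING_SIZE)
		return false;
	r->slots[r->tail & (RING_SIZE - 1)] = *req;
	r->tail++;
	return true;
}

/* Consumer side. Returns false when the ring is empty. */
static bool ring_pop(struct req_ring *r, struct blk_req *out)
{
	if (r->head == r->tail)
		return false;
	*out = r->slots[r->head & (RING_SIZE - 1)];
	r->head++;
	return true;
}
```

A completion ring in the other direction would look the same with a
tag/status pair in place of blk_req.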

For tcmu/nbd the issue we are hitting is how to handle SCSI PGRs when
you are running lio on multiple nodes and the nodes export the same
LU to the same initiators. You can do it all in kernel like Bart did
for SCST and DLM
(https://blog.linuxplumbersconf.org/2015/ocw/sessions/2691.html).
However, for lio and tcmu some users didn't want pacemaker/corosync and
instead wanted to use their project's clustering or message passing.
So pushing to user space is nice for these commands.
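
As a rough illustration of that split, a target could keep the data
path in the kernel and only punt the reservation commands to a user
space daemon that talks to whatever clustering layer the project uses.
The sketch below is an assumption, not lio or tcmu code: the helper
name cdb_needs_userspace() and the SCSI_* macro names are made up for
this example. The opcode values themselves are the standard SPC/SBC
ones.

```c
#include <stdbool.h>
#include <stdint.h>

/* Standard SCSI opcodes (SBC/SPC); macro names are local to this sketch. */
#define SCSI_READ_10                0x28
#define SCSI_WRITE_10               0x2a
#define SCSI_SYNCHRONIZE_CACHE_10   0x35
#define SCSI_PERSISTENT_RESERVE_IN  0x5e
#define SCSI_PERSISTENT_RESERVE_OUT 0x5f

/*
 * Hypothetical policy hook: returns true when the command in cdb should
 * be handed to user space instead of being handled in the kernel.
 * Only the PGR commands are punted; everything else stays in the kernel.
 */
static bool cdb_needs_userspace(const uint8_t *cdb)
{
	switch (cdb[0]) {
	case SCSI_PERSISTENT_RESERVE_IN:
	case SCSI_PERSISTENT_RESERVE_OUT:
		return true;
	default:
		return false;
	}
}
```

With a hook like this, READ/WRITE/SYNCHRONIZE CACHE never leave the
kernel, and only the commands that need cluster-wide state pay for the
round trip to user space.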

There are/were also issues with things like ALUA commands and handling
failover across nodes but I think nvme ANA avoids them. For example,
there is nothing in nvme ANA like the SET_TARGET_PORT_GROUPS command,
which can set the state of what would be remote ports, right?


