Re: [LSF/MM/BPF TOPIC] block drivers in user space

Sagi Grimberg <sagi@xxxxxxxxxxx> · Tue, 15 Mar 2022 10:03:50 +0200

On 3/14/22 19:12, Mike Christie wrote:
On 3/13/22 4:15 PM, Sagi Grimberg wrote:

Actually, I'd rather have something like an 'inverse io_uring', where
an application creates a memory region separated into several 'rings'
for submission and completion.
Then the kernel could write/map the incoming data onto the rings, and
the application can read from there.
Maybe it'll be worthwhile to look at virtio here.
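
To make the 'inverse io_uring' idea concrete, here is a rough userspace
model of it, not kernel code. The kernel would be the producer on the
submission ring and the application the consumer; the roles flip on the
completion ring. The names Ring and serve_once and the request layout
(a dict with a "tag") are made up for illustration; nothing here is an
existing interface.

```python
from dataclasses import dataclass, field


@dataclass
class Ring:
    """Fixed-size single-producer/single-consumer ring (hypothetical layout)."""
    size: int                      # number of slots; must be a power of two
    head: int = 0                  # next slot the consumer reads
    tail: int = 0                  # next slot the producer writes
    slots: list = field(default_factory=list)

    def __post_init__(self):
        if self.size <= 0 or self.size & (self.size - 1):
            raise ValueError("size must be a power of two")
        self.slots = [None] * self.size

    def push(self, entry) -> bool:
        """Producer side: returns False when the ring is full."""
        if self.tail - self.head == self.size:
            return False
        self.slots[self.tail & (self.size - 1)] = entry
        self.tail += 1
        return True

    def pop(self):
        """Consumer side: returns None when the ring is empty."""
        if self.head == self.tail:
            return None
        entry = self.slots[self.head & (self.size - 1)]
        self.head += 1
        return entry


def serve_once(submission: Ring, completion: Ring, handler) -> int:
    """Application side: drain queued requests and post one result each."""
    handled = 0
    while True:
        if completion.tail - completion.head == completion.size:
            break  # no room to post a result; leave the request queued
        req = submission.pop()
        if req is None:
            break
        completion.push((req["tag"], handler(req)))
        handled += 1
    return handled
```

A real implementation would put head/tail in the shared mapping with
proper memory barriers and a wakeup mechanism; the sketch only shows the
ring discipline and who produces and consumes on each side.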

There is lio loopback backed by tcmu... I'm assuming that nvmet can
hook into the same/similar interface. nvmet is pretty lean, and we
can probably help tcmu/equivalent scale better if that is a concern...

Sagi,

I looked at tcmu prior to starting this work.  Other than the tcmu
overhead, one concern was the complexity of a scsi device interface
versus sending block requests to userspace.

The complexity is understandable, though it can be viewed as a
capability as well. Note I do not have any desire to promote tcmu here,
just trying to understand if we need a brand new interface rather than
making the existing one better.

Ccing tcmu maintainer Bodo.

We don't want to re-use tcmu's interface.

Bodo has been looking into a new interface to avoid issues tcmu has
and to improve performance. If it's allowed to add a tcmu-like backend to
nvmet then it would be great because lio was not really made with mq and
perf in mind so it already starts with issues. I just started doing the
basics like removing locks from the main lio IO path but it seems like
there is just so much work.

Good to know...

So I hear there is a desire to do this. So I think we should list the
use-cases for this first because that would lead to different design
choices. For example, one use-case is just to send read/write/flush
to userspace, another may want to passthru nvme commands to userspace
and there may be others...
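
As a rough illustration of how the two use-cases differ, the sketch
below takes a raw 64-byte NVMe submission queue entry and either reduces
it to a plain block request (read/write/flush) or reports that it would
need a passthru path. The field offsets and opcodes follow the NVMe
specification; the function name to_block_request and the returned dict
layout are assumptions made up for this example.

```python
import struct

# NVM command set I/O opcodes (from the NVMe specification).
NVME_CMD_FLUSH = 0x00
NVME_CMD_WRITE = 0x01
NVME_CMD_READ = 0x02

SIMPLE_OPS = {
    NVME_CMD_FLUSH: "flush",
    NVME_CMD_WRITE: "write",
    NVME_CMD_READ: "read",
}


def to_block_request(sqe: bytes):
    """Translate a 64-byte submission queue entry into a block request.

    Returns a dict for read/write/flush, or None when the command falls
    outside that set and would have to be passed through raw.
    """
    if len(sqe) != 64:
        raise ValueError("an NVMe submission queue entry is 64 bytes")
    op = SIMPLE_OPS.get(sqe[0])                     # opcode is byte 0
    if op is None:
        return None
    nsid = struct.unpack_from("<I", sqe, 4)[0]      # DW1: namespace ID
    if op == "flush":
        return {"op": op, "nsid": nsid}
    slba = struct.unpack_from("<Q", sqe, 40)[0]     # CDW10-11: starting LBA
    cdw12 = struct.unpack_from("<I", sqe, 48)[0]
    nlb = (cdw12 & 0xFFFF) + 1                      # NLB field is zero-based
    return {"op": op, "nsid": nsid, "slba": slba, "nlb": nlb}
```

With the first use-case the userspace side only ever sees the dict;
with the second it would get the 64 bytes untouched and have to
implement the command itself.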

We might want to discuss at OLS or start a new thread.

Based on work we did for tcmu and local nbd, the issue is how complex
can handling nvme commands in the kernel get? If you want to run nvmet
on a single node then you can pass just read/write/flush to userspace
and it's not really an issue.

As I said, I can see other use-cases that may want raw nvme commands
in a backend userspace driver...

For tcmu/nbd the issue we are hitting is how to handle SCSI PGRs when
you are running lio on multiple nodes and the nodes export the same
LU to the same initiators. You can do it all in kernel like Bart did
for SCST and DLM
(https://blog.linuxplumbersconf.org/2015/ocw/sessions/2691.html).
However, for lio and tcmu some users didn't want pacemaker/corosync and
instead wanted to use their project's clustering or message passing.
So pushing to user space is nice for these commands.

For this use-case we'd probably want to scan the config knobs to see
that we have what's needed (I think we should have enough to enable this
use-case).

There are/were also issues with things like ALUA commands and handling
failover across nodes but I think nvme ANA avoids them. Like there
is nothing in nvme ANA like the SET_TARGET_PORT_GROUPS command which can
set the state of what would be remote ports, right?

Right.