Re: [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware

On Fri, Feb 17, 2023 at 10:20:45AM +0800, Ming Lei wrote:
> On Thu, Feb 16, 2023 at 12:21:32PM +0100, Andreas Hindborg wrote:
> > 
> > Ming Lei <ming.lei@xxxxxxxxxx> writes:
> > 
> > > On Thu, Feb 16, 2023 at 10:44:02AM +0100, Andreas Hindborg wrote:
> > >> 
> > >> Hi Ming,
> > >> 
> > >> Ming Lei <ming.lei@xxxxxxxxxx> writes:
> > >> 
> > >> > On Mon, Feb 13, 2023 at 02:13:59PM -0500, Stefan Hajnoczi wrote:
> > >> >> On Mon, Feb 13, 2023 at 11:47:31AM +0800, Ming Lei wrote:
> > >> >> > On Wed, Feb 08, 2023 at 07:17:10AM -0500, Stefan Hajnoczi wrote:
> > >> >> > > On Wed, Feb 08, 2023 at 10:12:19AM +0800, Ming Lei wrote:
> > >> >> > > > On Mon, Feb 06, 2023 at 03:27:09PM -0500, Stefan Hajnoczi wrote:
> > >> >> > > > > On Mon, Feb 06, 2023 at 11:00:27PM +0800, Ming Lei wrote:
> > >> >> > > > > > Hello,
> > >> >> > > > > > 
> > >> >> > > > > > So far UBLK is only used for implementing virtual block devices from
> > >> >> > > > > > userspace, such as loop, nbd, qcow2, ...[1].
> > >> >> > > > > 
> > >> >> > > > > I won't be at LSF/MM so here are my thoughts:
> > >> >> > > > 
> > >> >> > > > Thanks for the thoughts, :-)
> > >> >> > > > 
> > >> >> > > > > 
> > >> >> > > > > > 
> > >> >> > > > > > It could be useful for UBLK to cover real storage hardware too:
> > >> >> > > > > > 
> > >> >> > > > > > - for fast prototyping or performance evaluation
> > >> >> > > > > > 
> > >> >> > > > > > - some network storage is attached to the host, such as iscsi and
> > >> >> > > > > > nvme-tcp; the current UBLK interface doesn't support such devices,
> > >> >> > > > > > since they need all LUNs/Namespaces to share host resources (such
> > >> >> > > > > > as tags)
> > >> >> > > > > 
> > >> >> > > > > Can you explain this in more detail? It seems like an iSCSI or
> > >> >> > > > > NVMe-over-TCP initiator could be implemented as a ublk server today.
> > >> >> > > > > What am I missing?
> > >> >> > > > 
> > >> >> > > > The current ublk can't do that yet, because the interface doesn't
> > >> >> > > > support multiple ublk disks sharing a single host, which is exactly
> > >> >> > > > the case for scsi and nvme.
> > >> >> > > 
> > >> >> > > Can you give an example that shows exactly where a problem is hit?
> > >> >> > > 
> > >> >> > > I took a quick look at the ublk source code and didn't spot a place
> > >> >> > > where it prevents a single ublk server process from handling multiple
> > >> >> > > devices.
> > >> >> > > 
> > >> >> > > Regarding "host resources(such as tag)", can the ublk server deal with
> > >> >> > > that in userspace? The Linux block layer doesn't have the concept of a
> > >> >> > > "host", that would come in at the SCSI/NVMe level that's implemented in
> > >> >> > > userspace.
> > >> >> > > 
> > >> >> > > I don't understand yet...
> > >> >> > 
> > >> >> > blk_mq_tag_set is embedded in the driver's host structure and referenced
> > >> >> > by the queue via q->tag_set. Both scsi and nvme allocate tags host/queue
> > >> >> > wide, that is, all LUNs/NSs share host/queue tags. Currently every ublk
> > >> >> > device is independent and can't share tags.
> > >> >> 
> > >> >> Does this actually prevent ublk servers with multiple ublk devices or is
> > >> >> it just sub-optimal?
> > >> >
> > >> > It is the former: ublk can't support multiple devices that share a single
> > >> > host, because duplicate tags can be seen on the host side and the I/O then
> > >> > fails.
> > >> >
> > >> 
> > >> I have trouble following this discussion. Why can we not handle multiple
> > >> block devices in a single ublk user space process?
> > >> 
> > >> From this conversation it seems that the limiting factor is allocation
> > >> of the tag set of the virtual device in the kernel? But as far as I can
> > >> tell, the tag sets are allocated per virtual block device in
> > >> `ublk_ctrl_add_dev()`?
> > >> 
> > >> It seems to me that a single ublk user space process should be able to
> > >> connect to multiple storage devices (for instance nvme-of) and then
> > >> create a ublk device for each namespace, all from a single ublk process.
> > >> 
> > >> Could you elaborate on why this is not possible?
> > >
> > > If the multiple storage devices are independent, the current ublk can
> > > handle them just fine.
> > >
> > > But if these storage devices (such as LUNs in iscsi, or NSs in nvme-tcp)
> > > share a single host and use a host-wide tagset, the current interface can't
> > > work as expected, because tags are shared among all these devices. The
> > > current ublk interface needs to be extended to cover this case.
> > 
> > Thanks for clarifying, that is very helpful.
> > 
> > Follow up question: What would the implications be if one tried to
> > expose (through ublk) each nvme namespace of an nvme-of controller with
> > an independent tag set?
> 
> https://lore.kernel.org/linux-block/877cwhrgul.fsf@xxxxxxxxxxxx/T/#m57158db9f0108e529d8d62d1d56652c52e9e3e67
> 
> > What are the benefits of sharing a tagset across
> > all namespaces of a controller?
> 
> The userspace implementation can be simplified a lot since generic
> shared tag allocation isn't needed, while still getting good performance
> (shared tag allocation under SMP is a hard problem)

In NVMe, tags are per Submission Queue. AFAIK there's no such thing as
shared tags across multiple SQs in NVMe. So userspace doesn't need an
SMP tag allocator in the first place:
- Each ublk server thread has a separate io_uring context.
- Each ublk server thread has its own NVMe Submission Queue.
- Therefore it's trivial and cheap to allocate NVMe CIDs in userspace
  because there are no SMP concerns.
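
To make this concrete, here is a minimal sketch of what per-SQ CID
allocation could look like inside one ublk server thread. Everything
here (the names, SQ_DEPTH, the struct) is made up for illustration; it
is not real ublk or NVMe driver code:

  /*
   * Per-SQ CID allocation, single-threaded by construction: each ublk
   * server thread owns exactly one NVMe Submission Queue, so no locks
   * or atomics are needed. All names are hypothetical.
   */
  #include <assert.h>
  #include <stdint.h>

  #define SQ_DEPTH 128

  struct sq_cids {
          uint16_t free[SQ_DEPTH];   /* stack of free command identifiers */
          unsigned int top;          /* number of free CIDs left */
  };

  static void sq_cids_init(struct sq_cids *c)
  {
          for (unsigned int i = 0; i < SQ_DEPTH; i++)
                  c->free[i] = (uint16_t)i;
          c->top = SQ_DEPTH;
  }

  /* Called only from the thread that owns this SQ. */
  static int sq_cid_get(struct sq_cids *c, uint16_t *cid)
  {
          if (c->top == 0)
                  return -1;         /* SQ full: wait for completions */
          *cid = c->free[--c->top];
          return 0;
  }

  static void sq_cid_put(struct sq_cids *c, uint16_t cid)
  {
          assert(c->top < SQ_DEPTH);
          c->free[c->top++] = cid;
  }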

The issue isn't tag allocation, it's the fact that the kernel block
layer submits requests to userspace that don't fit into the NVMe
Submission Queue because multiple devices that appear independent from
the kernel perspective are sharing a single NVMe Submission Queue.
Userspace needs a basic I/O scheduler, round-robin for example, to ensure
fairness across devices. There are no SMP concerns here either.
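
Again purely as an illustration, that whole "scheduler" could be a small
round-robin loop over the devices feeding one SQ. The types and the
submit helper below are placeholders I invented, not actual ublk server
code:

  /*
   * Round-robin dispatch for several ublk devices sharing one NVMe
   * Submission Queue, all handled by a single thread. Enqueueing onto
   * each device's FIFO is not shown. All names are hypothetical.
   */
  #include <stddef.h>

  #define NR_DEVS 4

  struct io_req {
          struct io_req *next;       /* FIFO link; real fields omitted */
  };

  struct shared_dev {
          struct io_req *head;       /* oldest pending request, or NULL */
  };

  /* Placeholder: would build the NVMe command and write it to the SQ. */
  static void submit_to_sq(struct io_req *req)
  {
          (void)req;
  }

  static struct io_req *pop_req(struct shared_dev *d)
  {
          struct io_req *req = d->head;

          if (req)
                  d->head = req->next;
          return req;
  }

  /*
   * Visit devices in round-robin order so one busy namespace cannot
   * starve the others; stop when the SQ is full or nothing is queued.
   */
  static void dispatch(struct shared_dev devs[NR_DEVS], unsigned int *next,
                       unsigned int sq_free_slots)
  {
          unsigned int idle = 0;

          while (sq_free_slots > 0 && idle < NR_DEVS) {
                  struct shared_dev *d = &devs[*next];

                  *next = (*next + 1) % NR_DEVS;
                  if (!d->head) {
                          idle++;
                          continue;
                  }
                  idle = 0;
                  submit_to_sq(pop_req(d));
                  sq_free_slots--;
          }
  }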

So I don't buy the argument that userspace would have to duplicate the
tag allocation code from Linux because that solves a different problem
that the ublk server doesn't have.

If the kernel is aware of tag sharing, then userspace doesn't have to do
(trivial) tag allocation or I/O scheduling. It can simply stuff ublk io
commands into NVMe queues without thinking, which wastes fewer CPU
cycles and is a little simpler.

> The extension shouldn't be very hard; here are some raw ideas:

It is definitely nice for the ublk server to tell the kernel about
shared resources so the Linux block layer has the best information. I
think it's a good idea to add support for that. I just disagree with
some of the statements you've made about why and especially the claim
that ublk doesn't support multiple device servers today.

> 
> 1) interface change
> 
> - add a new feature flag UBLK_F_SHARED_HOST; multiple ublk
>   devices (ublkcXnY) are attached to the ublk host (ublkhX)
> 
> - dev_info.dev_id: in case of UBLK_F_SHARED_HOST, the top 16 bits store
>   the host id (X), and the bottom 16 bits store the device id (Y)
> 
> - add two control commands: UBLK_CMD_ADD_HOST, UBLK_CMD_DEL_HOST
> 
>   Still sent to /dev/ublk-control
> 
>   The ADD_HOST command will allocate one host (char) device with the
>   specified or an allocated host id; the tag_set is allocated as a host
>   resource. The host device (ublkhX) will become the parent of all ublkcXn*
> 
>   Before sending DEL_HOST, all devices attached to this host have to
>   be stopped & removed first, otherwise DEL_HOST won't succeed.
> 
> - keep the other interfaces unchanged:
>   in case of UBLK_F_SHARED_HOST, userspace has to set the correct
>   dev_info.dev_id.host_id so the ublk driver can associate the device
>   with the specified host
> 
> 2) implementation
> - the host device (ublkhX) becomes the parent of all ublk char devices
>   ublkcXn*
> 
> - besides the tagset, is any other per-host resource abstraction needed?
>   Looks unnecessary, since everything is available in userspace
> 
> - host-wide error handling: maybe all devices attached to this host
>   need to be recovered, so it should be done in userspace
> 
> - per-host admin queue: looks unnecessary, given that host-related
>   management/control tasks are done in userspace directly
> 
> - others?
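
Just to check that I've read the dev_id layout right, the encoding in
1) above would amount to something like the following. These helpers
only restate the proposal; none of this exists in the UAPI today:

  /*
   * Proposed dev_info.dev_id layout under UBLK_F_SHARED_HOST:
   * top 16 bits = host id (X), bottom 16 bits = device id (Y).
   * Hypothetical helpers, written only to restate the idea.
   */
  #include <stdint.h>

  static inline uint32_t ublk_mk_dev_id(uint16_t host_id, uint16_t dev_id)
  {
          return ((uint32_t)host_id << 16) | dev_id;
  }

  static inline uint16_t ublk_dev_id_host(uint32_t dev_id)
  {
          return dev_id >> 16;
  }

  static inline uint16_t ublk_dev_id_dev(uint32_t dev_id)
  {
          return dev_id & 0xffff;
  }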
> 
> 
> Thanks,
> Ming
> 
