Re: status of spdk

On Wed, 2016-11-09 at 21:10 +0000, Gohad, Tushar wrote:
> > 
> > > 
> > > Multiple DPDK/SPDK instances on a single host does not work because 
> > > the current implementation in Ceph does not support it. This issue is 
> > > tracked here: http://tracker.ceph.com/issues/16966 There is 
> > > multi-process support in DPDK, but you must configure the EAL 
> > > correctly for it to work. I have been working on a patch, 
> > > https://github.com/ommoreno/ceph/tree/wip-16966, that allows the user 
> > > to configure multiple BlueStore OSDs backed by SPDK. Though this patch 
> > > works, I think it needs a few additions to actually make it performant.
> > > This is just to get the 1 OSD process per NVMe case working. A 
> > > multi-OSD per NVMe solution will probably require more work as 
> > > described in this thread.
> 
> > 
> > TBH I'm not sure how important the multi-osd per NVMe case is.  
> > The only reason to do that would be performance bottlenecks within the 
> > OSD itself, and I'd rather focus our efforts on eliminating those than on 
> > enabling a bandaid solution.
> 
> Completely agree here.

I'm not a Ceph expert (I'm the technical lead for SPDK), but I echo this
sentiment 1000x. Even the fastest NVMe device can only do fewer than 1M 4K IOPS,
which is a very modest amount of work in CPU terms, so there is no technical reason that
a single OSD can't saturate that. I understand that the OSDs of today aren't
able to achieve that level of performance, but I'm optimistic a concerted long-
term effort involving the experts could make it happen.

I'd also like to explain a few things about NVMe to clear up some confusion I
saw earlier in this thread. NVMe devices are composed of three major primitives
- a singleton controller, some set of namespaces, and a set of queues. The
namespaces are constructs on the SSD itself, and they're basically contiguous
sets of logical blocks within a single NVMe controller. The vast majority of
SSDs support exactly 1 namespace and I don't expect that to change going
forward. The single NVMe controller is what the NVMe driver binds to, so either
SPDK or the kernel driver owns the whole device - you can't mix and match
drivers or split namespaces between them.
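
To make that concrete, here is a rough sketch of enumerating controllers and
namespaces with the SPDK NVMe driver. The function signatures follow a recent
SPDK release rather than the old copy vendored in Ceph, so treat it as
illustrative rather than the exact code BlueStore uses:

/*
 * Rough sketch: enumerate controllers and namespaces with the SPDK NVMe
 * driver.  Signatures follow a recent SPDK release, not the old copy
 * vendored in Ceph, so treat this as illustrative.
 */
#include <stdio.h>
#include <stdint.h>
#include "spdk/env.h"
#include "spdk/nvme.h"

static bool
probe_cb(void *cb_ctx, const struct spdk_nvme_transport_id *trid,
         struct spdk_nvme_ctrlr_opts *opts)
{
    return true;    /* attach to every NVMe controller we find */
}

static void
attach_cb(void *cb_ctx, const struct spdk_nvme_transport_id *trid,
          struct spdk_nvme_ctrlr *ctrlr, const struct spdk_nvme_ctrlr_opts *opts)
{
    /* One controller, N namespaces -- N is almost always 1 in practice. */
    uint32_t num_ns = spdk_nvme_ctrlr_get_num_ns(ctrlr);

    for (uint32_t nsid = 1; nsid <= num_ns; nsid++) {
        struct spdk_nvme_ns *ns = spdk_nvme_ctrlr_get_ns(ctrlr, nsid);

        if (ns != NULL && spdk_nvme_ns_is_active(ns)) {
            printf("ctrlr %s ns %u: %llu bytes\n", trid->traddr, nsid,
                   (unsigned long long)spdk_nvme_ns_get_size(ns));
        }
    }
}

int
main(void)
{
    struct spdk_env_opts opts;

    spdk_env_opts_init(&opts);
    opts.name = "ns_enum";
    if (spdk_env_init(&opts) < 0) {
        return 1;
    }
    /* Probes PCIe devices that have been unbound from the kernel driver. */
    return spdk_nvme_probe(NULL, NULL, probe_cb, attach_cb, NULL);
}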

NVMe also exposes a set of queues on which I/O requests can be submitted. These
queues can submit an I/O request to any namespace on the device and there is no
way to enforce particular queues mapping to particular namespaces. Therefore,
namespaces aren't that valuable as a mechanism for sharing the drive - you
basically still have to have a software/driver layer verifying that everyone is
keeping their requests separate (the namespace mechanism is there so that the
media can be formatted in different ways - i.e. different block sizes,
additional metadata, etc.). SPDK exposes these queues to the user so that
applications can submit I/O on each queue entirely locklessly and with no
coordination. Unfortunately, the version of SPDK currently in use by BlueStore
is ancient and the queues are all implicit still. It probably doesn't matter for
performance, since the BlueStore SPDK backend only sends I/O from a single
thread, which means it is using just a single queue. NVMe devices almost
universally can get their full performance using a single queue, so multiple
queues are only useful for letting the application software submit I/O from many
threads simultaneously without locking (which BlueStore is not doing).
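
For reference, the lockless model looks roughly like the sketch below: each
thread allocates its own queue pair, submits I/O on it, and polls it for
completions, with no locks anywhere on the hot path. Again the signatures are
from a recent SPDK release, and 'ctrlr'/'ns' are assumed to come from the
attach callback, so this is a sketch rather than what BlueStore does today:

/*
 * Sketch of the lockless queue model, assuming a recent SPDK API.  Each
 * thread owns its own queue pair and polls it, so no locking is needed.
 */
#include <stdbool.h>
#include "spdk/env.h"
#include "spdk/nvme.h"

struct io_ctx {
    volatile bool done;
};

static void
read_done(void *cb_arg, const struct spdk_nvme_cpl *cpl)
{
    ((struct io_ctx *)cb_arg)->done = true;
}

static void
per_thread_io(struct spdk_nvme_ctrlr *ctrlr, struct spdk_nvme_ns *ns)
{
    /* One queue pair per thread; never shared, never locked. */
    struct spdk_nvme_qpair *qpair = spdk_nvme_ctrlr_alloc_io_qpair(ctrlr, NULL, 0);
    /* DMA-able buffer carved out of hugepage-backed memory. */
    void *buf = spdk_dma_zmalloc(4096, 4096, NULL);
    struct io_ctx ctx = { .done = false };

    spdk_nvme_ns_cmd_read(ns, qpair, buf, 0 /* lba */, 1 /* block */,
                          read_done, &ctx, 0);

    /* Completions are polled on the same thread that submitted the I/O. */
    while (!ctx.done) {
        spdk_nvme_qpair_process_completions(qpair, 0);
    }

    spdk_dma_free(buf);
    spdk_nvme_ctrlr_free_io_qpair(qpair);
}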

The SPDK NVMe driver unbinds the nvme driver in the kernel, then maps the NVMe
controller registers into a userspace process, so only that process has access
to the device. We're currently modifying the driver to allocate the critical
structures in shared memory so certain parts can be mapped by secondary
processes. This does allow for some level of multi-process support. We mostly
intended this for use with management tools like nvme-cli - they can attach to
the main process and send some management commands and then detach. I'm not sure
this is a great solution for sharing an NVMe device across multiple primary OSD
processes though. We can definitely do something in this space to create a
daemon process that owns the device and allows other processes to attach and
allocate queues, but like I said above I think the effort is best spent on
making the OSD faster.

Further, given what the NVMe hardware is actually capable of, I think the right
solution for sharing an NVMe device within a process is to write a partition
layer in software based on standard GPT partitioning. That could sit on top of
the NVMe driver and do the enforcement of which parts can write to which logical
blocks on the SSD. This would be the best way forward if the Ceph community
pursues multiple OSDs in a single process (again, I think the time should be
spent making one OSD fast enough to saturate one SSD instead).
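
Something like the following is all the enforcement layer would really need to
do. The nvme_part structure and part_write() helper are made up for
illustration - only spdk_nvme_ns_cmd_write() is a real SPDK call - and
first_lba/num_lba would presumably be filled in from a GPT partition entry:

/*
 * Hypothetical partition-enforcement sketch.  'nvme_part' and part_write()
 * are illustrative names; only spdk_nvme_ns_cmd_write() is a real SPDK call.
 */
#include <errno.h>
#include <stdint.h>
#include "spdk/nvme.h"

struct nvme_part {
    struct spdk_nvme_ns *ns;    /* the single shared namespace */
    uint64_t first_lba;         /* start of this partition */
    uint64_t num_lba;           /* length of this partition */
};

static int
part_write(struct nvme_part *p, struct spdk_nvme_qpair *qpair, void *buf,
           uint64_t lba, uint32_t count, spdk_nvme_cmd_cb cb_fn, void *cb_arg)
{
    /* Reject anything outside this partition before it reaches the device. */
    if (lba + count > p->num_lba) {
        return -EINVAL;
    }
    return spdk_nvme_ns_cmd_write(p->ns, qpair, buf, p->first_lba + lba,
                                  count, cb_fn, cb_arg, 0);
}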

> 
> > 
> > As I understand it the scenarios that are most interesting are
> 
> > 
> > 1- sharing the same network device to multiple osds with DPDK (this will
> > presumably be pretty common unless/until we combine many OSDs into a single
> > process), and

I believe the best path forward here is SR-IOV support in the NICs. I don't
know the exact state of the hardware, but I think SR-IOV is already commonly
available on the network side.

> 
> > 
> > 2- sharing a tiny portion of the NVMe device for the osd_data (usually a few
> > MB of metadata that gets mounted at /var/lib/ceph/osd/$cluster-$id).  
> > Not sure this will be feasible or not.
> 
> Unfortunately, this is not feasible without some form of partitioning support
> in SPDK (namespace/GPT or in form of a new LVM-like layer on top of SPDK bdev)
> - the latter is under development at the moment.

We are currently developing both a persistent block allocator and a very
lightweight, minimally featured filesystem (no directories, no permissions, no
times). The original target for these is to serve as the backing store for RocksDB, but
they can be easily expanded to store other data. It isn't necessarily our
primary aim to incorporate this into Ceph, but it clearly fits into the Ceph
internals very well. We haven't provided a timeline on open sourcing this, but
we're actively writing the code now.

> 
> The limitation that Orlando identified is, not being able to launch multiple
> SPDK-based OSDs on a node today.  Igor and Orlando's PR (16966) is to add a
> config option to limit the number of hugepages assigned to an OSD via an EAL
> switch.  The other limitation today is being able to specify the CPU mask
> assigned to each OSD which would require config addition.
> 

To clarify, DPDK requires a process to declare the amount of memory (hugepages)
and which CPU cores the process will use up front. If you want to run multiple
DPDK-based processes on the same system, you just have to make sure there are
enough hugepages and that the cores you specify don't overlap. That PR is just
making it so you can configure these values. I just wanted to clarify that there
isn't any deeper technical problem with running multiple DPDK processes on the
same system - you all probably knew that but it's best to be clear. 
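
Underneath whatever config options Ceph ends up exposing, this just turns into
the arguments each process passes to rte_eal_init(). The flags in the sketch
below are plain DPDK EAL arguments, not anything Ceph-specific, and the core
masks and memory sizes are made-up examples:

/*
 * Each OSD process hands DPDK's EAL its own core mask, its own hugepage
 * budget, and a distinct --file-prefix so hugepage mappings don't collide.
 */
#include <rte_eal.h>

int
main(void)
{
    /* e.g. one OSD gets cores 0-1 and 512 MB of hugepages ... */
    char *eal_args[] = {
        "osd0",
        "-c", "0x3",              /* core mask: cores 0 and 1 */
        "-m", "512",              /* hugepage memory in MB */
        "--file-prefix", "osd0",  /* keep hugepage files separate per process */
    };
    /* ... a second OSD would use, say, "-c 0xc -m 512 --file-prefix osd1". */

    if (rte_eal_init(sizeof(eal_args) / sizeof(eal_args[0]), eal_args) < 0) {
        return 1;
    }
    return 0;
}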

Also, DPDK uses hugepages because it's the only good way to get "pinned" memory
in userspace that userspace drivers can DMA into and out of. That's because the
kernel doesn't page out or move around hugepages (hugepages also happen to be
more efficient TLB-wise given that the data buffers are often large transfers).
There is some work on vfio-pci in the kernel that may provide a better solution
in the long term, but I'm not totally up to speed on that. Because data must
reside in hugepages currently, all buffers sent to the SPDK backend for
BlueStore are copied from wherever they are into a buffer from a pool allocated
out of hugepages. It would be better if all data buffers were originally
allocated from hugepage memory, but that's a bigger change to Ceph of course.
Note that incoming packets from DPDK will also reside in hugepages upon DMA from
the NIC, which would be convenient, except that almost all NVMe devices today
don't support fully flexible scatter-gather lists for data buffers, so you end
up forced to copy simply to satisfy the alignment requirements of the DMA
engine. Some day though!
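
For what it's worth, that copy boils down to something like the sketch below:
allocate a DMA-safe buffer out of hugepage memory with spdk_dma_zmalloc(),
memcpy the caller's data into it, and submit that. The helper name and
surrounding structure are illustrative, not the actual BlueStore code:

/*
 * Roughly the copy the BlueStore SPDK backend has to do today.  The helper
 * is illustrative; spdk_dma_zmalloc() and spdk_nvme_ns_cmd_write() are the
 * real SPDK calls.
 */
#include <string.h>
#include "spdk/env.h"
#include "spdk/nvme.h"

static int
submit_write_with_copy(struct spdk_nvme_ns *ns, struct spdk_nvme_qpair *qpair,
                       const void *user_buf, size_t len, uint64_t lba,
                       spdk_nvme_cmd_cb cb_fn, void *cb_arg)
{
    /* Pinned, hugepage-backed, aligned buffer the DMA engine can use. */
    void *dma_buf = spdk_dma_zmalloc(len, 4096, NULL);

    if (dma_buf == NULL) {
        return -1;
    }
    /* The copy we'd love to avoid by allocating data buffers from hugepage
     * memory in the first place. */
    memcpy(dma_buf, user_buf, len);
    /* dma_buf would be freed with spdk_dma_free() in the completion callback. */
    return spdk_nvme_ns_cmd_write(ns, qpair, dma_buf, lba,
                                  len / spdk_nvme_ns_get_sector_size(ns),
                                  cb_fn, cb_arg, 0);
}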

Sorry to be so long-winded, but I'm happy to help with SPDK.

Thanks,
Ben

> Tushar 
> 
> 
> > 
> > 
> > -----Original Message-----
> > From: ceph-devel-owner@xxxxxxxxxxxxxxx 
> > [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Dong Wu
> > Sent: Tuesday, November 8, 2016 7:45 PM
> > To: LIU, Fei <james.liu@xxxxxxxxxxxxxxx>
> > Cc: Yehuda Sadeh-Weinraub <yehuda@xxxxxxxxxx>; Sage Weil 
> > <sweil@xxxxxxxxxx>; Wang, Haomai <haomaiwang@xxxxxxxxx>; ceph-devel 
> > <ceph-devel@xxxxxxxxxxxxxxx>
> > Subject: Re: status of spdk
> > 
> > Hi, Yehuda and Haomai,
> >     DPDK backend may have the same problem. I had tried to use haomai's  PR:
> > https://github.com/ceph/ceph/pull/10748 to test dpdk backend, but failed to
> > start multiple OSDs on the host with only one network card, also I read
> > about the dpdk multi-process support:
> > http://dpdk.org/doc/guides/prog_guide/multi_proc_support.html, but did not
> > find any config  to set multi-process support. Anything wrong or multi-
> > process support not been implemented?
> > 
> > 2016-11-09 8:21 GMT+08:00 LIU, Fei <james.liu@xxxxxxxxxxxxxxx>:
> > > 
> > > Hi Yehuda and Haomai,
> > >    The issue of drives driven by SPDK is not able to be shared by multiple
> > > OSDs as kernel NVMe drive since SPDK as a process so far can not be shared
> > > across multiple processes like OSDs, right?
> > > 
> > >    Regards,
> > >    James
> > > 
> > > 
> > > 
> > > On 11/8/16, 4:06 PM, "Yehuda Sadeh-Weinraub" <ceph-devel-owner@vger.kernel
> > > .org on behalf of yehuda@xxxxxxxxxx> wrote:
> > > 
> > >     On Tue, Nov 8, 2016 at 3:40 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> > >     > On Tue, 8 Nov 2016, Yehuda Sadeh-Weinraub wrote:
> > >     >> I just started looking at spdk, and have a few comments and
> > > questions.
> > >     >>
> > >     >> First, it's not clear to me how we should handle build. At the
> > > moment
> > >     >> the spdk code resides as a submodule in the ceph tree, but it
> > > depends
> > >     >> on dpdk, which currently needs to be downloaded separately. We can
> > > add
> > >     >> it as a submodule (upstream is here: git://dpdk.org/dpdk). That
> > > been
> > >     >> said, getting it to build was a bit tricky and I think it might be
> > >     >> broken with cmake. In order to get it working I resorted to
> > > building a
> > >     >> system library and use that.
> > >     >
> > >     > Note that this PR is about to merge
> > >     >
> > >     >         https://github.com/ceph/ceph/pull/10748
> > >     >
> > >     > which adds the DPDK submodule, so hopefully this issue will go away
> > > when
> > >     > that merged or with a follow-on cleanup.
> > >     >
> > >     >> The way to currently configure an osd to use bluestore with spdk is
> > > by
> > >     >> creating a symbolic link that replaces the bluestore 'block' device
> > > to
> > >     >> point to a file that has a name that is prefixed with 'spdk:'.
> > >     >> Originally I assumed that the suffix would be the nvme device id,
> > > but
> > >     >> it seems that it's not really needed, however, the file itself
> > > needs
> > >     >> to contain the device id (see
> > >     >> https://github.com/yehudasa/ceph/tree/wip-yehuda-spdk for a couple
> > > of
> > >     >> minor fixes).
> > >     >
> > >     > Open a PR for those?
> > > 
> > >     Sure
> > > 
> > >     >
> > >     >> As I understand it, in order to support multiple osds on the same
> > > NVMe
> > >     >> device we have a few options. We can leverage NVMe namespaces, but
> > >     >> that's not supported on all devices. We can configure bluestore to
> > >     >> only use part of the device (device sharding? not sure if it
> > > supports
> > >     >> it). I think it's best if we could keep bluestore out of the loop
> > >     >> there and have the NVMe driver abstract multiple partitions of the
> > >     >> NVMe device. The idea is to be able to define multiple partitions
> > > on
> > >     >> the device (e.g., each partition will be defined by the offset,
> > > size,
> > >     >> and namespace), and have the osd set to use a specific partition.
> > >     >> We'll probably need a special tool to manage it, and potentially
> > > keep
> > >     >> the partition table information on the device itself. The tool
> > > could
> > >     >> also manage the creation of the block link. We should probably
> > > rethink
> > >     >> how the link is structure and what it points at.
> > >     >
> > >     > I agree that bluestore shouldn't get involved.
> > >     >
> > >     > Is the NVMe namespaces meant to support multiple processes sharing
> > > the
> > >     > same hardware device?
> > > 
> > >     More of a partitioning solution, but yes (as far as I understand).
> > > 
> > >     >
> > >     > Also, if you do that, is it possible to give one of the namespaces
> > > to the
> > >     > kernel?  That might solve the bootstrapping problem we 
> > > currently have
> > > 
> > >     Theoretically, but not right now (or ever?). See here:
> > > 
> > >     https://lists.01.org/pipermail/spdk/2016-July/000073.html
> > > 
> > >     > where we have nowhere to put the $osd_data filesystem with the
> > > device
> > >     > metadata.  (This is admittedly not necessarily a blocking
> > > issue.  Putting
> > >     > those dirs on / wouldn't be the end of the world; it just means
> > > cards
> > >     > can't be easily moved between boxes.)
> > >     >
> > > 
> > >     Maybe we can use bluestore for these too ;) that been said, there
> > >     might be some kind of a loopback solution that could work, but not
> > >     sure if it won't create major bottlenecks that we'd want to avoid.
> > > 
> > >     Yehuda