Re: status of spdk

On Thu, 2016-11-10 at 22:59 +0000, Sage Weil wrote:
> Hi-
> 
> Thanks, Ben-- this is super helpful!
> 
> On Thu, 10 Nov 2016, Walker, Benjamin wrote:
> > 
> > > 
> > > > 
> > > > 2- sharing a tiny portion of the NVMe device for the osd_data (usually
> > > > a few MB of metadata that gets mounted at
> > > > /var/lib/ceph/osd/$cluster-$id).  Not sure whether this will be
> > > > feasible.
> > > 
> > > Unfortunately, this is not feasible without some form of partitioning
> > > support in SPDK (a namespace/GPT or a new LVM-like layer on top of the
> > > SPDK bdev) - the latter is under development at the moment.
> > 
> > We are currently developing both a persistent block allocator and a very
> > lightweight, minimally featured filesystem (no directories, no permissions,
> > no times). The original target for these is to serve as the backing store
> > for RocksDB, but they can easily be expanded to store other data. It isn't
> > necessarily our primary aim to incorporate this into Ceph, but it clearly
> > fits into the Ceph internals very well. We haven't provided a timeline for
> > open sourcing this, but we're actively writing the code now.
> 
> This may not actually be that helpful.  It's basically what BlueFS is 
> already doing (it's a rocksdb::Env that implements a minimal "file system" 
> and shares the device with the rest of BlueStore).

Understood - this is pretty much BlueFS + BlueStore in a standalone form. On
the surface it looks like duplicated work, but we are very heavily focused on
solid state media (particularly next-generation media beyond NAND), and that
has led our design to diverge quite a bit from BlueStore's. If our work ends
up benefiting Ceph in some way in the longer term, that's great, but I
understand Ceph already has code doing somewhat similar things.

> 
> What this point is really about is more operational than anything.  
> Currently disks (HDDs or SSDs) can be easily swapped between machines 
> because they have GPT partition labels and udev rules to run 'ceph-disk 
> trigger' on them.  That basically mounts the tagged partition to a 
> temporary location, figures out which OSD it is, bind mounts it to the 
> appropriate /var/lib/ceph/osd/* directory, and then starts up the process.  
> With BlueStore there are just a handful of metadata/bootstrap files here 
> to get the OSD started:
> 
> -rw-r--r-- 1 sage sage           2 Nov  9 11:40 bluefs
> -rw-r--r-- 1 sage sage          37 Nov  9 11:40 ceph_fsid
> -rw-r--r-- 1 sage sage          37 Nov  9 11:40 fsid
> -rw------- 1 sage sage          56 Nov  9 11:40 keyring
> -rw-r--r-- 1 sage sage           8 Nov  9 11:40 kv_backend
> -rw-r--r-- 1 sage sage          21 Nov  9 11:40 magic
> -rw-r--r-- 1 sage sage           4 Nov  9 11:40 mkfs_done
> -rw-r--r-- 1 sage sage           6 Nov  9 11:40 ready
> -rw-r--r-- 1 sage sage          10 Nov  9 11:40 type
> -rw-r--r-- 1 sage sage           2 Nov  9 11:40 whoami
> 
> plus a symlink for block, block.db, and block.wal to the other partitions 
> or devices with the actual block data.
> 
> With SPDK, we can't carve out a partition or label it, and certainly can't 
> mount it, so we'll need to rethink the bootstrapping process.  Fortunately 
> that can be wrapped up reasonably neatly in the 'ceph-disk activate' 
> function, but eventually we'll need to decide how to store/manage this 
> metadata about the device.

This sounds like a solvable problem to me. An OSD using BlueStore uses a block
device that has one GPT partition with a filesystem (XFS?) containing the
above bootstrapping data, plus some number of other GPT partitions with no
filesystems that are used for everything else, right? I think there are two
changes that could be made here. First, the bootstrap partition needs to
contain a BlueStore/Ceph-specific data layout instead of using a kernel
filesystem. Maybe it could even be simpler and just use a flat binary layout
containing the above files sequentially.
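
Something like this, say - purely a sketch of the idea, with made-up names
(there is no "BootstrapBlob" anywhere today): a magic string followed by
length-prefixed (name, data) records written back to back, one per file in
the listing above.

  #include <cstdint>
  #include <string>
  #include <vector>

  // Hypothetical flat bootstrap layout - not an existing Ceph format.
  struct BootstrapBlob {
    std::vector<uint8_t> buf;

    BootstrapBlob() { add_raw("CEPHBOOT", 8); }   // made-up magic

    void add_raw(const char* p, size_t n) {
      buf.insert(buf.end(), p, p + n);
    }

    void add_u32(uint32_t v) {                    // little-endian length
      for (int i = 0; i < 4; ++i)
        buf.push_back(uint8_t(v >> (8 * i)));
    }

    // Append one "file" (e.g. "whoami", "fsid") as a length-prefixed record.
    void add(const std::string& name, const std::string& data) {
      add_u32(name.size());
      add_raw(name.data(), name.size());
      add_u32(data.size());
      add_raw(data.data(), data.size());
    }
  };

Whatever runs at activate time would just scan those records - the moral
equivalent of listing the directory above.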

Second, the BlueStore SPDK backend needs to comprehend real GPT partition
metadata (this part is not particularly hard - GPT is simple). That way, the
disk format between OSDs using SPDK and those using the kernel is identical,
and SPDK respects the partitions and can locate them by partition label. Once
they're identical, Ceph can simply load using the GPT partition label and udev
mechanism as it does today, then dynamically unbind the kernel nvme driver
from the device (you just write to sysfs) and load SPDK in its place. Because
the SPDK backend expects the same disk format as the kernel, it will load
without issue.
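
The unbind itself is a tiny amount of sysfs poking. A minimal sketch of what
I mean, assuming the device's PCI address is already known and uio_pci_generic
is the stub driver SPDK will sit on (error handling omitted; this isn't
existing Ceph code):

  #include <fstream>
  #include <string>

  // Hand an NVMe device from the kernel nvme driver to uio_pci_generic so a
  // userspace driver can claim it.  bdf is a PCI address like "0000:04:00.0".
  void rebind_to_userspace(const std::string& bdf) {
    // Tell the PCI core which driver should claim this device next.
    std::ofstream("/sys/bus/pci/devices/" + bdf + "/driver_override")
        << "uio_pci_generic";
    // Detach the in-kernel nvme driver.
    std::ofstream("/sys/bus/pci/drivers/nvme/unbind") << bdf;
    // Re-probe; the device now binds to the override driver.
    std::ofstream("/sys/bus/pci/drivers_probe") << bdf;
  }

Going the other way (handing the device back to the kernel) is just clearing
driver_override and re-probing, so hot swap doesn't have to be a one-way trip.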

I think this probably also solves a few of the other configuration pain points
of using Ceph with SPDK. With this strategy, all you have to do is flag the
OSD to use SPDK, with no other configuration changes (well, maybe the number
of hugepages and which cores are allowed). That's because most of the disk
configuration is about specifying which data lives where, and that is done by
GPT partition label, which the SPDK backend would now comprehend.
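
To back up the "GPT is simple" claim: the only on-disk structures the SPDK
backend has to understand are the header at LBA 1 and the 128-byte partition
entries. Here's a sketch with the field layout from the UEFI spec; the
label_matches helper and the "ceph block.db" label are just illustrations of
matching the partition labels ceph-disk writes, not code that exists today:

  #include <cstdint>
  #include <string>

  #pragma pack(push, 1)
  struct gpt_header {
    char     signature[8];     // "EFI PART"
    uint32_t revision;
    uint32_t header_size;
    uint32_t header_crc32;
    uint32_t reserved;
    uint64_t current_lba;
    uint64_t backup_lba;
    uint64_t first_usable_lba;
    uint64_t last_usable_lba;
    uint8_t  disk_guid[16];
    uint64_t part_entry_lba;   // where the entry array starts (usually LBA 2)
    uint32_t num_part_entries;
    uint32_t part_entry_size;  // usually 128
    uint32_t part_array_crc32;
  };

  struct gpt_entry {
    uint8_t  type_guid[16];    // all zeros => unused slot
    uint8_t  unique_guid[16];
    uint64_t first_lba;        // partition extent in LBAs
    uint64_t last_lba;
    uint64_t attributes;
    uint16_t name[36];         // UTF-16LE partition label
  };
  #pragma pack(pop)

  // Compare an entry's UTF-16LE label against an ASCII label
  // such as "ceph block.db".
  static bool label_matches(const gpt_entry& e, const std::string& want) {
    for (size_t i = 0; i < 36; ++i) {
      char c = (e.name[i] < 128) ? char(e.name[i]) : '?';
      if (i == want.size()) return c == '\0';   // label must end here too
      if (c != want[i]) return false;
    }
    return want.size() == 36;                   // exact fit, no terminator
  }

The backend would read LBA 1, check the signature and CRCs, walk the entry
array, and offset every I/O by the matching entry's first_lba - that's the
whole of the partition support it needs.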

> 
> Or just forget about easy hot swapping and stick these files on the 
> host's root partition.
> 
> > 
> > > 
> > > The limitation that Orlando identified is not being able to launch
> > > multiple SPDK-based OSDs on a node today.  Igor and Orlando's PR (16966)
> > > adds a config option to limit the number of hugepages assigned to an OSD
> > > via an EAL switch.  The other limitation today is specifying the CPU mask
> > > assigned to each OSD, which would require a config addition.
> > > 
> > 
> > To clarify, DPDK requires a process to declare the amount of memory
> > (hugepages) and which CPU cores the process will use up front. If you want
> > to run multiple DPDK-based processes on the same system, you just have to
> > make sure there are enough hugepages and that the cores you specify don't
> > overlap. That PR is just making it so you can configure these values. I
> > just wanted to clarify that there isn't any deeper technical problem with
> > running multiple DPDK processes on the same system - you all probably knew
> > that but it's best to be clear.
> > 
> > Also, DPDK uses hugepages because it's the only good way to get "pinned"
> > memory in userspace that userspace drivers can DMA into and out of. That's
> > because the kernel doesn't page out or move around hugepages (hugepages
> > also happen to be more efficient TLB-wise given that the data buffers are
> > often large transfers). There is some work on vfio-pci in the kernel that
> > may provide a better solution in the long term, but I'm not totally up to
> > speed on that. Because data must reside in hugepages currently, all buffers
> > sent to the SPDK backend for BlueStore are copied from wherever they are
> > into a buffer from a pool allocated out of hugepages. It would be better if
> > all data buffers were originally allocated from hugepage memory, but that's
> > a bigger change to Ceph of course. Note that incoming packets from DPDK
> > will also reside in hugepages upon DMA from the NIC, which would be
> > convenient except that almost all NVMe devices today don't support fully
> > flexible scatter-gather specification of buffers and you end up forced to
> > copy simply to satisfy the alignment requirements of the DMA engine. Some
> > day though!
> 
> Yeah, we definitely want to get there eventually.  When Ceph sends data 
> over the wire it is preceded by a header that includes an alignment 
> so that (with TCP currently) we read data off the socket into 
> properly aligned memory.  That way we can eventually do O_DIRECT writes 
> with it.  If it's possible to direct what memory the DPDK data comes into 
> we can hopefully do something similar here...  The rest of Ceph's 
> bufferlist library should be flexible enough to enable zero-copy.
> 
> > 
> > Sorry to be so long-winded, but I'm happy to help with SPDK.
> 
> That's great to hear--this was very helpful for me!
> 
> Thanks-
> sage
> 
> 
> 
> 
> > 
> > 
> > Thanks,
> > Ben
> > 
> > > 
> > > Tushar 
> > > 
> > > 
> > > > 
> > > > 
> > > > 
> > > > -----Original Message-----
> > > > From: ceph-devel-owner@xxxxxxxxxxxxxxx 
> > > > [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Dong Wu
> > > > Sent: Tuesday, November 8, 2016 7:45 PM
> > > > To: LIU, Fei <james.liu@xxxxxxxxxxxxxxx>
> > > > Cc: Yehuda Sadeh-Weinraub <yehuda@xxxxxxxxxx>; Sage Weil 
> > > > <sweil@xxxxxxxxxx>; Wang, Haomai <haomaiwang@xxxxxxxxx>; ceph-devel 
> > > > <ceph-devel@xxxxxxxxxxxxxxx>
> > > > Subject: Re: status of spdk
> > > > 
> > > > Hi, Yehuda and Haomai,
> > > >     The DPDK backend may have the same problem. I had tried to use
> > > > haomai's PR https://github.com/ceph/ceph/pull/10748 to test the dpdk
> > > > backend, but failed to start multiple OSDs on a host with only one
> > > > network card. I also read about the dpdk multi-process support:
> > > > http://dpdk.org/doc/guides/prog_guide/multi_proc_support.html, but did
> > > > not find any config to set multi-process support. Am I doing anything
> > > > wrong, or has multi-process support not been implemented?
> > > > 
> > > > 2016-11-09 8:21 GMT+08:00 LIU, Fei <james.liu@xxxxxxxxxxxxxxx>:
> > > > > 
> > > > > 
> > > > > Hi Yehuda and Haomai,
> > > > >    The issue is that drives driven by SPDK cannot be shared by
> > > > > multiple OSDs the way a kernel NVMe drive can, since an SPDK device
> > > > > so far cannot be shared across multiple processes such as OSDs,
> > > > > right?
> > > > > 
> > > > >    Regards,
> > > > >    James
> > > > > 
> > > > > 
> > > > > 
> > > > > On 11/8/16, 4:06 PM, "Yehuda Sadeh-Weinraub"
> > > > > <ceph-devel-owner@xxxxxxxrnel.org on behalf of yehuda@xxxxxxxxxx>
> > > > > wrote:
> > > > > 
> > > > >     On Tue, Nov 8, 2016 at 3:40 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> > > > >     > On Tue, 8 Nov 2016, Yehuda Sadeh-Weinraub wrote:
> > > > >     >> I just started looking at spdk, and have a few comments and
> > > > >     >> questions.
> > > > >     >>
> > > > >     >> First, it's not clear to me how we should handle the build. At
> > > > >     >> the moment the spdk code resides as a submodule in the ceph
> > > > >     >> tree, but it depends on dpdk, which currently needs to be
> > > > >     >> downloaded separately. We can add it as a submodule (upstream
> > > > >     >> is here: git://dpdk.org/dpdk). That being said, getting it to
> > > > >     >> build was a bit tricky and I think it might be broken with
> > > > >     >> cmake. In order to get it working I resorted to building a
> > > > >     >> system library and using that.
> > > > >     >
> > > > >     > Note that this PR is about to merge
> > > > >     >
> > > > >     >         https://github.com/ceph/ceph/pull/10748
> > > > >     >
> > > > >     > which adds the DPDK submodule, so hopefully this issue will go
> > > > >     > away when that merges, or with a follow-on cleanup.
> > > > >     >
> > > > >     >> The way to currently configure an osd to use bluestore with
> > > > >     >> spdk is by creating a symbolic link that replaces the bluestore
> > > > >     >> 'block' device and points to a file whose name is prefixed with
> > > > >     >> 'spdk:'.  Originally I assumed that the suffix would be the
> > > > >     >> nvme device id, but it seems that it's not really needed;
> > > > >     >> however, the file itself needs to contain the device id (see
> > > > >     >> https://github.com/yehudasa/ceph/tree/wip-yehuda-spdk for a
> > > > >     >> couple of minor fixes).
> > > > >     >
> > > > >     > Open a PR for those?
> > > > > 
> > > > >     Sure
> > > > > 
> > > > >     >
> > > > >     >> As I understand it, in order to support multiple osds on the
> > > > >     >> same NVMe device we have a few options. We can leverage NVMe
> > > > >     >> namespaces, but that's not supported on all devices. We can
> > > > >     >> configure bluestore to only use part of the device (device
> > > > >     >> sharding? not sure if it supports it). I think it's best if we
> > > > >     >> could keep bluestore out of the loop there and have the NVMe
> > > > >     >> driver abstract multiple partitions of the NVMe device. The
> > > > >     >> idea is to be able to define multiple partitions on the device
> > > > >     >> (e.g., each partition will be defined by its offset, size, and
> > > > >     >> namespace), and have the osd set to use a specific partition.
> > > > >     >> We'll probably need a special tool to manage it, and
> > > > >     >> potentially keep the partition table information on the device
> > > > >     >> itself. The tool could also manage the creation of the block
> > > > >     >> link. We should probably rethink how the link is structured and
> > > > >     >> what it points at.
> > > > >     >
> > > > >     > I agree that bluestore shouldn't get involved.
> > > > >     >
> > > > >     > Are NVMe namespaces meant to support multiple processes
> > > > >     > sharing the same hardware device?
> > > > > 
> > > > >     More of a partitioning solution, but yes (as far as I understand).
> > > > > 
> > > > >     >
> > > > >     > Also, if you do that, is it possible to give one of the
> > > > >     > namespaces to the kernel?  That might solve the bootstrapping
> > > > >     > problem we currently have
> > > > > 
> > > > >     Theoretically, but not right now (or ever?). See here:
> > > > > 
> > > > >     https://lists.01.org/pipermail/spdk/2016-July/000073.html
> > > > > 
> > > > >     > where we have nowhere to put the $osd_data filesystem with the
> > > > >     > device metadata.  (This is admittedly not necessarily a blocking
> > > > >     > issue.  Putting those dirs on / wouldn't be the end of the
> > > > >     > world; it just means cards can't be easily moved between boxes.)
> > > > >     >
> > > > > 
> > > > >     Maybe we can use bluestore for these too ;) That being said,
> > > > >     there might be some kind of loopback solution that could work,
> > > > >     but I'm not sure it wouldn't create major bottlenecks that we'd
> > > > >     want to avoid.
> > > > > 
> > > > >     Yehuda



