Re: single-threaded seastar-osd

On Wed, Jan 9, 2019 at 10:18 AM Ma, Jianpeng <jianpeng.ma@xxxxxxxxx> wrote:
>
> > -----Original Message-----
> > From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-
> > owner@xxxxxxxxxxxxxxx] On Behalf Of Radoslaw Zarzynski
> > Sent: Wednesday, January 9, 2019 9:31 AM
> > To: Mark Nelson <mnelson@xxxxxxxxxx>
> > Cc: Sage Weil <sweil@xxxxxxxxxx>; kefu chai <tchaikov@xxxxxxxxx>; The
> > Esoteric Order of the Squid Cybernetic <ceph-devel@xxxxxxxxxxxxxxx>; Cheng,
> > Yingxin <yingxin.cheng@xxxxxxxxx>
> > Subject: Re: single-threaded seastar-osd
> >
> > On Tue, Jan 8, 2019 at 12:43 AM Mark Nelson <mnelson@xxxxxxxxxx>
> > wrote:
> > >
> > > I want to know what an OSD means in this context.
> >
> > Let me start by bringing more context on how the concept was born.
> > It came simply from the observation that vendors tend to deploy multiple
> > ceph-osd daemons on a single NVMe device in their performance testing.
> > It's not unusual to see 48 physical cores serving 10 NVMes with 2 OSDs
> > on each, as in Micron's document [1]. This translates into 2.4 physical
> > cores per ceph-osd.
> >
> Because BlueStore can't fully utilize an NVMe device, partitions end up being used.
> However, many customer operations do not allow device partitioning.

Jianpeng, could you be more specific about what these customer operations are?

> This isn't easy to manage and operate.

I agree that creating/managing partitions adds more complexity for the
administrator, but

- it is not worse than what we have now -- we already deal with
partitions on disks today.
- it is not *that* complicated, right?

> I think we should optimize in this direction.
>
> Thanks!
> Jianpeng
> > The proposed design explores the following assumption: if the current RADOS
> > infrastructure was able to withstand the resource (connections, osdmap)
> > inflation in such scenarios, it likely can absorb several times more.
> > Ensuring we truly have the extra capacity is a *crucial* requirement.
> >
> > Personally, I perceive the OSD *concept* as a networked ObjectStore instance
> > exposed over the RADOS protocol.
> >
> > > How should a user
> > > think about it?  How should the user think about the governing process?
> >
> > No different than in the current deployment scenario where multiple OSDs
> > span the same physical device. An OSD would no longer be bound to a disk
> > but rather to a partition.
> >
> > > Josh rightly pointed out to me that when you get right down to it, an
> > > OSD as it exists today is a failure domain.  That's still true here,
> > > but these OSDs seem a lot more like storage shards that theoretically
> > > exist as separate failure domains but for all practical purposes act
> > > as groups.
> >
> > In addition to being the leaf entity of the failure-domain division, I think an OSD
> > is also an entity of the RADOS name resolution (I see the RADOS resolver as a
> > component responsible for translating a pool/object name into a tuple, with ip
> > and port inside, constituting a straight path to an ObjectStore).
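
Just to check my own understanding of the above: a rough sketch of that
resolution chain. All names below are illustrative placeholders, not actual
librados/crimson code, and the modulus merely stands in for CRUSH.

#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Toy sketch of the single-level RADOS name resolution described above.
struct EntityAddr {               // where the OSD's messenger listens
  std::string ip;
  uint16_t port;
};

struct ToyOsdMap {                // stand-in for the real osdmap + CRUSH
  std::vector<EntityAddr> osd_addrs;   // indexed by osd id
  uint32_t pg_num = 128;

  uint32_t object_to_pg(const std::string& object) const {
    return std::hash<std::string>{}(object) % pg_num;     // name -> PG
  }
  int pg_to_osd(uint64_t pool, uint32_t pg) const {
    // the real code runs CRUSH here; a modulus keeps the sketch short
    return static_cast<int>((pool + pg) % osd_addrs.size());
  }
  EntityAddr osd_to_addr(int osd_id) const {
    return osd_addrs[osd_id];                              // OSD -> ip:port
  }
};

// The whole resolver: one level, entirely client side, ending in a straight
// path to exactly one ObjectStore-backed OSD -- no in-OSD crossbar needed.
EntityAddr resolve(const ToyOsdMap& map, uint64_t pool,
                   const std::string& object) {
  uint32_t pg = map.object_to_pg(object);
  int osd_id = map.pg_to_osd(pool, pg);
  return map.osd_to_addr(osd_id);
}

So calling resolve() for a pool/object pair ends directly at the ip:port of a
single ObjectStore-backed OSD, with nothing downstream needing a second
lookup -- which seems to be exactly the property relied on here.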
> >
> > As these concepts are currently glued together, the vendors' strategy of
> > increasing the number of resolution entities is reflected by exposing the
> > physical disk partitioning in e.g. the `osd tree` output. This has its own functional
> > traits. Surely, a more complex deployment is a downside.
> > However, aren't such activities supposed to be hidden by Ansible/Rook/*?
> >
> > > IE are there good architectural reasons to map failure domains down to
> > > "cores" rather than "disks"?  I think we want this because it's
> > > convenient that each OSD shard would have its own msgr and heartbeat
> > > services and we can avoid cross-core communication.  It might even be
> > > the right decision practically, but I'm not sure that conceptually it
> > > really makes a lot of sense to me.
> >
> > Conceptually we would still map to an ObjectStore instance, not a "core".
> > The fact that it can be (and even currently is!) laid down on a block device that is
> > a derivative of another block device looks like an implementation detail of our
> > deployment process. I'm afraid that mapping the failure domain to "disk" was
> > fuzzy even before the NVMe era -- with FileStore consuming a single HDD + a
> > "partition" of a shared SSD.
> >
> > One of the fundamental benefits I see is keeping the RADOS name resolver
> > intact. It still consists of one level only: the CRUSH name resolution. No in-OSD
> > crossbar is necessary. Therefore I expect no desire for a RADOS extension
> > bypassing the new stage by memorizing the mapping it brings.
> > That is, in addition to simplifying the crimson-osd design (stripping all
> > seastar::sharded<...> and seastar::foreign_ptrs), there would be absolutely no
> > modification to the protocol and clients. This means no need for logic
> > handling backward compatibility.
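
If I read the crossbar point correctly, the contrast is roughly the one
below. This is only a sketch against the public Seastar API, with placeholder
types and handlers (PgRef, do_pg_op, handle_op_*) -- not actual crimson code.

#include <seastar/core/future.hh>
#include <seastar/core/smp.hh>

struct PgRef {};   // placeholder for a handle to PG state owned by one reactor

// placeholder for the actual per-PG work
seastar::future<> do_pg_op(PgRef) {
  return seastar::make_ready_future<>();
}

// (a) one OSD per reactor: the messenger, the PGs and the ObjectStore all
//     live on the core that accepted the connection, so an op is handled
//     in place, with no cross-core hop.
seastar::future<> handle_op_single_reactor(PgRef pg) {
  return do_pg_op(pg);
}

// (b) one OSD sharded across reactors: the accepting core may not own the
//     target PG, so every op potentially pays a smp::submit_to() hop and
//     the PG/ObjectStore state has to be reachable via foreign pointers.
seastar::future<> handle_op_sharded(unsigned owning_shard, PgRef pg) {
  return seastar::smp::submit_to(owning_shard, [pg] {
    return do_pg_op(pg);
  });
}

In variant (a) the resolver's answer already names the right reactor, so the
extra hop and the foreign-pointer bookkeeping in (b) simply disappear.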
> >
> > > It's a fair point.  To also play devil's advocate: If you are storing
> > > cache per OSD and the size of each cache grows with the number of
> > > OSDs, what happens as the number of cores / node grows? Maybe we are
> > > ok with current core counts.  Would we still be ok with 256+ cores in
> > > a single node if the number of caches and the size of each cache grow
> > > together?
> >
> > Well, osdmap uses a dedicated mempool. FWIW, local testing and grepping
> > ceph-users for mempool dumps suggest the cache stays in the hundreds-of-KBs
> > range. The (rough!) testing also shows linear growth with the number of
> > OSDs. Still, even tens of MBs per cache instance might be acceptable, as:
> >   * economy class (HDDs) would likely use a single OSD per disk -- no
> >     regression from what we have right now.
> >   * high-end already deploys multiple OSDs per device and memory is of
> >     rather little concern -- just like in the already mentioned case of
> >     powerful enough switches/network infrastructure.
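
For a back-of-envelope feel of the per-node footprint, using only the rough
figures above (the OSD-per-node counts are just illustrative):

#include <cstdio>

int main() {
  // Illustrative OSD-per-node counts (e.g. one OSD per core on big boxes)
  const int osds_per_node[] = {20, 64, 256};
  // Per-OSD osdmap cache: ~hundreds of KBs today, tens of MBs pessimistically
  const double cache_mb[] = {0.5, 10.0, 50.0};

  for (int osds : osds_per_node)
    for (double mb : cache_mb)
      std::printf("%3d OSDs x %5.1f MB/cache = %8.1f MB per node\n",
                  osds, mb, osds * mb);
}

Even the most pessimistic corner here (256 OSDs x 50 MB) is ~12.8 GB per
node, while the hundreds-of-KBs figure keeps 256 OSDs well under 1 GB --
which seems consistent with the "little concern on high-end boxes" argument
above.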
> >
> > Regards,
> > Radek
> >
> > [1] Micron 9200 MAX NVMe SSDs + Red Hat Ceph Storage 3.0,
> > Reference Architecture



-- 
Regards
Kefu Chai



