> -----Original Message-----
> From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Radoslaw Zarzynski
> Sent: Wednesday, January 9, 2019 9:31 AM
> To: Mark Nelson <mnelson@xxxxxxxxxx>
> Cc: Sage Weil <sweil@xxxxxxxxxx>; kefu chai <tchaikov@xxxxxxxxx>; The Esoteric Order of the Squid Cybernetic <ceph-devel@xxxxxxxxxxxxxxx>; Cheng, Yingxin <yingxin.cheng@xxxxxxxxx>
> Subject: Re: single-threaded seastar-osd
>
> On Tue, Jan 8, 2019 at 12:43 AM Mark Nelson <mnelson@xxxxxxxxxx> wrote:
> >
> > I want to know what an OSD means in this context.
>
> Let me start by bringing more context on how the concept was born.
> It simply came from the observation that vendors tend to deploy multiple
> ceph-osd daemons on a single NVMe device in their performance testing.
> It's not unusual to see 48 physical cores serving 10 NVMes with 2 OSDs
> on each, as in Micron's document [1]. This translates into 2.4 physical
> cores per ceph-osd.

Because BlueStore can't fully utilize an NVMe device, partitioning ends up
being used. However, many customers' operational policies don't allow
device partitioning, and partitions aren't easy to manage and operate.
I think this is the direction we should optimize in.

Thanks!
Jianpeng

> The proposed design explores the following assumption: if the current
> RADOS infrastructure was able to withstand the resource (connections,
> osdmap) inflation in such scenarios, it likely can absorb several times
> more. Ensuring we truly have the extra capacity is a *crucial* requirement.
>
> Personally I perceive the OSD *concept* as a networked ObjectStore
> instance exposed over the RADOS protocol.
>
> > How should a user
> > think about it? How should the user think about the governing process?
>
> No different than in the current deployment scenario where multiple OSDs
> span the same physical device. An OSD would no longer be bound to a disk
> but rather to a partition.
>
> > Josh rightly pointed out to me that when you get right down to it, an
> > OSD as it exists today is a failure domain. That's still true here,
> > but these OSDs seem a lot more like storage shards that theoretically
> > exist as separate failure domains but for all practical purposes act
> > as groups.
>
> In addition to being the leaf entity of the failure-domain division, I
> think the OSD is also an entity of the RADOS name resolution (I see the
> RADOS resolver as a component responsible for translating a pool/object
> name into a tuple, with an IP and port inside, constituting a straight
> path to an ObjectStore).
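As an aside, here is a toy sketch of the two-stage resolution described
above. It is not the actual Ceph code: OSDAddr, object_to_pg and pg_to_osds
are made-up names, and the hashing/placement below are simplified stand-ins
for the real hashing and CRUSH computation.

// Toy sketch of the two-stage RADOS name resolution (illustration only,
// not the Ceph implementation).
#include <cstdint>
#include <functional>
#include <iostream>
#include <string>
#include <vector>

struct OSDAddr { std::string ip; uint16_t port; };  // endpoint of one ObjectStore

// Stage 1: object name -> placement group id (hash modulo pg_num).
uint32_t object_to_pg(const std::string& oid, uint32_t pg_num) {
  return std::hash<std::string>{}(oid) % pg_num;
}

// Stage 2: pg id -> acting set of OSDs (toy placement instead of CRUSH),
// then OSD id -> ip:port via the osdmap. There is no further in-OSD
// crossbar: each id resolves directly to one ObjectStore instance.
std::vector<OSDAddr> pg_to_osds(uint32_t pgid,
                                const std::vector<OSDAddr>& osdmap,
                                unsigned replicas = 3) {
  std::vector<OSDAddr> acting;
  for (unsigned r = 0; r < replicas && r < osdmap.size(); ++r)
    acting.push_back(osdmap[(pgid + r) % osdmap.size()]);
  return acting;
}

int main() {
  std::vector<OSDAddr> osdmap = {      // e.g. 4 single-core OSDs on one NVMe
    {"10.0.0.1", 6800}, {"10.0.0.1", 6801},
    {"10.0.0.1", 6802}, {"10.0.0.1", 6803},
  };
  uint32_t pgid = object_to_pg("rbd_data.1234", 128);
  for (const auto& a : pg_to_osds(pgid, osdmap))
    std::cout << a.ip << ":" << a.port << "\n";
}

The point being that the id in the acting set already names exactly one
ObjectStore endpoint, so no second, in-OSD resolution hop is needed.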
> As these concepts are currently glued together, the vendors' strategy to
> increase the number of resolution entities is reflected by exposing the
> physical disk partitioning in e.g. `osd tree` output. This has its own
> functional traits. Surely, a more complex deployment is a downside.
> However, aren't such activities supposed to be hidden by Ansible/Rook/*?
>
> > IE are there good architectural reasons to map failure domains down to
> > "cores" rather than "disks"? I think we want this because it's
> > convenient that each OSD shard would have its own msgr and heartbeat
> > services and we can avoid cross-core communication. It might even be
> > the right decision practically, but I'm not sure that conceptually it
> > really makes a lot of sense to me.
>
> Conceptually we would still map to an ObjectStore instance, not a "core".
> The fact that it can be (and even currently is!) laid down on a block
> device that is a derivative of another block device looks like an
> implementation detail of our deployment process. I'm afraid that mapping
> the failure domain to "disk" was fuzzy even before the NVMe era -- with
> FileStore consuming a single HDD + a "partition" of a shared SSD.
>
> One of the fundamental benefits I see is keeping the RADOS name resolver
> intact. It still consists of one level only: the CRUSH name resolution.
> No in-OSD crossbar is necessary. Therefore I expect no desire for a RADOS
> extension bypassing the new stage by memorizing the mapping it brings.
> That is, in addition to simplifying the crimson-osd design (stripping all
> seastar::sharded<...> and seastar::foreign_ptrs), there would be
> absolutely no modification to the protocol and clients. This means no
> need for logic handling backward compatibility.
>
> > It's a fair point. To also play devil's advocate: If you are storing
> > cache per OSD and the size of each cache grows with the number of
> > OSDs, what happens as the number of cores / node grows? Maybe we are
> > ok with current core counts. Would we still be ok with 256+ cores in
> > a single node if the number of caches and the size of each cache grows
> > together?
>
> Well, osdmap uses a dedicated mempool. FWIW, local testing and grepping
> ceph-users for mempool_dumps suggest the cache stays in the
> hundreds-of-KBs range. The (rough!) testing also shows linear growth with
> the number of OSDs. Still, even tens of MBs per cache instance might be
> acceptable as:
> * the economy class (HDDs) would likely use a single OSD per disk -- no
>   regression from what we have right now;
> * the high end already deploys multiple OSDs per device, and memory is a
>   rather minor concern there -- just like in the already pointed out case
>   of powerful enough switches/network infrastructure.
>
> Regards,
> Radek
>
> [1] Micron® 9200 MAX NVMe™ SSDs + Red Hat® Ceph Storage 3.0, Reference Architecture
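To put Mark's 256+ core question in numbers, a quick back-of-envelope
sketch. The per-cache sizes below are only assumptions spanning the rough
bounds quoted above (hundreds of KBs up to tens of MBs per osdmap cache),
not measurements:

// Rough per-node osdmap cache footprint under the one-OSD-per-core model.
// All figures are assumed inputs, not measured values.
#include <cstdio>

int main() {
  const int osds_per_node = 256;                    // the 256+ core scenario
  const double cache_mb[] = {0.5, 2.0, 10.0, 50.0}; // assumed per-OSD cache sizes
  for (double mb : cache_mb)
    std::printf("%6.1f MB/cache x %d OSDs = %8.1f MB (%.2f GB) per node\n",
                mb, osds_per_node, mb * osds_per_node,
                mb * osds_per_node / 1024.0);
}

Even at the pessimistic tens-of-MBs end this stays in the low GBs per node,
which is the basis for treating memory as a minor concern on high-end boxes.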