> -----Original Message-----
> From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Radoslaw Zarzynski
> Sent: Wednesday, January 9, 2019 9:31 AM
> To: Mark Nelson <mnelson@xxxxxxxxxx>
> Cc: Sage Weil <sweil@xxxxxxxxxx>; kefu chai <tchaikov@xxxxxxxxx>; The Esoteric Order of the Squid Cybernetic <ceph-devel@xxxxxxxxxxxxxxx>; Cheng, Yingxin <yingxin.cheng@xxxxxxxxx>
> Subject: Re: single-threaded seastar-osd
>
> On Tue, Jan 8, 2019 at 12:43 AM Mark Nelson <mnelson@xxxxxxxxxx> wrote:
> >
> > I want to know what an OSD means in this context.
>
> Let me start by bringing more context on how the concept was born.
> It simply came from the observation that vendors tend to deploy multiple
> ceph-osd daemons on a single NVMe device in their performance testing.
> It's not unusual to see 48 physical cores serving 10 NVMes with 2 OSDs
> on each, as in Micron's document [1]. This translates into 2.4 physical
> cores per ceph-osd.

Because BlueStore can't fully utilize an NVMe device, partitioning ends up
being used. However, many customers' operational policies don't allow
device partitioning, and partitions aren't easy to manage and operate.
I think this is the direction we should optimize in.

Thanks!
Jianpeng

> The proposed design explores the following assumption: if the current
> RADOS infrastructure was able to withstand the resource (connections,
> osdmap) inflation in such scenarios, it likely can absorb several times
> more. Ensuring we truly have the extra capacity is a *crucial* requirement.
>
> Personally I perceive the OSD *concept* as a networked ObjectStore
> instance exposed over the RADOS protocol.
>
> > How should a user
> > think about it? How should the user think about the governing process?
>
> No different than in the current deployment scenario where multiple OSDs
> span the same physical device. An OSD would no longer be bound to a disk
> but rather to a partition.
>
> > Josh rightly pointed out to me that when you get right down to it, an
> > OSD as it exists today is a failure domain. That's still true here,
> > but these OSDs seem a lot more like storage shards that theoretically
> > exist as separate failure domains but for all practical purposes act
> > as groups.
>
> In addition to being the leaf entity of the failure-domain division, I
> think the OSD is also an entity of the RADOS name resolution (I see the
> RADOS resolver as a component responsible for translating a pool/object
> name into a tuple, with an IP and port inside, constituting a straight
> path to an ObjectStore).
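As an aside, here is a toy sketch of the two-stage resolution described
above. It is not the actual Ceph code: OSDAddr, object_to_pg and pg_to_osds
are made-up names, and the hashing/placement below are simplified stand-ins
for the real hashing and CRUSH computation.

// Toy sketch of the two-stage RADOS name resolution (illustration only,
// not the Ceph implementation).
#include <cstdint>
#include <functional>
#include <iostream>
#include <string>
#include <vector>

struct OSDAddr { std::string ip; uint16_t port; };  // endpoint of one ObjectStore

// Stage 1: object name -> placement group id (hash modulo pg_num).
uint32_t object_to_pg(const std::string& oid, uint32_t pg_num) {
  return std::hash<std::string>{}(oid) % pg_num;
}

// Stage 2: pg id -> acting set of OSDs (toy placement instead of CRUSH),
// then OSD id -> ip:port via the osdmap. There is no further in-OSD
// crossbar: each id resolves directly to one ObjectStore instance.
std::vector<OSDAddr> pg_to_osds(uint32_t pgid,
                                const std::vector<OSDAddr>& osdmap,
                                unsigned replicas = 3) {
  std::vector<OSDAddr> acting;
  for (unsigned r = 0; r < replicas && r < osdmap.size(); ++r)
    acting.push_back(osdmap[(pgid + r) % osdmap.size()]);
  return acting;
}

int main() {
  std::vector<OSDAddr> osdmap = {      // e.g. 4 single-core OSDs on one NVMe
    {"10.0.0.1", 6800}, {"10.0.0.1", 6801},
    {"10.0.0.1", 6802}, {"10.0.0.1", 6803},
  };
  uint32_t pgid = object_to_pg("rbd_data.1234", 128);
  for (const auto& a : pg_to_osds(pgid, osdmap))
    std::cout << a.ip << ":" << a.port << "\n";
}

The point being that the id in the acting set already names exactly one
ObjectStore endpoint, so no second, in-OSD resolution hop is needed.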
> As these concepts are currently glued together, the vendors' strategy to
> increase the number of resolution entities is reflected by exposing the
> physical disk partitioning in e.g. `osd tree` output. This has its own
> functional traits. Surely, a more complex deployment is a downside.
> However, aren't such activities supposed to be hidden by Ansible/Rook/*?
>
> > IE are there good architectural reasons to map failure domains down to
> > "cores" rather than "disks"? I think we want this because it's
> > convenient that each OSD shard would have its own msgr and heartbeat
> > services and we can avoid cross-core communication. It might even be
> > the right decision practically, but I'm not sure that conceptually it
> > really makes a lot of sense to me.
>
> Conceptually we would still map to an ObjectStore instance, not a "core".
> The fact that it can be (and even currently is!) laid down on a block
> device that is a derivative of another block device looks like an
> implementation detail of our deployment process. I'm afraid that mapping
> the failure domain to "disk" was fuzzy even before the NVMe era -- with
> FileStore consuming a single HDD + a "partition" of a shared SSD.
>
> One of the fundamental benefits I see is keeping the RADOS name resolver
> intact. It still consists of one level only: the CRUSH name resolution.
> No in-OSD crossbar is necessary. Therefore I expect no desire for a RADOS
> extension bypassing the new stage by memorizing the mapping it brings.
> That is, in addition to simplifying the crimson-osd design (stripping all
> seastar::sharded<...> and seastar::foreign_ptrs), there would be
> absolutely no modification to the protocol and clients. This means no
> need for logic handling backward compatibility.
>
> > It's a fair point. To also play devil's advocate: If you are storing
> > cache per OSD and the size of each cache grows with the number of
> > OSDs, what happens as the number of cores / node grows? Maybe we are
> > ok with current core counts. Would we still be ok with 256+ cores in
> > a single node if the number of caches and the size of each cache grows
> > together?
>
> Well, osdmap uses a dedicated mempool. FWIW, local testing and grepping
> ceph-users for mempool_dumps suggest the cache stays in the
> hundreds-of-KBs range. The (rough!) testing also shows linear growth with
> the number of OSDs. Still, even tens of MBs per cache instance might be
> acceptable as:
> * the economy class (HDDs) would likely use a single OSD per disk -- no
>   regression from what we have right now;
> * the high end already deploys multiple OSDs per device, and memory is a
>   rather minor concern there -- just like in the already pointed out case
>   of powerful enough switches/network infrastructure.
>
> Regards,
> Radek
>
> [1] Micron® 9200 MAX NVMe™ SSDs + Red Hat® Ceph Storage 3.0, Reference Architecture
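To put Mark's 256+ core question in numbers, a quick back-of-envelope
sketch. The per-cache sizes below are only assumptions spanning the rough
bounds quoted above (hundreds of KBs up to tens of MBs per osdmap cache),
not measurements:

// Rough per-node osdmap cache footprint under the one-OSD-per-core model.
// All figures are assumed inputs, not measured values.
#include <cstdio>

int main() {
  const int osds_per_node = 256;                    // the 256+ core scenario
  const double cache_mb[] = {0.5, 2.0, 10.0, 50.0}; // assumed per-OSD cache sizes
  for (double mb : cache_mb)
    std::printf("%6.1f MB/cache x %d OSDs = %8.1f MB (%.2f GB) per node\n",
                mb, osds_per_node, mb * osds_per_node,
                mb * osds_per_node / 1024.0);
}

Even at the pessimistic tens-of-MBs end this stays in the low GBs per node,
which is the basis for treating memory as a minor concern on high-end boxes.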