On Tue, Jan 8, 2019 at 12:43 AM Mark Nelson <mnelson@xxxxxxxxxx> wrote:
>
> I want to know what an OSD means in this context.

Let me start by bringing more context on how the concept was born. It came
from the observation that vendors tend to deploy multiple ceph-osd daemons
on a single NVMe device in their performance testing. It's not unusual to
see 48 physical cores serving 10 NVMes with 2 OSDs on each, as in Micron's
document [1]. This translates into 2.4 physical cores per ceph-osd.

The proposed design explores the following assumption: if the current
RADOS infrastructure was able to withstand the resource (connections,
osdmap) inflation in such scenarios, it can likely absorb several times
more. Ensuring we truly have the extra capacity is a *crucial* requirement.

Personally, I perceive the OSD *concept* as a networked ObjectStore
instance exposed over the RADOS protocol.

> How should a user think about it? How should the user think about the
> governing process?

No different than in the current deployment scenario where multiple OSDs
span the same physical device. An OSD would no longer be bound to a disk
but rather to a partition.

> Josh rightly pointed out to me that when you get right down to it, an
> OSD as it exists today is a failure domain. That's still true here, but
> these OSDs seem a lot more like storage shards that theoretically exist
> as separate failure domains but for all practical purposes act as
> groups.

In addition to being the leaf entity of the failure domain division, I
think an OSD is also an entity of RADOS name resolution (I see the RADOS
resolver as the component responsible for translating a pool/object name
into a tuple, with an ip and port inside, constituting a straight path to
an ObjectStore). As these concepts are currently glued together, the
vendors' strategy of increasing the number of resolution entities is
reflected by exposing the physical disk partitioning in e.g. `osd tree`
output. This has its own functional traits. Surely, a more complex
deployment is a downside. However, aren't such activities supposed to be
hidden by Ansible/Rook/*?

> IE are there good architectural reasons to map failure domains
> down to "cores" rather than "disks"? I think we want this because it's
> convenient that each OSD shard would have it's own msgr and heartbeat
> services and we can avoid cross-core communication. It might even be
> the right decision practically, but I'm not sure that conceptually it
> really makes a lot of sense to me.

Conceptually we would still map to an ObjectStore instance, not a "core".
The fact that it can be (and currently even is!) laid down on a block
device derived from another block device looks like an implementation
detail of our deployment process. I'm afraid that mapping the failure
domain to "disk" was fuzzy even before the NVMe era -- with FileStore
consuming a single HDD + a "partition" of a shared SSD.

One of the fundamental benefits I see is keeping the RADOS name resolver
intact. It still consists of one level only: the CRUSH name resolution. No
in-OSD crossbar is necessary. Therefore I expect no desire for a RADOS
extension bypassing the new stage by memorizing the mapping it brings.
That is, in addition to simplifying the crimson-osd design (stripping all
seastar::sharded<...> and seastar::foreign_ptrs), there would be
absolutely no modification to the protocol and clients. This means no need
for logic handling backward compatibility.
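To make the "one resolution level only" point concrete, here is a minimal,
purely illustrative Python sketch of the client-side path I have in mind.
The names, the hash, and the placement function are made up for the
example -- they are not the real CRUSH algorithm or the librados API --
but the shape is the point: pool/object name -> PG -> OSD id -> ip:port of
a single ObjectStore, with no extra in-OSD "which shard?" stage, no matter
how many OSDs happen to share one NVMe:

    # Illustrative only: hypothetical names, not the real CRUSH/librados API.
    import hashlib

    PG_NUM = 128                 # pg_num of the (single) example pool

    # osd id -> (ip, port) of the ceph-osd fronting one ObjectStore instance.
    # Two OSDs carved out of the same NVMe are simply two independent entries.
    OSD_ADDRS = {
        0: ("10.0.0.1", 6800),
        1: ("10.0.0.1", 6801),   # same host, same NVMe, different partition
        2: ("10.0.0.2", 6800),
    }

    def object_to_pg(pool_id, oid):
        """Stable hash of the object name into one of the pool's PGs."""
        h = int.from_bytes(hashlib.sha1(oid.encode()).digest()[:4], "little")
        return (pool_id, h % PG_NUM)

    def crush_like_placement(pg):
        """Stand-in for CRUSH: PG -> primary OSD id, in a single step."""
        pool_id, ps = pg
        osd_ids = sorted(OSD_ADDRS)
        return osd_ids[ps % len(osd_ids)]

    def resolve(pool_id, oid):
        """The whole resolver: one level, ending at one ObjectStore."""
        pg = object_to_pg(pool_id, oid)
        osd = crush_like_placement(pg)
        return OSD_ADDRS[osd]

    print(resolve(1, "rbd_data.1234"))   # e.g. ('10.0.0.1', 6801)

Deploying more ceph-osd daemons per device only grows the OSD_ADDRS table
(and, in reality, the CRUSH map and osdmap behind it); it does not add a
second resolution stage that clients would have to learn -- which is why
no protocol or client change is needed.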
> It's a fair point. To also play devil's advocate: If you are storing
> cache per OSD and the size of each cache grows with the number of OSDs,
> what happens as the number of cores / node grows? Maybe we are ok with
> current core counts. Would we still be ok with 256+ cores in a single
> node if the number of caches and the size of each cache grows together?

Well, osdmap uses a dedicated mempool. FWIW, local testing and grepping
ceph-users for mempool dumps suggest the cache stays in the
hundreds-of-KBs range. The (rough!) testing also shows linear growth with
the number of OSDs (see the back-of-the-envelope sketch in the PS below).
Still, even tens of MBs per cache instance might be acceptable, as:

* economy class (HDDs) would likely use a single OSD per disk -- no
  regression from what we have right now;
* high end already deploys multiple OSDs per device, and memory is of
  rather little concern there -- just like in the already pointed out
  case of powerful enough switches/network infrastructure.

Regards,
Radek

[1] Micron® 9200 MAX NVMe™ SSDs + Red Hat® Ceph Storage 3.0 Reference
    Architecture
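PS: a back-of-the-envelope sketch of the linear-growth assumption above,
in Python. The per-OSD cache sizes are illustrative placeholders (plug in
the figures from an actual mempool dump), not measurements:

    # Linear model only: each ceph-osd keeps its own osdmap cache.
    def osdmap_cache_per_node(osds_per_node, cache_per_osd_bytes):
        return osds_per_node * cache_per_osd_bytes

    KB, MB = 1024, 1024 * 1024

    # Hundreds-of-KBs-per-OSD case (what the rough local testing suggests):
    print(osdmap_cache_per_node(20, 500 * KB) / MB, "MB")   # ~9.8 MB / node

    # Pessimistic tens-of-MBs-per-OSD case on a dense 256-core box running
    # one OSD per core:
    print(osdmap_cache_per_node(256, 20 * MB) / MB, "MB")   # 5120.0 MB / node

Even the pessimistic row stays in the low-GBs-per-node range -- the kind
of memory the high-end deployments mentioned above already treat as a
rather little concern.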