Re: single-threaded seastar-osd

> 
>> Because BlueStore can't fully utilize an NVMe device, partitions end up being used.

Wasn't that because of the single-threaded finisher?
https://www.spinics.net/lists/ceph-devel/msg39009.html

If so, I think that has been fixed in Nautilus?


----- Original Message -----
From: "Ma, Jianpeng" <jianpeng.ma@xxxxxxxxx>
To: "Radoslaw Zarzynski" <rzarzyns@xxxxxxxxxx>, "Mark Nelson" <mnelson@xxxxxxxxxx>
Cc: "Sage Weil" <sweil@xxxxxxxxxx>, "kefu chai" <tchaikov@xxxxxxxxx>, "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx>, "Cheng, Yingxin" <yingxin.cheng@xxxxxxxxx>
Sent: Wednesday, January 9, 2019 03:18:50
Subject: RE: single-threaded seastar-osd

> -----Original Message----- 
> From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel- 
> owner@xxxxxxxxxxxxxxx] On Behalf Of Radoslaw Zarzynski 
> Sent: Wednesday, January 9, 2019 9:31 AM 
> To: Mark Nelson <mnelson@xxxxxxxxxx> 
> Cc: Sage Weil <sweil@xxxxxxxxxx>; kefu chai <tchaikov@xxxxxxxxx>; The 
> Esoteric Order of the Squid Cybernetic <ceph-devel@xxxxxxxxxxxxxxx>; Cheng, 
> Yingxin <yingxin.cheng@xxxxxxxxx> 
> Subject: Re: single-threaded seastar-osd 
> 
> On Tue, Jan 8, 2019 at 12:43 AM Mark Nelson <mnelson@xxxxxxxxxx> 
> wrote: 
> > 
> > I want to know what an OSD means in this context. 
> 
> Let me start by bringing more context on how the concept was born. 
> It came just from the observation that vendors tend to deploy multiple ceph- 
> osd daemons on a single NVMe device in their performance testing. 
> It's not unusual to see 48 physical cores serving 10 NVMes with 2 OSDs on 
> each, as in Micron's document [1]. This translates into 2.4 physical cores 
> per ceph-osd. 
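(For clarity, the arithmetic behind that figure: 48 physical cores / (10 NVMe drives x 2 OSDs each) = 2.4 physical cores per ceph-osd.)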
> 
Because BlueStore can't fully utilize an NVMe device, partitions end up being used. 
However, many customers' operational practices do not allow device partitioning, 
and partitioned deployments are harder to manage and operate. 
I think this is the direction we should optimize in. 

Thanks! 
Jianpeng 
> The proposed design explores the following assumption: if the current RADOS 
> infrastructure was able to withstand the resource (connections, osdmap) 
> inflation in such scenarios, it likely can absorb several times more. 
> Ensuring we truly have the extra capacity is a *crucial* requirement. 
> 
> Personally I perceive the OSD *concept* as a networked ObjectStore instance 
> exposed over the RADOS protocol. 
> 
> > How should a user 
> > think about it? How should the user think about the governing process? 
> 
> No differently than in the current deployment scenario where multiple OSDs 
> span the same physical device. An OSD would no longer be bound to a disk 
> but rather to a partition. 
> 
> > Josh rightly pointed out to me that when you get right down to it, an 
> > OSD as it exists today is a failure domain. That's still true here, 
> > but these OSDs seem a lot more like storage shards that theoretically 
> > exist as separate failure domains but for all practical purposes act 
> > as groups. 
> 
> In addition to being a leaf entity of the failure-domain division, I think an OSD is 
> also an entity of RADOS name resolution (I see the RADOS resolver as the 
> component responsible for translating a pool/object name into a tuple, with an IP 
> and port inside, constituting a straight path to an ObjectStore). 
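
To make that "resolver" idea concrete, here is a minimal C++ sketch of the two logical steps (object name -> PG, PG -> OSD id -> address). It is purely illustrative and not Ceph code: std::hash and the modulo placement stand in for Ceph's real object-name hashing and CRUSH, and the tiny osdmap table is invented for the example.

// Illustrative sketch only -- not Ceph code.
#include <cstdint>
#include <functional>
#include <iostream>
#include <map>
#include <string>

struct osd_addr { std::string ip; uint16_t port; };

int main() {
  // Toy stand-in for the osdmap: OSD id -> network address.
  std::map<int, osd_addr> osdmap = {
    {0, {"10.0.0.1", 6800}}, {1, {"10.0.0.2", 6800}}, {2, {"10.0.0.3", 6800}},
  };
  const unsigned pg_num = 128;
  std::string object = "rbd_data.1234";

  // Step 1: object name -> placement group (Ceph hashes the object name;
  // std::hash stands in here).
  unsigned pg = std::hash<std::string>{}(object) % pg_num;
  // Step 2: PG -> OSD id (Ceph uses CRUSH; a plain modulo stands in here),
  // then OSD id -> ip:port, i.e. the "straight path to an ObjectStore".
  int osd_id = static_cast<int>(pg % osdmap.size());
  const osd_addr& a = osdmap.at(osd_id);

  std::cout << object << " -> pg " << pg << " -> osd." << osd_id
            << " -> " << a.ip << ":" << a.port << "\n";
  return 0;
}
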
> 
> As these concepts are currently glued together, the vendors' strategy of 
> increasing the number of resolution entities is reflected by exposing the 
> physical disk partitioning in e.g. `osd tree` output. This has its own functional 
> traits. Surely, a more complex deployment is a downside. 
> However, aren't such activities supposed to be hidden by Ansible/Rook/*? 
> 
> > I.e., are there good architectural reasons to map failure domains down to 
> > "cores" rather than "disks"? I think we want this because it's 
> > convenient that each OSD shard would have its own msgr and heartbeat 
> > services and we can avoid cross-core communication. It might even be 
> > the right decision practically, but I'm not sure that conceptually it 
> > really makes a lot of sense to me. 
> 
> Conceptually we would still map to an ObjectStore instance, not a "core". 
> The fact that it can be (and even currently is!) laid down on a block device that is a 
> derivative of another block device looks like an implementation detail of our 
> deployment process. I'm afraid that mapping the failure domain to "disk" was 
> fuzzy even before the NVMe era -- with FileStore consuming a single HDD + a 
> "partition" of a shared SSD. 
> 
> One of the fundamental benefits I see is keeping the RADOS name resolver 
> intact. It still consists of one level only: the CRUSH name resolution. No in-OSD 
> crossbar is necessary. Therefore I expect no desire for a RADOS extension 
> bypassing the new stage by memorizing the mapping it brings. 
> That is, in addition to simplifying the crimson-osd design (stripping all 
> seastar::sharded<...> and seastar::foreign_ptrs), there would be absolutely no 
> modification to the protocol and clients. This means no need for logic 
> handling backward compatibility. 
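
For readers less familiar with those Seastar primitives, the rough sketch below (my illustration, assuming a reasonably recent Seastar; this is not crimson-osd code) shows the kind of machinery a single-reactor OSD would get to drop: a seastar::sharded<> wrapper plus invoke_on() hops for cross-core calls, versus simply owning one service instance on one reactor and calling it directly.

// Rough sketch, not crimson code; requires Seastar.
#include <seastar/core/app-template.hh>
#include <seastar/core/future.hh>
#include <seastar/core/sharded.hh>
#include <iostream>

struct osd_service {
  seastar::future<> handle_op() {
    std::cout << "handling op\n";
    return seastar::make_ready_future<>();
  }
  seastar::future<> stop() { return seastar::make_ready_future<>(); }
};

int main(int argc, char** argv) {
  seastar::app_template app;
  return app.run(argc, argv, [] {
    // Multi-reactor style: one osd_service copy per core, and every
    // cross-core call goes through invoke_on(). A single-threaded OSD
    // would instead hold exactly one osd_service and call it directly.
    auto svc = std::make_shared<seastar::sharded<osd_service>>();
    return svc->start().then([svc] {
      return svc->invoke_on(0, [](osd_service& s) { return s.handle_op(); });
    }).then([svc] {
      return svc->stop();
    });
  });
}
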
> 
> > It's a fair point. To also play devil's advocate: If you are storing 
> > cache per OSD and the size of each cache grows with the number of 
> > OSDs, what happens as the number of cores / node grows? Maybe we are 
> > ok with current core counts. Would we still be ok with 256+ cores in 
> > a single node if the number of caches and the size of each cache grow 
> > together? 
> 
> Well, osdmap uses a dedicated mempool. FWIW, local testing and grepping 
> ceph-users for mempool_dumps suggest the cache stays in the hundreds-of-KBs 
> range. The (rough!) testing also shows linear growth with the number of 
> OSDs. Still, even tens of MBs per cache instance might be acceptable as: 
> * economy class (HDDs) would likely use a single OSD per disk -- no 
> regression from what we have right now. 
> * high-end already deploys multiple OSDs per device and memory is of rather 
> little concern -- just like in the already-pointed-out case of powerful 
> enough switches/network infrastructure. 
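
To put rough, illustrative numbers on that (assuming one single-threaded OSD per core and the per-cache sizes mentioned above; these are not measurements):

  256 OSDs/node x ~0.5 MB of osdmap cache each ~= 128 MB per node
  256 OSDs/node x ~20  MB of osdmap cache each ~=   5 GB per node

So the "hundreds of KBs" case stays negligible even at 256+ cores, while the pessimistic tens-of-MBs case leans on the "high-end deployments have memory to spare" argument above.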
> 
> Regards, 
> Radek 
> 
> [1] Micron® 9200 MAX NVMe™ SSDs + Red Hat® Ceph Storage 3.0 
> Reference Architecture 



