On Tue, Jan 8, 2019 at 12:43 AM Mark Nelson <mnelson@xxxxxxxxxx> wrote:
>
> I want to know what an OSD means in this context.

Let me start by bringing more context on how the concept was born. It came
from the observation that vendors tend to deploy multiple ceph-osd daemons
on a single NVMe device in their performance testing. It's not unusual to
see 48 physical cores serving 10 NVMes with 2 OSDs on each, as in Micron's
document [1]. This translates into 2.4 physical cores per ceph-osd.

The proposed design explores the following assumption: if the current
RADOS infrastructure was able to withstand the resource (connections,
osdmap) inflation in such scenarios, it can likely absorb several times
more. Ensuring we truly have the extra capacity is a *crucial* requirement.

Personally, I perceive the OSD *concept* as a networked ObjectStore
instance exposed over the RADOS protocol.

> How should a user think about it? How should the user think about the
> governing process?

No different than in the current deployment scenario where multiple OSDs
span the same physical device. An OSD would no longer be bound to a disk
but rather to a partition.

> Josh rightly pointed out to me that when you get right down to it, an
> OSD as it exists today is a failure domain. That's still true here, but
> these OSDs seem a lot more like storage shards that theoretically exist
> as separate failure domains but for all practical purposes act as
> groups.

In addition to being the leaf entity of the failure domain division, I
think an OSD is also an entity of RADOS name resolution (I see the RADOS
resolver as the component responsible for translating a pool/object name
into a tuple, with an ip and port inside, constituting a straight path to
an ObjectStore). As these concepts are currently glued together, the
vendors' strategy of increasing the number of resolution entities is
reflected by exposing the physical disk partitioning in e.g. `osd tree`
output. This has its own functional traits. Surely, a more complex
deployment is a downside. However, aren't such activities supposed to be
hidden by Ansible/Rook/*?

> IE are there good architectural reasons to map failure domains
> down to "cores" rather than "disks"? I think we want this because it's
> convenient that each OSD shard would have it's own msgr and heartbeat
> services and we can avoid cross-core communication. It might even be
> the right decision practically, but I'm not sure that conceptually it
> really makes a lot of sense to me.

Conceptually we would still map to an ObjectStore instance, not a "core".
The fact that it can be (and currently even is!) laid down on a block
device derived from another block device looks like an implementation
detail of our deployment process. I'm afraid that mapping the failure
domain to "disk" was fuzzy even before the NVMe era -- with FileStore
consuming a single HDD + a "partition" of a shared SSD.

One of the fundamental benefits I see is keeping the RADOS name resolver
intact. It still consists of one level only: the CRUSH name resolution. No
in-OSD crossbar is necessary. Therefore I expect no desire for a RADOS
extension bypassing the new stage by memorizing the mapping it brings.
That is, in addition to simplifying the crimson-osd design (stripping all
seastar::sharded<...> and seastar::foreign_ptrs), there would be
absolutely no modification to the protocol and clients. This means no need
for logic handling backward compatibility.
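To make the "one resolution level only" point concrete, here is a minimal,
purely illustrative Python sketch of the client-side path I have in mind.
The names, the hash, and the placement function are made up for the
example -- they are not the real CRUSH algorithm or the librados API --
but the shape is the point: pool/object name -> PG -> OSD id -> ip:port of
a single ObjectStore, with no extra in-OSD "which shard?" stage, no matter
how many OSDs happen to share one NVMe:

    # Illustrative only: hypothetical names, not the real CRUSH/librados API.
    import hashlib

    PG_NUM = 128                 # pg_num of the (single) example pool

    # osd id -> (ip, port) of the ceph-osd fronting one ObjectStore instance.
    # Two OSDs carved out of the same NVMe are simply two independent entries.
    OSD_ADDRS = {
        0: ("10.0.0.1", 6800),
        1: ("10.0.0.1", 6801),   # same host, same NVMe, different partition
        2: ("10.0.0.2", 6800),
    }

    def object_to_pg(pool_id, oid):
        """Stable hash of the object name into one of the pool's PGs."""
        h = int.from_bytes(hashlib.sha1(oid.encode()).digest()[:4], "little")
        return (pool_id, h % PG_NUM)

    def crush_like_placement(pg):
        """Stand-in for CRUSH: PG -> primary OSD id, in a single step."""
        pool_id, ps = pg
        osd_ids = sorted(OSD_ADDRS)
        return osd_ids[ps % len(osd_ids)]

    def resolve(pool_id, oid):
        """The whole resolver: one level, ending at one ObjectStore."""
        pg = object_to_pg(pool_id, oid)
        osd = crush_like_placement(pg)
        return OSD_ADDRS[osd]

    print(resolve(1, "rbd_data.1234"))   # e.g. ('10.0.0.1', 6801)

Deploying more ceph-osd daemons per device only grows the OSD_ADDRS table
(and, in reality, the CRUSH map and osdmap behind it); it does not add a
second resolution stage that clients would have to learn -- which is why
no protocol or client change is needed.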
> It's a fair point. To also play devil's advocate: If you are storing
> cache per OSD and the size of each cache grows with the number of OSDs,
> what happens as the number of cores / node grows? Maybe we are ok with
> current core counts. Would we still be ok with 256+ cores in a single
> node if the number of caches and the size of each cache grows together?

Well, osdmap uses a dedicated mempool. FWIW, local testing and grepping
ceph-users for mempool dumps suggest the cache stays in the
hundreds-of-KBs range. The (rough!) testing also shows linear growth with
the number of OSDs (see the back-of-the-envelope sketch in the PS below).
Still, even tens of MBs per cache instance might be acceptable, as:

* economy class (HDDs) would likely use a single OSD per disk -- no
  regression from what we have right now;
* high end already deploys multiple OSDs per device, and memory is of
  rather little concern there -- just like in the already pointed out
  case of powerful enough switches/network infrastructure.

Regards,
Radek

[1] Micron® 9200 MAX NVMe™ SSDs + Red Hat® Ceph Storage 3.0 Reference
    Architecture
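PS: a back-of-the-envelope sketch of the linear-growth assumption above,
in Python. The per-OSD cache sizes are illustrative placeholders (plug in
the figures from an actual mempool dump), not measurements:

    # Linear model only: each ceph-osd keeps its own osdmap cache.
    def osdmap_cache_per_node(osds_per_node, cache_per_osd_bytes):
        return osds_per_node * cache_per_osd_bytes

    KB, MB = 1024, 1024 * 1024

    # Hundreds-of-KBs-per-OSD case (what the rough local testing suggests):
    print(osdmap_cache_per_node(20, 500 * KB) / MB, "MB")   # ~9.8 MB / node

    # Pessimistic tens-of-MBs-per-OSD case on a dense 256-core box running
    # one OSD per core:
    print(osdmap_cache_per_node(256, 20 * MB) / MB, "MB")   # 5120.0 MB / node

Even the pessimistic row stays in the low-GBs-per-node range -- the kind
of memory the high-end deployments mentioned above already treat as a
rather little concern.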