Re: single-threaded seastar-osd

On 1/6/19 7:16 PM, Radoslaw Zarzynski wrote:
On Sat, Jan 5, 2019 at 7:42 PM Sage Weil <sweil@xxxxxxxxxx> wrote:
On Sat, 5 Jan 2019, kefu chai wrote:
- unable to share the osdmap cache. Considering a high-density storage
deployment where more than 40 disks are squeezed into a host, not being
able to reuse the osdmap cache is a shame.
I think this is a bit of a red herring.  We could confine each OSD to a
single core, each with its own distinct messenger, with no cross-core
context switches in the IO path, but still put all (or many) OSDs inside
the same process with a shared OSDMap cache.  The semantics of sharing the
cache are comparatively simple, with immutable maps that are only
occasionally added.
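
For concreteness, a minimal sketch of that kind of process-wide cache of
immutable maps; the names (SharedOSDMapCache, the OSDMap stand-in) are
hypothetical, not the existing OSDService code:

  // Hypothetical sketch only; epoch_t/OSDMap are stand-ins for the real types.
  #include <cstdint>
  #include <map>
  #include <memory>
  #include <mutex>

  using epoch_t = uint32_t;
  struct OSDMap { epoch_t epoch = 0; };   // stand-in for the real, immutable map

  class SharedOSDMapCache {
    std::mutex lock;   // guards only the index; the maps themselves are immutable
    std::map<epoch_t, std::shared_ptr<const OSDMap>> maps;
  public:
    std::shared_ptr<const OSDMap> get(epoch_t e) {
      std::lock_guard<std::mutex> g(lock);
      auto i = maps.find(e);
      return i == maps.end() ? nullptr : i->second;
    }
    void add(epoch_t e, std::shared_ptr<const OSDMap> m) {
      std::lock_guard<std::mutex> g(lock);
      maps.emplace(e, std::move(m));   // maps are only ever added, never mutated
    }
  };

Because a map is never modified after insertion, the only synchronization
needed is around the epoch->map index, which matches the "occasionally
added" semantics above.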
Agreed, resource sharing and engine isolation are somewhat related but
definitely not the same issues. From reduced sharing we expect better
performance and simplicity; from stronger isolation, fewer bugs and
further cost reduction. To exemplify: a perfect shared-nothing design
allows switching all std::shared_ptrs to non-atomic seastar::shared_ptrs
(and likewise for the ceph::atomic wrapper in general ;-), while perfect
engine isolation (a single thread) would let us merge the patches without
turning over every rock for correctness validation -- it would follow just
from the definition of a data race. No bug due to sharing would go
unnoticed in the review.
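
A small illustration of the refcounting point, assuming Seastar's
non-atomic smart pointers from <seastar/core/shared_ptr.hh>; the OSDMap
stand-in and aliases are made up for the example:

  #include <cstdint>
  #include <memory>
  #include <seastar/core/shared_ptr.hh>

  struct OSDMap { uint32_t epoch = 0; };   // stand-in for the real map

  // Shared across cores today: every copy/drop is an atomic RMW on the refcount.
  using CrossCoreRef = std::shared_ptr<const OSDMap>;

  // In a strict shared-nothing design a reference never leaves its reactor,
  // so the cheaper, non-atomic refcount of seastar::shared_ptr is enough.
  using ReactorLocalRef = seastar::shared_ptr<OSDMap>;

  ReactorLocalRef make_local_map(uint32_t e) {
    auto m = seastar::make_shared<OSDMap>();
    m->epoch = e;
    return m;
  }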

I want to know what an OSD means in this context. How should a user think about it? How should the user think about the governing process? Josh rightly pointed out to me that, when you get right down to it, an OSD as it exists today is a failure domain. That's still true here, but these OSDs seem a lot more like storage shards that theoretically exist as separate failure domains but for all practical purposes act as a group. I.e., are there good architectural reasons to map failure domains down to "cores" rather than "disks"? I think we want this because it's convenient that each OSD shard would have its own msgr and heartbeat services and we can avoid cross-core communication. It might even be the right decision practically, but I'm not sure that conceptually it really makes a lot of sense to me.


Without judging the reasonableness for now, I would just like to point out
that shared-something is theoretically possible even in the 1 OSD/1 process
approach, via shared memory.
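
A rough POSIX illustration of that idea: one process publishes an encoded,
immutable osdmap into a shm segment and the other OSD processes map it
read-only. Segment name, layout, and error handling are invented for the
sketch:

  #include <fcntl.h>
  #include <sys/mman.h>
  #include <unistd.h>
  #include <cstring>

  // Publisher: copy one encoded map into a named shared-memory segment.
  void* publish_map(const char* name, const void* buf, size_t len) {
    int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
    if (fd < 0)
      return nullptr;
    if (ftruncate(fd, static_cast<off_t>(len)) < 0) { close(fd); return nullptr; }
    void* p = mmap(nullptr, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    if (p == MAP_FAILED)
      return nullptr;
    std::memcpy(p, buf, len);   // the map is never modified after this point
    return p;
  }

  // Consumer (another OSD process): map the same segment read-only.
  const void* attach_map(const char* name, size_t len) {
    int fd = shm_open(name, O_RDONLY, 0);
    if (fd < 0)
      return nullptr;
    void* p = mmap(nullptr, len, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);
    return p == MAP_FAILED ? nullptr : p;
  }

Whether the extra lifecycle management (who creates, who unlinks, how
consumers learn the segment name and size) is worth it is exactly the
accounting question below.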

Before going further, let me play the accountant's advocate and
ask: is *spending* the complexity on the shared cache really worth the
benefits we could get? How much memory can we save?


It's a fair point. To also play devil's advocate: if you are storing a cache per OSD and the size of each cache grows with the number of OSDs, what happens as the number of cores per node grows? Maybe we are OK with current core counts. Would we still be OK with 256+ cores in a single node if the number of caches and the size of each cache grow together?
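
To make that accounting concrete, a toy calculation; every input below is
a made-up round number (real figures depend on cluster size and the map
cache settings):

  #include <cstdio>

  int main() {
    const double map_bytes  = 1.0 * 1024 * 1024;  // assume ~1 MiB per encoded OSDMap epoch
    const int epochs_cached = 50;                 // assume each OSD keeps ~50 recent epochs
    const int osds_per_node = 40;                 // the 40-disk host from the thread

    const double per_osd    = map_bytes * epochs_cached;       // cache held by one OSD
    const double duplicated = per_osd * (osds_per_node - 1);   // extra copies vs. one shared cache

    std::printf("per-OSD cache: %.0f MiB, duplicated on the node: %.0f MiB\n",
                per_osd / (1024 * 1024), duplicated / (1024 * 1024));
    return 0;
  }

Under those assumptions roughly 2 GiB per node is duplicate data, and the
duplication scales linearly with the OSD (core) count, which is the 256+
core concern.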



From yet another side: multiple OSDs in the same process, but with *almost*
no sharing, would *still* allow for the user-space IO scheduler Kefu pointed
out some time ago.





Regards,
Radek


