Re: single-threaded seastar-osd

On 1/8/19 7:01 AM, kefu chai wrote:
On Tue, Jan 8, 2019 at 1:52 AM Mark Nelson <mnelson@xxxxxxxxxx> wrote:

On 1/5/19 8:41 AM, kefu chai wrote:

as you might know, seastar encourages a share-nothing programming
paradigm. as we found in previous discussions, there is always some
cross-core communication in the sharded seastar-osd, because there are
a couple of infrastructure pieces that could be shared by a sharded
OSD, namely:

- osdmap cache
- connection to peer OSDs, and heartbeats with them
- connection to monitor and mgr, and beacon/reports to them
- i/o to the underlying objectstore

recently, while working on the cross-core messenger[0], we found
that, in order to share a connection between cores, we need types
like "seastar::lw_shared_ptr<seastar::foreign_ptr<ConnectionRef>>",
because
- the connections to peer OSDs are shared across cores, and
- the connections are shared by multiple continuations on the local
core -- either locally or remotely.

and we need to perform i/o on the core where the connection was
established. personally, i feel that this is a bad smell, as it's
complicated and always involves cross-core communication.
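
to make the smell concrete, here is a minimal sketch of the pattern
(the "Connection" type and its send() are made-up stand-ins for the
real crimson classes, and it assumes a recent seastar): the connection
lives on the shard that established it, a remote shard only holds a
foreign_ptr to it, and every send has to hop through smp::submit_to():

  #include <seastar/core/future.hh>
  #include <seastar/core/shared_ptr.hh>
  #include <seastar/core/sharded.hh>   // foreign_ptr
  #include <seastar/core/smp.hh>

  struct Connection {
    // may only be used on the shard that established it
    seastar::future<> send(/* MessageRef m */) {
      return seastar::make_ready_future<>();
    }
  };
  using ConnectionRef = seastar::shared_ptr<Connection>;

  // the type from above: a locally ref-counted handle to a connection
  // that is owned by (and must be used on) another core
  using CrossCoreConnRef =
      seastar::lw_shared_ptr<seastar::foreign_ptr<ConnectionRef>>;

  // every i/o has to hop to the owner core and back
  seastar::future<> send_message(CrossCoreConnRef conn) {
    Connection* c = &**conn;   // address only; dereferenced on the owner shard
    return seastar::smp::submit_to(conn->get_owner_shard(), [c] {
      return c->send();        // runs on the shard owning the connection
    }).finally([conn] {});     // keep the local reference alive until done
  }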

Radoslaw suggested an alternative: a single-threaded OSD, which pushes
the share-nothing design to another level. in this design, just like in
the existing model, an OSD host will still have multiple OSD instances,
but each OSD instance will run on, and only on, its own designated
core. nothing will be shared across these OSD instances. so we can
still benefit from Seastar, and at the same time we won't need to worry
about the complexity and performance degradation of cross-core
communication. this design resembles the co-located OSD design we were
talking about, in the sense that all OSDs will reside in the same
process, but it differs in that it enforces a strict share-nothing
model.
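
to illustrate the idea, a rough sketch of that layout (the "OSD" class
and the hard-coded device list are hypothetical, and it assumes a
recent seastar): one OSD instance per reactor shard, constructed and
driven entirely on its own core, with no cross-shard references at all:

  #include <seastar/core/app-template.hh>
  #include <seastar/core/future.hh>
  #include <seastar/core/sharded.hh>
  #include <seastar/core/shared_ptr.hh>
  #include <seastar/core/smp.hh>
  #include <string>
  #include <vector>

  // everything an OSD needs lives inside it, on one shard: its own store,
  // messenger, osdmap cache, heartbeats, mon/mgr connections, ...
  class OSD {
    std::string _device;
  public:
    explicit OSD(const std::vector<std::string>& devices)
      // 1:1 mapping: shard i drives device i; with more disks than cores
      // this fails, which the real system would catch at mkfs time
      : _device(devices.at(seastar::this_shard_id())) {}
    seastar::future<> start() { return seastar::make_ready_future<>(); }
    seastar::future<> stop()  { return seastar::make_ready_future<>(); }
  };

  int main(int argc, char** argv) {
    seastar::app_template app;
    return app.run(argc, argv, [] {
      auto osds = seastar::make_lw_shared<seastar::sharded<OSD>>();
      std::vector<std::string> devices{"/dev/nvme0n1", "/dev/nvme1n1"};
      return osds->start(devices)   // one instance per core
        .then([osds] { return osds->invoke_on_all(&OSD::start); })
        // a real daemon would wait for shutdown here instead of exiting
        .finally([osds] { return osds->stop(); });
    });
  }

here nothing crosses a shard boundary after start-up, which is exactly
what makes the foreign_ptr type above unnecessary.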

but on the other hand, the single-threaded OSD has the following
restrictions/assumptions:

- 1-to-1 mapping from core to OSD. some of the following questions
also apply to the NIC.
    * Mark worried about what happens if we have more stores than
cores, or more cores than disks. how do we do the mapping? probably by
grouping disks into an LVM volume? but that would increase the load on
the core which gets mapped to that LVM volume, which causes load
imbalance, i think.
    * how about more cores than stores?
    * how do we shard a high-throughput storage device?
      for instance, to take full advantage of an NVMe storage device,
we might need to drive it with 4 or more cores. but how do we do that?
can we leverage virtualization techniques like SPDK-vhost or SPDK
Blobstore? for a device supporting SR-IOV, it'd probably be simpler.
- unable to share the osdmap cache. considering a high-density storage
deployment, where more than 40 disks are squeezed into a host, it's a
shame if we are not able to reuse the osdmap cache.
- unable to share the connections to peer OSDs, mon and mgr.
    probably not a big deal in comparison to the existing
non-co-located OSD, but if we compare it with the co-located OSD, well,
you'll see what we would be missing.

we had some discussions on this topic recently in the crimson standup
and in the perf meeting, but i feel that the only consensus we reached
is that it's difficult to tell which way to go -- 1:1 mapping or m:n
mapping.

what i can think of is to avoid making the decision now, and instead
to encapsulate the difference between these two approaches as much as
possible in smaller scopes. for instance, hide the difference between
a shared messenger and a non-shared messenger in the messenger's
implementation itself, and provide a consistent API to the
caller/dispatcher, so that we can switch over to the single-threaded
OSD in the future, if necessary, with less pain. but i admit that it
does not address the pain of
"seastar::lw_shared_ptr<seastar::foreign_ptr<ConnectionRef>>". =)

thoughts?


Hi Kefu,


I think you hit the nail on the head regarding the technical discussion
here.  After the perf call last week I spent some time researching
SR-IOV and also reached out to some of the hardware vendors for
clarification on their capabilities and plans for storage hardware
virtualization.  So far my impression is:


- SR-IOV is fairly well supported on the networking side.

- Currently very few storage products support hardware virtualization
via multiple namespaces or SR-IOV VFs.

- The NVMe 1.3 spec will help, but I can't find any documentation
regarding enforcement of minimum standards.

- SR-IOV in general will require CPU/MB support in addition to
network/disk support.

- SR-IOV is probably the future, but we may not strictly need it for the
single-core OSD model (blobstore?  blobfs?  Direct access to HW queues?)


Perhaps folks from HW vendors can jump in and fix any misconceptions
here or provide guidance regarding future direction. I think a lot of
this comes down to how we view what an OSD actually represents.  Is it
just some simple/dumb entity (almost like a shard) that executes on some
fraction of hardware that is governed at a higher level?  Alternatively,
does the OSD represent a grouping of hardware entities that have some
relationship and "closeness" to one another that can be exploited in
ways that we can't exploit at a higher level?  How much autonomy does an

yes, i think it represents a combination of
- failure domains, for better HA
- physically connected components, which help us approach better
performance. consider NUMA, and perhaps a smart drive programmed
with (part of) the ceph-osd stack!
- physically partitioned/sharded resources, which help us make
better use of them with a better TCO. consider multi-core CPUs,
high-throughput NVMe devices, NICs supporting SR-IOV, and
virtualization techniques.

I guess this is where I think we are being sort of vague and loose with our terminology. These could still be individual daemons, but they don't appear to be distinct failure domains anymore (though even in our existing model they aren't really distinct in many cases). From a convenience and maybe performance standpoint it could make sense to have a msgr per core. It might be the right decision. It seems to push us further into the territory of the OSD acting as a storage shard that exists as part of an abstract failure domain that the OSD itself doesn't really encapsulate. Again, I'm not saying it's the wrong choice, just that I think the overall result might be pretty confusing for users to think about.


OSD have to make decisions on its own?  How should a user think about
OSDs that exist in this kind of model?

i think an OSD is a minimal, self-contained combination of resources,
so it can work on its own in the sense that it is discoverable and is
smart enough to work with its peers. it's like an intelligent agent in
artificial intelligence.

So in that sense you could sort of think of a single core with shared disk, memory, network, and cache as being that minimal unit of resources. I'm not sure that makes sense though. I don't know of any Linux systems out there that let an individual core fail by itself. It seems like we're really just deciding here whether we want to adopt the convenience of having a msgr per core in exchange for a large increase (in some cases) in OSDs per storage device, with the associated trade-offs.




The single-core OSD approach is attractive in a lot of ways.  You can
avoid the cross-core messenger issue, and avoid issues which have
historically required locking (pg!) and/or atomics.  It also lends
itself well to concurrent OSDs running on the same filesystem or block
devices, or, in the future, using SR-IOV with SPDK for HW virtualization
without any kernel involvement.  If we go this route, I think we should
focus on keeping the OSD as simple as possible, as that's sort of the
main advantage of this approach.  We'll really need to focus on
supporting massive numbers of OSDs smoothly and easily.  We'll be
relying heavily on crush and the monitors to scale well.  I imagine it
could be very successful for small- to medium-sized installations.
Other questions:  How much memory does this waste? (Greg specifically
pointed this out)  What

i think it depends:

- if we share the osdmap cache across the co-located OSDs, the question
is probably more like: how much memory does this save? because it is
exactly the co-located solution + a shared osdmap cache.
- if the osdmap cache is not shared, it's no worse than the existing
multi-process OSD deployment model.
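
just to illustrate the shared-cache variant: a hand-wavy sketch
(hypothetical types, not how crimson actually handles maps) where one
shard owns the cache and the co-located OSDs fetch a foreign_ptr to a
map instead of each keeping its own copy:

  #include <seastar/core/future.hh>
  #include <seastar/core/shared_ptr.hh>
  #include <seastar/core/sharded.hh>   // foreign_ptr, make_foreign
  #include <seastar/core/smp.hh>
  #include <map>

  struct OSDMap { /* epoch, crush map, pools, ... */ };
  using OSDMapRef = seastar::lw_shared_ptr<OSDMap>;

  constexpr unsigned cache_shard = 0;           // the one shard owning the cache
  std::map<unsigned, OSDMapRef> osdmap_cache;   // only ever touched on cache_shard

  // callable from any co-located OSD's shard; the returned foreign_ptr pins
  // the map on cache_shard instead of duplicating it per OSD
  seastar::future<seastar::foreign_ptr<OSDMapRef>> get_map(unsigned epoch) {
    return seastar::smp::submit_to(cache_shard, [epoch] {
      auto& ref = osdmap_cache[epoch];
      if (!ref) {
        // a real cache would fetch the missing epoch from the monitor here
        ref = seastar::make_lw_shared<OSDMap>();
      }
      return seastar::make_foreign(OSDMapRef(ref));
    });
  }

of course every lookup is itself a cross-core hop, which is exactly the
cost the strict share-nothing model is trying to avoid, so the memory
saving has to be weighed against that.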

Are we copying the osdmap per thread inside the OSD right now? I haven't looked that closely. I would think this model would increase memory usage without a shared osdmap cache?


do you do when the disk/core mapping isn't clean (increasing the number
of OSDs to get better balance exacerbates some of the issues)?  What

could you define "isn't clean"? i think if we go with the co-located
1:1 multi-threaded OSD, then when we mkfs an OSD, a new OSD instance is
registered; when we launch the service with something like "systemctl
start ceph-osd@0", it will start the thread on, for example, the next
free core. if we launch "systemctl start ceph.target", then all OSDs
will be enumerated and each mapped to its own core. if there are more
disks than cores, i guess we should ... fail at the mkfs phase =)

Sorry, I wasn't clear. I did in fact mean when you don't have a 1:1 mapping. Ideally we'd keep resources even (or perhaps have slightly more cores than disks) and try to make sure that each core is fast enough to process requests for 1 storage resource. Maybe adopting this model is worth it for the simplicity, even when things are skewed.


do the PG mapping and pglog look like if you potentially have 8+
OSDs per NVMe drive?  What is the emergent behavior of the system when you

why does it matter? is it different from the existing model in a way
that does not scale?

If we go down the 1 OSD per core route, it probably means that we're putting multiple OSDs on fast devices (unless the savings are so great that our CPU usage goes down by like 8x+). That sort of forces us into assuming that fast devices can handle concurrent sequential writes quickly (at the very least separate RocksDB WALs), i.e. we're sort of implicitly sharding multiple DBs onto the device. That's something I've advocated for in bluestore (vs sharding across column families) so I'm not going to criticize it, but I think we should acknowledge that this is what we are more or less implicitly doing in this model.


take this model to the extreme?

could you name some of them so we can be more specific?

Say in 5 years optane or nvdimm style technologies become significantly cheaper, yet we still can't make individual cores any faster (we just get more of them). What happens if we start needing dozens of OSDs on one device to achieve high performance? Do the underlying layers hold up? Does the model in general hold up?




An alternative I've been thinking about is to treat the OSD as a logical
grouping of close hardware (maybe a numa-node).  Fewer OSDs, and so a
lower requirement on the monitors and crush, but with increased OSD
complexity.  Can we still shard internally and just eat memory copies
where appropriate (or deal with atomics and/or locking)?  Does the
advent of nvdimm and other very fast persistent memory make crush less
relevant within a specific numa node for data placement (how valuable is
dynamic data placement within a numa node?).  How much memory do we save
by avoiding having tons of tiny OSDs?  My own goal with this model would
be to make the OSD as self-configuring as possible and avoid external
API complexity and decision making.  The OSD takes control over a
specific hardware group and gets far fewer external instructions
regarding how to make use of it.  The goal would be to ask much less of
ceph-disk,

i think the mini-cluster defeats the purpose of the sharded OSD and
seastar. and it's more complicated, IMHO. i agree that
self-configuration is appealing, but that's not something OSDs should
be concerned with. i think they should be smart, but not too smart.
they are like a robot army following a protocol. if we want to make
them smarter, we can have them report their topology and probably other
stats to the mgr or a local agent to make this kind of decision.

It absolutely is more complicated, and it might be a bad idea. You and Radoslaw are the experts here and know the complications in the seastar code better than anyone. Whatever you guys think is right will be more informed than whatever the rest of us think. I'm just the peanut gallery from the sidelines trying to poke holes. It's a lot easier to do it now rather than 2 years from now. :)


ceph-volume, and ceph-ansible with the assumption that the OSD itself
can make better decisions about how to make use of resources.


I oscillate between both approaches, but I think we probably want to
pick one or the other in the long run.  I'm afraid that if we try to
support both models long term we'll end up with few of the advantages
and most of the disadvantages of both.  On the other hand, it might
make sense to prototype both approaches (maybe without any kind of
recovery features or other complexity that would slow development down)
just to get a sense of what each feels like.


Mark



---
[0] https://github.com/ceph/ceph/pull/24945






