Re: single-threaded seastar-osd

On 22:41 Sat 05 Jan, kefu chai wrote:
> and we need to perform i/o on the core where the connection is
> established. personally, i feel that it's a bad smell, as it's
> complicated and always involves cross-core communications.
> 

In seastar, or any other shared-nothing, lockless design, we have to
decide which core performs the I/O for each connection, and the natural
choice is the core where the connection was established and the socket
created. However, I think the more important decision is how we shard
all those connections: A) in the current crimson-msgr design we shard
connections by peer client within the OSD; B) in the
single-threaded-OSD design we effectively shard connections by OSD
within the host; and C) the idea from Patrick and Haomai is to shard
connections by PG within the OSD. I'm not sure, but I have a feeling
that the right decision depends on how we define the concepts of OSD,
PG-shard and PG.
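
To make the difference concrete, here is a minimal sketch (plain C++,
illustrative names only, not the actual crimson-msgr interfaces) of the
core-selection function each policy implies, assuming a fixed number of
reactor cores and hashable identities for the peer client, the peer OSD
and the PG:

#include <cstdint>
#include <functional>
#include <string>

using core_id = unsigned;

// A) shard by peer client: every client entity is pinned to one core.
core_id shard_by_client(const std::string& client_entity,
                        unsigned num_cores) {
  return std::hash<std::string>{}(client_entity) % num_cores;
}

// B) shard by OSD within the host: one core per OSD, so a connection
// lands on the core owning the target OSD (the single-threaded-OSD
// model).
core_id shard_by_osd(uint32_t osd_id, unsigned num_cores) {
  return osd_id % num_cores;
}

// C) shard by PG: the connection is handled on the core that owns the
// PG(-shard) its messages target.
core_id shard_by_pg(uint64_t pg_id, unsigned num_cores) {
  return pg_id % num_cores;
}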


On 09:28 Tue 08 Jan, Mark Nelson wrote:
>On 1/8/19 7:01 AM, kefu chai wrote:
>>On Tue, Jan 8, 2019 at 1:52 AM Mark Nelson <mnelson@xxxxxxxxxx> wrote:
>>>Perhaps folks from HW vendors can jump in and fix any misconceptions
>>>here or provide guidance regarding future direction. I think a lot of
>>>this comes down to how we view what an OSD actually represents.  Is it
>>>just some simple/dumb entity (almost like a shard) that executes on some
>>>fraction of hardware that is governed at a higher level?  Alternately,
>>>does the OSD represent a grouping of hardware entities that have some
>>>relationship and "closeness" to one another that can be exploited in
>>>ways that we can't exploit at a higher level?  How much autonomy does an
>>
>>yes, i think it represents a combination of
>>- failure domains, for better HA
>>- physically connected components, which help us approach better
>>performance. please consider NUMA, and perhaps a smart drive
>>programmed with (part of) the ceph-osd stack!
>>- physically partitioned/sharded resources, which help us make better
>>use of them at a better TCO. please consider multi-core CPUs,
>>high-throughput NVMe devices, NICs supporting SR-IOV, and
>>virtualization techniques.
>
>I guess this is where I think we are being sort of vague and loose with our
>terminology.  These could still be individual daemons, but they don't appear
>to be distinct failure domains anymore (though even in our existing model
>they aren't really distinct in many cases).  From a convenience and maybe
>performance standpoint it could make sense to have a msgr per core.  It
>might be the right decision.  It seems to push us further into the territory
>of the OSD acting as a storage shard that exists as part of an abstract
>failure domain that the OSD itself doesn't really encapsulate.  Again I'm
>not saying it's the wrong choice, just that I think the overall result might
>be pretty confusing for users to think about.
>

I agree that an OSD is first a concept of the physical failure domain,
which is often associated with disks; second, it can be associated with
a reasonable amount of compute resources matching the capability of
those disks; and third, it can share common utilities within that
domain to save compute resources. For slow HDDs we might want multiple
disks to share a few powerful cores, while for persistent memory we may
need multiple cores to consume its fast, low-latency I/O. The
single-threaded-OSD solution seems to have the limitation of
hard-coding the compute resources to the physical failure domain.

PG is a concept of the logical failure domain. It is very dynamic, so
it is obviously not a good choice to hard-code CPU assignment directly
to PGs. PG-shard is designed to carry the hard association with a CPU:
it survives PG configuration changes and still allows the OSD the
flexibility to utilize multiple cores. So I believe keeping the current
hierarchy of OSD, PG-shard and PG is the correct direction.
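
As a rough sketch of that hierarchy (the types and the mapping below
are illustrative only, not Ceph's actual sharding code): a fixed set of
PG-shards is pinned to cores once, and PGs map onto shards through a
stable function of their identity, so creating or splitting PGs never
moves a shard's CPU assignment:

#include <cstdint>
#include <map>
#include <vector>

struct pg_id_t { uint64_t pool; uint32_t seed; };  // simplified PG identity

struct pg_state {};                                // placeholder

struct pg_shard_t {
  unsigned core;                     // hard association with one core
  std::map<uint32_t, pg_state> pgs;  // PGs currently hosted by this shard
};

struct osd_t {
  std::vector<pg_shard_t> shards;    // sized once, e.g. to the core count

  explicit osd_t(unsigned num_shards) : shards(num_shards) {
    for (unsigned i = 0; i < num_shards; ++i) {
      shards[i].core = i;
    }
  }

  // A stable function of the PG identity picks the shard, so adding or
  // removing PGs does not disturb the shard-to-core binding.
  pg_shard_t& shard_of(const pg_id_t& pg) {
    return shards[pg.seed % shards.size()];
  }
};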

Coming back to the choices of connection sharding policy, I feel the
most intuitive choice is option C). A) leads to more cross-core
communication in the critical I/O path, which will hurt performance; B)
raises concerns about excessive connection counts and inflexible
compute-resource allocation; C) looks optimal and avoids these
limitations, but requires considerable changes to the current OSD.
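
To illustrate the cost of A) versus C), here is a rough seastar-style
sketch; pg_core_of() and the op handling are hypothetical stand-ins,
and the includes assume a recent seastar tree. With connections sharded
by client, the receiving core usually differs from the PG's core, so
almost every op pays a cross-core submit_to in the critical path; with
C) the two cores coincide and the hop disappears:

#include <cstdint>
#include <seastar/core/future.hh>
#include <seastar/core/smp.hh>

using namespace seastar;

// Assumed PG -> core mapping; not crimson code.
unsigned pg_core_of(uint64_t pg_id) {
  return pg_id % smp::count;
}

future<> dispatch_op_to_pg_core(uint64_t pg_id, int op) {
  unsigned target = pg_core_of(pg_id);
  // Under policy A) this usually crosses cores; under C) target would
  // normally be the current core and submit_to runs the task locally.
  return smp::submit_to(target, [op] {
    // placeholder for the actual PG op handling
    (void)op;
    return make_ready_future<>();
  });
}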


On 19:15 Tue 08 Jan, kefu chai wrote:
>   to allow sharing the connections, we need to have a predictable
> connection placement policy; apparently, plain round-robin cannot
> fulfill this requirement. so we have 3 options.
>   a) note down where the connection is established, and always perform
> i/o on that core, and handle the complexity of connection replacement
> across cores.
>   b) change seastar, so that it allows us to plug in a placement
> policy class to customize the placement. this change should apply to
> both the POSIX and native stacks if we want to keep the API consistent
> between these two stack implementations.
>   c) change seastar, so that it allows us to move an established
> connection across cores. one option is to switch from
> lw_shared_ptr<pollable_fd> to foreign_ptr<lw_shared_ptr<pollable_fd>>
> in posix_connected_socket_impl and its friends

Apparently we need to implement our own placement policy if we do not
go with the single-threaded-OSD solution. a) is compatible with current
seastar but will introduce many more cross-core communications in the
critical I/O path; b) is problematic because, when a connection is
established, the messenger only knows the peer IP and a randomly
selected port; it does not yet know exactly which peer it is, or which
PG the peer belongs to; c) is the most reasonable to me, but it is not
officially supported by current seastar, so we would need to take pains
to modify the framework or find a workaround.
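
For reference, here is a minimal sketch of the foreign_ptr pattern
behind c); the connection type is a made-up placeholder, not seastar's
pollable_fd plumbing. The accepting core keeps ownership of the socket,
while another core holds a foreign_ptr and hops back via submit_to for
the actual I/O:

#include <utility>
#include <seastar/core/future.hh>
#include <seastar/core/shared_ptr.hh>
#include <seastar/core/sharded.hh>  // foreign_ptr, make_foreign
#include <seastar/core/smp.hh>

using namespace seastar;

struct connection {
  // socket state owned by the accepting core
  future<> do_io() { return make_ready_future<>(); }
};

using conn_ref = foreign_ptr<lw_shared_ptr<connection>>;

// Runs on the accepting core: wrap the connection so it can be handed
// to the core that should service it (e.g. the owning PG-shard's core).
conn_ref export_connection(lw_shared_ptr<connection> conn) {
  return make_foreign(std::move(conn));
}

// Runs on the servicing core: hop back to the owner core for the socket
// operation itself, since the fd is still registered with that reactor.
// The caller must keep `conn` alive until the returned future resolves.
future<> do_io_on_owner(conn_ref& conn) {
  return smp::submit_to(conn.get_owner_shard(), [c = conn.get()] {
    return c->do_io();
  });
}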


--
Best,
Yingxin


