On Thu, Feb 8, 2018 at 3:22 AM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> On Wed, Feb 7, 2018 at 9:11 AM, Casey Bodley <cbodley@xxxxxxxxxx> wrote:
>>
>> On 02/07/2018 11:01 AM, kefu chai wrote:
>>>
>>> On Wed, Jan 31, 2018 at 6:32 AM, Josh Durgin <jdurgin@xxxxxxxxxx> wrote:
>>>>
>>>> [adding ceph-devel]
>>>>
>>>> On 01/30/2018 01:56 PM, Casey Bodley wrote:
>>>>>
>>>>> Hey Josh,
>>>>>
>>>>> I heard you mention in the call yesterday that you're looking into
>>>>> this part of seastar integration. I was just reading through the
>>>>> relevant code over the weekend, and wanted to compare notes:
>>>>>
>>>>> in seastar, all cross-core communication goes through lockfree spsc
>>>>> queues, which are encapsulated by 'class smp_message_queue' in
>>>>> core/reactor.hh. all of these queues (smp::_qs) are allocated on
>>>>> startup in smp::configure(). early in reactor::run() (which is
>>>>> effectively each seastar thread's entrypoint), it registers an
>>>>> smp_poller to poll all of the queues directed at that cpu.
>>>>>
>>>>> what we need is a way to inject messages into each seastar reactor
>>>>> from arbitrary/external threads. our requirements are very similar to
>>>
>>> i think we will have a sharded<osd::PublicService> on each core. each
>>> instance of PublicService will be listening for and serving requests
>>> from external clients of the cluster. the same applies to
>>> sharded<osd::ClusterService>, which will be responsible for serving
>>> the requests from its peers in the cluster. the control flow of a
>>> typical OSD read request from a public RADOS client will look like:
>>>
>>> 1. the TCP connection is accepted by one of the listening
>>>    sharded<osd::PublicService> instances.
>>> 2. decode the message.
>>> 3. the osd encapsulates the request in the message as a future, and
>>>    submits it to another core after hashing the involved pg # to the
>>>    core #, something like (in pseudo code):
>>>
>>>      engine().submit_to(osdmap_shard, [] {
>>>        return get_newer_osdmap(m->epoch);
>>>        // need to figure out how to reference an "osdmap service" in seastar.
>>>      }).then([] (auto osdmap) {
>>>        submit_to(pg_to_shard(m->ops.op.pg), [] {
>>>          return pg.do_ops(m->ops);
>>>        });
>>>      });
>>>
>>> 4. the core serving the involved pg (i.e. the pg service) will dequeue
>>>    this request, and use a read_dma() call to delegate the aio request
>>>    to the core maintaining the io queue.
>>> 5. once the aio completes, the PublicService will continue with the
>>>    then() block and send the response back to the client.
>>>
>>> so the question is: why do we need an mpsc queue? the nr_core*nr_core
>>> spsc queues are good enough for us, i think.
>>
>> Hey Kefu,
>>
>> That sounds entirely reasonable, but assumes that everything will be
>> running inside of seastar from the start. We've been looking for an
>> incremental approach that would allow us to start with some subset
>> running inside of seastar, with a mechanism for communication between
>> that and the osd's existing threads. One suggestion was to start with
>> just the messenger inside of seastar, and gradually move that
>> seastar-to-external-thread boundary further down the io path as code is
>> refactored to support it. It sounds unlikely that we'll ever get rocksdb
>> running inside of seastar, so the objectstore will need its own threads
>> until there's a viable alternative.
>>
>> So the mpsc queue and smp::external_submit_to() interface was a strategy
>> for passing messages into seastar from arbitrary non-seastar threads.
>> Communication in the other direction just needs to be non-blocking (my
>> example just signaled a condition variable without holding its mutex).
>>
>> What are your thoughts on the incremental approach?

yes. if we need to send from a thread running on a random core, we do need
the mpsc queue and an smp::external_submit_to() interface, as we don't have
access to the TLS "local_engine". but this hybrid approach makes me nervous,
as i think seastar is an intrusive framework: we either embrace it or go
with our own work queue model. let me give it a try to see if we can have a
firewall between the seastar world and the non-seastar world.
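
to make that concrete, below is a minimal, self-contained sketch of the kind
of injection mechanism being discussed. it is not seastar code, and the names
(external_task_queue, submit(), poll_from_reactor()) are made up: a plain
mutex-protected queue stands in for the lockfree mpsc queue, and the places
where a real smp::external_submit_to() would hook into seastar (registering
an smp_poller, waking an idle reactor) are only noted in comments.

#include <functional>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>

// mpsc-style task queue owned by one reactor, fed by arbitrary threads.
class external_task_queue {
  std::mutex mtx;
  std::queue<std::function<void()>> tasks;
public:
  // called from arbitrary non-seastar threads.
  void submit(std::function<void()> fn) {
    std::lock_guard<std::mutex> lock(mtx);
    tasks.push(std::move(fn));
    // a real implementation would also wake the target reactor here.
  }
  // called from the owning reactor's poller; returns true if it did work.
  bool poll_from_reactor() {
    std::queue<std::function<void()>> batch;
    {
      std::lock_guard<std::mutex> lock(mtx);
      if (tasks.empty()) {
        return false;
      }
      batch.swap(tasks);
    }
    while (!batch.empty()) {
      batch.front()();   // runs on the reactor thread
      batch.pop();
    }
    return true;
  }
};

int main() {
  external_task_queue q;
  bool done = false;
  // stand-in for an existing osd worker thread living outside seastar.
  std::thread external([&q, &done] {
    q.submit([&done] {
      std::cout << "handled on the reactor thread" << std::endl;
      done = true;   // only ever touched on the reactor side
    });
  });
  // stand-in for the reactor's poll loop (the smp_poller slot).
  while (!done) {
    q.poll_from_reactor();
  }
  external.join();
  return 0;
}

the real version would presumably reuse the batching and wakeup machinery
that smp_message_queue already has instead of a mutex, but the shape of the
boundary would be the same.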
>>
>> Casey
>>
>> ps. I'd love to see more thought put into the design of the finished
>> product, and your outline is a good start! Avi Kivity @scylladb shared
>> one suggestion that I really liked, which was to give each shard of the
>> osd a separate network endpoint, and add enough information to the
>> osdmap so that clients could send their messages directly to the shard
>> that would process them. That piece can come in later, but could
>> eliminate some of the extra latency from your step 3.
>
> This is something we've discussed but will want to think about very
> carefully once we have more performance available. Increasing the number
> of (very stateful) connections the OSDs and clients need to maintain like
> that is not something to undertake lightly right now, and in fact is the
> opposite of the multiplexing connections work going on for msgr v2. ;)
> -Greg

--
Regards
Kefu Chai
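
To make the all-seastar control flow in steps 1-5 of the quoted message
concrete, here is a minimal sketch built around seastar::smp::submit_to().
It assumes a recent seastar checkout (the <seastar/...> header layout rather
than the core/ paths mentioned above), and pg_t, pg_to_shard() and
do_pg_ops() are made-up stand-ins for the real osd pieces; the messenger,
osdmap service and objectstore are left out entirely.

#include <seastar/core/app-template.hh>
#include <seastar/core/future.hh>
#include <seastar/core/reactor.hh>
#include <seastar/core/smp.hh>
#include <iostream>

using pg_t = unsigned;   // stand-in for the real pg id type

// the routing decision from step 3: hash the pg onto a shard/reactor.
static unsigned pg_to_shard(pg_t pg) {
  return pg % seastar::smp::count;
}

// stand-in for pg.do_ops(); a real osd would issue the read_dma() here.
static seastar::future<int> do_pg_ops(pg_t pg) {
  std::cout << "pg " << pg << " served on shard "
            << seastar::this_shard_id() << "\n";
  return seastar::make_ready_future<int>(0);
}

int main(int argc, char** argv) {
  seastar::app_template app;
  return app.run(argc, argv, [] {
    pg_t pg = 42;   // pretend this was decoded from an incoming MOSDOp
    // hop to the shard that owns the pg, run the ops there, then come
    // back to the accepting shard to send the reply.
    return seastar::smp::submit_to(pg_to_shard(pg), [pg] {
      return do_pg_ops(pg);
    }).then([] (int r) {
      std::cout << "reply to client, result=" << r << "\n";
      return seastar::make_ready_future<>();
    });
  });
}

Each hop in step 3 (first to the osdmap shard, then to the pg shard) would
just be another submit_to() in the same chain.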