Re: seastar and 'tame reactor'

RocksDB abstracts those synchronization primitives in
https://github.com/facebook/rocksdb/blob/master/port/port.h, and here
is an example port:
https://github.com/facebook/rocksdb/blob/master/port/port_example.h
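The port layer exposes synchronization through a small wrapper surface, so a
new platform (for example one backed by seastar) only has to reimplement that
surface while the engine code above it stays unchanged. A minimal sketch in
the spirit of port.h (illustrative, not rocksdb's exact code):

```cpp
#include <condition_variable>
#include <mutex>

// Sketch of a port-layer shim in the spirit of rocksdb's port/port.h:
// the engine calls these wrappers instead of OS primitives directly, so
// a port only has to reimplement this small surface.
namespace port {

class Mutex {
 public:
  void Lock() { mu_.lock(); }
  void Unlock() { mu_.unlock(); }

 private:
  friend class CondVar;
  std::mutex mu_;
};

class CondVar {
 public:
  explicit CondVar(Mutex* mu) : mu_(mu) {}
  // Precondition: caller holds the mutex, as in rocksdb's contract.
  void Wait() {
    std::unique_lock<std::mutex> lk(mu_->mu_, std::adopt_lock);
    cv_.wait(lk);
    lk.release();  // caller still owns the lock on return
  }
  void Signal() { cv_.notify_one(); }
  void SignalAll() { cv_.notify_all(); }

 private:
  Mutex* mu_;
  std::condition_variable cv_;
};

}  // namespace port
```

A seastar-backed port could swap the std::mutex internals for
reactor-friendly primitives behind the same Lock/Unlock/Wait/Signal surface.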

2018-02-13 23:46 GMT+08:00, Casey Bodley <cbodley@xxxxxxxxxx>:
>
>
> On 02/12/2018 02:40 PM, Allen Samuels wrote:
>> I would think that it ought to be reasonably straightforward to get
>> RocksDB (or other thread-based foreign code) to run under the seastar
>> framework provided that you're able to locate all os-invoking primitives
>> within the foreign code and convert those into calls into your
>> compatibility layer. That layer would have to simulate context switching
>> (relatively easy) as well as provide an implementation of that kernel
>> call. In the case of RocksDB, some of that work has already been done
>> (generally, the file and I/O operations are done through a compatibility
>> layer that's provided as a parameter. I'm not as sure about the
>> synchronization primitives, but it ought to be relatively easy to extend
>> to cover those).
>>
>> Has this been discussed?
>
> I don't think it has, no. I'm not familiar with these rocksdb env
> interfaces, but this sounds promising.
>
>>
>> Allen Samuels
>> R&D Engineering Fellow
>>
>> Western Digital®
>> Email:  allen.samuels@xxxxxxx
>> Office:  +1-408-801-7030
>> Mobile: +1-408-780-6416
>>
>>
>>> -----Original Message-----
>>> From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-
>>> owner@xxxxxxxxxxxxxxx] On Behalf Of Casey Bodley
>>> Sent: Wednesday, February 07, 2018 9:11 AM
>>> To: kefu chai <tchaikov@xxxxxxxxx>; Josh Durgin <jdurgin@xxxxxxxxxx>
>>> Cc: Adam Emerson <aemerson@xxxxxxxxxx>; Gregory Farnum
>>> <gfarnum@xxxxxxxxxx>; ceph-devel <ceph-devel@xxxxxxxxxxxxxxx>
>>> Subject: Re: seastar and 'tame reactor'
>>>
>>>
>>> On 02/07/2018 11:01 AM, kefu chai wrote:
>>>> On Wed, Jan 31, 2018 at 6:32 AM, Josh Durgin <jdurgin@xxxxxxxxxx>
>>> wrote:
>>>>> [adding ceph-devel]
>>>>>
>>>>> On 01/30/2018 01:56 PM, Casey Bodley wrote:
>>>>>> Hey Josh,
>>>>>>
>>>>>> I heard you mention in the call yesterday that you're looking into
>>>>>> this part of seastar integration. I was just reading through the
>>>>>> relevant code over the weekend, and wanted to compare notes:
>>>>>>
>>>>>>
>>>>>> in seastar, all cross-core communication goes through lockfree spsc
>>>>>> queues, which are encapsulated by 'class smp_message_queue' in
>>>>>> core/reactor.hh. all of these queues (smp::_qs) are allocated on
>>>>>> startup in smp::configure(). early in reactor::run() (which is
>>>>>> effectively each seastar thread's entrypoint), it registers a
>>>>>> smp_poller to poll all of the queues directed at that cpu
>>>>>>
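The shape of the spsc queues described above can be pictured with a toy
single-producer/single-consumer ring; this is a sketch for orientation only,
not seastar's actual smp_message_queue:

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <optional>

// Toy lock-free single-producer/single-consumer ring buffer, in the
// spirit of the queues behind seastar's smp_message_queue. Exactly one
// thread may call push() and exactly one may call pop().
template <typename T, std::size_t N>
class SpscQueue {
 public:
  bool push(T v) {  // producer core only
    std::size_t t = tail_.load(std::memory_order_relaxed);
    std::size_t next = (t + 1) % N;
    if (next == head_.load(std::memory_order_acquire)) return false;  // full
    buf_[t] = std::move(v);
    tail_.store(next, std::memory_order_release);  // publish the slot
    return true;
  }
  std::optional<T> pop() {  // consumer core only
    std::size_t h = head_.load(std::memory_order_relaxed);
    if (h == tail_.load(std::memory_order_acquire)) return std::nullopt;
    T v = std::move(buf_[h]);
    head_.store((h + 1) % N, std::memory_order_release);  // free the slot
    return v;
  }

 private:
  std::array<T, N> buf_{};
  std::atomic<std::size_t> head_{0}, tail_{0};
};
```

With one such queue per ordered (source, destination) core pair, only the
owning producer and consumer ever touch a given queue, which is what makes
the lock-free spsc discipline safe; the per-cpu poller then just pops from
every queue directed at its core.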
>>>>>> what we need is a way to inject messages into each seastar reactor
>>>>>> from arbitrary/external threads. our requirements are very similar
>>>>>> to
>>>> i think we will have a sharded<osd::PublicService> on each core. in
>>>> each instance of PublicService, we will be listening and serving
>>>> requests from external clients of cluster. the same applies to
>>>> sharded<osd::ClusterService>, which will be responsible for serving
>>>> the requests from its peers in the cluster. the control flow of a
>>>> typical OSD read request from a public RADOS client will look like:
>>>>
>>>> 1. the TCP connection is accepted by one of the listening
>>>> sharded<osd::PublicService>.
>>>> 2. decode the message
>>>> 3. the osd encapsulates the request in the message as a future, and submits
>>>> it to another core after hashing the involved pg # to the core #.
>>>> something like (in pseudo code):
>>>>     engine().submit_to(osdmap_shard, [] {
>>>>       return get_newer_osdmap(m->epoch);
>>>>       // need to figure out how to reference an "osdmap service" in
>>>> seastar.
>>>>     }).then([] (auto osdmap) {
>>>>       submit_to(pg_to_shard(m->ops.op.pg), [] {
>>>>         return pg.do_ops(m->ops);
>>>>       });
>>>>     });
>>>> 4. the core serving the involved pg (i.e. pg service) will dequeue
>>>> this request, and use read_dma() call to delegate the aio request to
>>>> the core maintaining the io queue.
>>>> 5. once the aio completes, the PublicService will continue on, with
>>>> the then() block. it will send the response back to client.
>>>>
>>>> so the question is: why do we need an mpsc queue? the nr_core*nr_core
>>>> spsc queues are good enough for us, i think.
>>>>
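The pg-to-core mapping assumed in step 3 can be sketched with a plain hash;
pg_to_shard here is a hypothetical helper, and the real mapping would be
derived from the osdmap and the configured shard count:

```cpp
#include <cstdint>
#include <functional>

// Hypothetical pg -> shard mapping for step 3 above: hash the pg id and
// reduce it modulo the number of reactor cores, so a given pg is always
// served by the same core.
inline unsigned pg_to_shard(std::uint64_t pgid, unsigned nr_shards) {
  return static_cast<unsigned>(std::hash<std::uint64_t>{}(pgid) % nr_shards);
}
```

Because the mapping is deterministic, every operation on a given pg lands on
the same core, so the pg state itself needs no locking.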
>>> Hey Kefu,
>>>
>>> That sounds entirely reasonable, but assumes that everything will be
>>> running
>>> inside of seastar from the start. We've been looking for an incremental
>>> approach that would allow us to start with some subset running inside of
>>> seastar, with a mechanism for communication between that and the osd's
>>> existing threads. One suggestion was to start with just the messenger
>>> inside
>>> of seastar, and gradually move that seastar-to-external-thread boundary
>>> further down the io path as code is refactored to support it. It sounds
>>> unlikely that we'll ever get rocksdb running inside of seastar, so the
>>> objectstore will need its own threads until there's a viable
>>> alternative.
>>>
>>> So the mpsc queue and smp::external_submit_to() interface was a strategy
>>> for passing messages into seastar from arbitrary non-seastar threads.
>>> Communication in the other direction just needs to be non-blocking (my
>>> example just signaled a condition variable without holding its mutex).
>>>
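Both directions of that seastar-to-external-thread boundary can be modelled
in plain C++. This is a toy sketch only: ExternalInbox and Completion are
invented names, and the real smp_message_queue/poller machinery is far more
involved:

```cpp
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <utility>

// Toy stand-in for the proposed external submission path: arbitrary
// non-seastar threads push work items under a mutex (mpsc), and the
// reactor's poller drains them without ever blocking (try_lock + swap).
class ExternalInbox {
 public:
  void submit(std::function<void()> fn) {  // any external thread
    std::lock_guard<std::mutex> lk(mu_);
    pending_.push_back(std::move(fn));
  }
  std::size_t poll() {  // reactor thread only; never blocks
    std::deque<std::function<void()>> batch;
    {
      std::unique_lock<std::mutex> lk(mu_, std::try_to_lock);
      if (!lk.owns_lock() || pending_.empty()) return 0;
      batch.swap(pending_);
    }
    for (auto& fn : batch) fn();  // run outside the lock
    return batch.size();
  }

 private:
  std::mutex mu_;
  std::deque<std::function<void()>> pending_;
};

// The reverse direction: the reactor flips a flag under a briefly-held
// mutex and notifies after unlocking, so only the external thread ever
// waits and the reactor never blocks on a sleeping peer.
struct Completion {
  std::mutex mu;
  std::condition_variable cv;
  bool done = false;

  void complete() {  // reactor side: quick, non-blocking
    { std::lock_guard<std::mutex> lk(mu); done = true; }
    cv.notify_one();  // signaled without holding the mutex
  }
  void wait() {  // external (non-seastar) thread
    std::unique_lock<std::mutex> lk(mu);
    cv.wait(lk, [this] { return done; });
  }
};
```

In a real integration the poll() side would be registered as a reactor
poller, exactly like the existing smp_poller for the spsc queues.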
>>> What are your thoughts on the incremental approach?
>>>
>>> Casey
>>>
>>> ps. I'd love to see more thought put into the design of the finished
>>> product,
>>> and your outline is a good start! Avi Kivity @scylladb shared one
>>> suggestion
>>> that I really liked, which was to give each shard of the osd a separate
>>> network
>>> endpoint, and add enough information to the osdmap so that clients could
>>> send their messages directly to the shard that would process them. That
>>> piece can come in later, but could eliminate some of the extra latency
>>> from
>>> your step 3.
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the
>>> body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at
>>> http://vger.kernel.org/majordomo-info.html
>
>



