Re: problems with integrating seastar into ceph

i copied the etherpad doc into this mail and replied inline, so it'd be
easier for us to discuss it over the mailing list.

On Wed, May 30, 2018 at 3:15 PM, kefu chai <tchaikov@xxxxxxxxx> wrote:
> How Ceph cooperates with Seastar
>
> Ceph uses traditional threads which share the physical CPU resources,
> whereas each seastar thread monopolizes a CPU core and has memory
> reserved locally for that CPU.
> There are several ways to combine the Ceph OSD and Seastar:
>
> Scenario 1: Seastar threads + traditional threads
> Seastar threads monopolize one or more logical CPU cores depending on
> how heavy the async i/o load is (i.e., how many NIC queues there are),
> which can be set in the configuration file; Ceph threads share the
> remaining cores.
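> For illustration, a minimal sketch of confining the reactor to a subset
> of cores (--smp and --cpuset are standard seastar options; pinning the
> ceph threads to the leftover cores is an assumption of this sketch):
>
>   #include <seastar/core/app-template.hh>
>   #include <seastar/core/future.hh>
>
>   int main(int argc, char** argv) {
>     // e.g. run as: ./crimson-osd --smp 2 --cpuset 0,1
>     // seastar then owns cores 0 and 1; traditional ceph threads can
>     // be pinned to the remaining cores, e.g. with
>     // pthread_setaffinity_np().
>     seastar::app_template app;
>     return app.run(argc, argv, [] {
>       return seastar::make_ready_future<>();
>     });
>   }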
>
> Step 1: Seastar only takes over the network (the messenger job): it
> passes incoming network packets to Ceph, Ceph parses each message and
> does its own work, such as reading/writing the disk, and then returns
> a response message to seastar, which sends it out over the network.
> In this setup, Ceph keeps its own locks, its multi-threading setup,
> and its PG ordering.  Since the async network i/o lives in a single
> seastar thread, the network part needs no locks in the seastar thread,
> but we may need one extra copy to transfer a packet from the seastar
> thread to a traditional ceph thread (?). Alternatively the memory
> could be shared between these threads if they use the same memory
> allocator, but then we lose the advantage of the seastar memory
> allocator, since seastar would no longer reserve per-core local memory
> for its threads. It is also unclear whether the DPDK zero-copy
> advantage survives. (?)
>
> Step 2: Seastar takes over all async i/o, including network and disk
> i/o. Seastar will occupy more CPU cores than in Step 1 (maybe NIC
> queues + number of disks), and seastar still needs to communicate with
> the ceph threads: when Ceph needs to read/write the disk, it submits
> the disk i/o to a seastar thread, and when it finishes, seastar
> returns the result to Ceph. It is unclear whether this brings better
> performance, since a ceph thread has to submit async disk i/o to a
> seastar thread to read/write the disk; we also still have the same
> memory issue as above.  If there are fewer seastar threads than (NIC
> queues + number of disks), we can use "scheduling groups" to put
> network i/o and disk i/o into different scheduling groups, balancing
> CPU time between the different kinds of i/o, as sketched below.
> The ceph threads still keep their own locks and multi-threading setup.
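> A minimal sketch of that split, assuming two groups with made-up share
> values (create_scheduling_group() and with_scheduling_group() are real
> seastar APIs, though header locations vary between seastar versions):
>
>   #include <seastar/core/scheduling.hh>
>   #include <seastar/core/with_scheduling_group.hh>
>   #include <seastar/core/future.hh>
>
>   seastar::future<> demo_groups() {
>     // shares are relative weights: 400:600 gives disk i/o 1.5x the
>     // cpu time of network i/o when both groups are runnable
>     return seastar::create_scheduling_group("net-io", 400).then(
>         [](seastar::scheduling_group net_sg) {
>       return seastar::with_scheduling_group(net_sg, [] {
>         return seastar::make_ready_future<>();  // e.g. poll NIC queue
>       });
>     }).then([] {
>       return seastar::create_scheduling_group("disk-io", 600).then(
>           [](seastar::scheduling_group disk_sg) {
>         return seastar::with_scheduling_group(disk_sg, [] {
>           return seastar::make_ready_future<>();  // e.g. submit disk i/o
>         });
>       });
>     });
>   }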
>
> In this case, we have two problems to solve: 1. how does a seastar
> thread communicate with a traditional thread?  2. how do these two
> kinds of threads use memory?
> The communication between seastar threads and traditional threads can
> be shown as:
>
>                    alien::smp::poll_queues                 alien::submit_to
>   seastar thread <------------------------ message queue <----------------- traditional thread (ceph thread)
>
>                std::condition_variable::notify_all() or std::async()
>   seastar thread -----------------------------------------------------> traditional thread (ceph thread)
>
> The communication differs in the two directions because a seastar
> thread listens for events in the reactor, so to send it messages you
> need to use a method that the reactor understands.  Non-seastar
> threads wait for events in the standard ways, so you can use the
> standard ways to wake them, as illustrated below.
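> A minimal sketch of both directions (assuming the alien API of the
> form alien::submit_to(shard, func), callable from a non-reactor thread
> and returning a std::future):
>
>   #include <seastar/core/alien.hh>
>   #include <seastar/core/future.hh>
>   #include <condition_variable>
>   #include <future>
>   #include <mutex>
>
>   // ceph thread -> seastar thread: post work to shard 0; the reactor
>   // picks it up when it calls alien::smp::poll_queues().
>   std::future<int> ask_seastar() {
>     return seastar::alien::submit_to(0, [] {
>       return seastar::make_ready_future<int>(42);  // runs on shard 0
>     });
>   }
>
>   // seastar thread -> ceph thread: wake a waiting POSIX thread with
>   // standard primitives.
>   std::mutex m;
>   std::condition_variable cv;
>   bool done = false;
>
>   void notify_ceph_thread() {
>     { std::lock_guard<std::mutex> g(m); done = true; }
>     cv.notify_all();
>   }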
>
> The seastar memory allocator is seastar's own allocator; it is not
> meant for traditional threads, and it currently does not support
> seastar threads and traditional threads using it together. We have the
> following choices:
>
> 1.    All memory is managed by one allocator, which can be the seastar
> DEFAULT ALLOCATOR or a third-party allocator (such as jemalloc or
> tcmalloc); this loses the seastar allocator's advantages.
> -- With SEASTAR_DEFAULT_ALLOCATOR, seastar uses the system's regular
> malloc() and free() instead of redefining them.
> 2.    Seastar's memory allocator requires memory to be freed on the
> same core it was allocated on; if you offload operations to external
> threads that could migrate or share data between cores, that
> constraint is broken. Seastar does have a cross-CPU free mechanism
> (free_cross_cpu), but it is very slow.  That is not even the real
> issue, though: the seastar memory allocator does not support efficient
> allocation from non-reactor threads, so if the application also
> allocates significant amounts there, it will be very slow.  So we have
> to use one allocator for both seastar threads and traditional threads.
>

 i think we will be using the allocator offered by tcmalloc or glibc
before switching the whole OSD stack to seastar.

> Scenario 2: Only seastar threads:
> Step 3: Rewrite Ceph in the seastar framework. We have two issues in this case.
> 1.    A seastar task is a short-lived task aimed at async i/o, while
> Ceph also has long-running work to do besides async i/o. Seastar was
> designed as an I/O machine: seastar programs usually launch and
> consume asynchronous operations, where an asynchronous operation
> launches other operations on another logical core, or starts a disk
> I/O, or performs a network operation (perhaps launching an operation
> on a remote node via RPC).  The code fragments between the operations
> are usually short, and the focus is on keeping many asynchronous
> operations in flight.
> But there is a good precedent: Scylla uses seastar and is long-running
> too. It has extreme latency requirements and thus tries to keep task
> times short, but seastar also has threads, and tasks can yield. So we
> can use it as a reference.
>
> Seastar threads provide an execution environment where blocking is
> tolerated; you can issue I/O and wait for it in the same function,
> rather than establishing a callback to be called with
> future<>::then().
> Seastar threads are not the same as operating system threads:
> - seastar threads are cooperative; they are never preempted except at
>   blocking points (see below)
> - seastar threads always run on the same core they were launched on
> - like other seastar code, seastar threads may not issue blocking
>   system calls.
> A seastar thread blocking point is any function that returns a
> future<>. You block by calling future<>::get(); this waits for the
> future to become available, and in the meanwhile other seastar
> threads and seastar non-threaded code may execute.
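> For example, a minimal sketch of blocking-style code in a seastar
> thread (seastar::async() runs its function inside a seastar thread):
>
>   #include <seastar/core/thread.hh>
>   #include <seastar/core/sleep.hh>
>   #include <seastar/core/future.hh>
>   #include <chrono>
>
>   seastar::future<> blocking_style_work() {
>     return seastar::async([] {
>       // get() is a blocking point: this seastar thread is suspended,
>       // while other tasks on the same core keep running.
>       seastar::sleep(std::chrono::seconds(1)).get();
>       // resumes here, still on the core it was launched on
>     });
>   }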

in the long term, we will move over to seastar threads. but the rocksdb
part could be an exception, as it uses POSIX primitives for locks
and multi-tasking. until we have seastore, we need to reserve
some cores (POSIX threads) for rocksdb. in the short term, though, we
will follow the model used by test_alien_echo.cc, as you illustrated in
the diagram. see https://github.com/tchaikov/ceph/blob/wip-seastar/src/test/crimson/test_alien_echo.cc.

>
> 2.    In this case, we have to remove many Ceph locks (seastar threads
> may not issue blocking system calls, but they may block using
> future<>::get()). Ceph arranges disk i/o by placement group, and
> within a placement group the disk i/o (reads/writes to multiple files)
> must stay ordered: in one PG, the disk i/o that is issued first should
> have its continuation executed first, unlike seastar today, where the
> disk i/o that finishes first has its continuation executed first. So
> we need to modify the seastar task scheduler to guarantee the i/o
> ordering within a PG.  We also need to guarantee the read/write
> ordering for each object file.
> For the object file read/write ordering, seastar provides a file
> stream interface. From the code, it can guarantee ordering across
> multiple reads, but it is unclear whether it does so for mixed
> read/write operations.
> If one OSD has only one reactor thread (one event loop), the thread
> will not be interrupted after receiving a network packet except for
> disk i/o, so we need no locks elsewhere. We can use a scheduling group
> per PG, and within one PG we can arrange the promises in a priority
> queue (first in, first out) for the async disk i/o, and make sure that
> when checking for async i/o completion, we check the promise priority
> first. Then the continuation of the disk i/o that was issued first
> will be executed first.
>    -----------        -----------        -----------
>    |promise 3|        |promise 3|        |promise 3|       |
>    |promise 2|        |promise 2|        |promise 2|       |  priority
>    |promise 1|        |promise 1|        |promise 1|       |
>    -----------        -----------        -----------       V
>    scheduling         scheduling         scheduling
>    group for pg 1     group for pg 2     group for pg 3
>

yeah, i think we could follow the model of io_queue, but instead of
using a weighted fair queue, we will have a simple FIFO i/o queue. so
all access to the underlying backend storage will need to go through
this queue. but i don't think we need to hack seastar's scheduler; we
just need to design an internal service in seastar:

 1. use a seastar::async_sharded_service<> ("io_sequencer" for short)
for maintaining the queues for each pg
 2. invoke_on() the core where the pg is served, for serving the
i/o request. the func to be "invoked" will enqueue the i/o request and
its "prepare_io" func to the FIFO queue of that PG.
 3. the io_sequencer::run() can be implemented using a
seastar::semaphore, and will use seastar::repeat() to 1) wait on the
FIFO queue, 2) issue the i/o request at the front of that queue, and 3)
block on the i/o response.

does this make sense?
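
to make step 3 concrete, here is a rough per-PG sketch of that drain
loop (the names io_sequencer and io_request are placeholders, not
actual crimson code; the per-PG sharding and the invoke_on() plumbing
are omitted):

  #include <seastar/core/future.hh>
  #include <seastar/core/loop.hh>      // seastar::repeat(); older
                                       // versions: core/future-util.hh
  #include <seastar/core/semaphore.hh>
  #include <seastar/core/circular_buffer.hh>
  #include <functional>

  using io_request = std::function<seastar::future<>()>;

  class io_sequencer {
    seastar::circular_buffer<io_request> _queue;  // FIFO queue of one PG
    seastar::semaphore _pending{0};               // counts queued requests
  public:
    void enqueue(io_request req) {
      _queue.push_back(std::move(req));
      _pending.signal();                          // wake the drain loop
    }
    // drain loop: wait for a request, issue the one at the front, and
    // block on its completion before starting the next, so completions
    // are delivered in submission order within the PG. runs forever,
    // as a service loop would.
    seastar::future<> run() {
      return seastar::repeat([this] {
        return _pending.wait().then([this] {
          auto req = std::move(_queue.front());
          _queue.pop_front();
          return req().then([] {
            return seastar::stop_iteration::no;
          });
        });
      });
    }
  };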

> The seastar reactor would then execute promise continuations according
> to the queue (first in, first out), which requires modifying the
> seastar scheduler.
> And how do we use seastar threads in this situation? What jobs can be
> thrown to a seastar thread?
>

i don't think we will be using {input,output}_stream for serializing
the i/o; it just does not offer the random read/write functions we
need in Ceph for accessing the persistent storage. and i don't really
think scheduling groups help with solving the read-write ordering
problem: a scheduling group is seastar's way of allocating CPU cycles
to tasks, by dividing tasks into different groups and assigning them
different "shares" of the CPU. probably i am missing something, though.


-- 
Regards
Kefu Chai