Re: messenger refactor notes

Samuel Just <sam.just@xxxxxxxxxxx> · Sat, 9 Nov 2013 10:13:36 -0800



Currently, the messenger delivers messages to the Dispatcher
implementation from a single thread (See src/msg/DispatchQueue.h/cc).
My take away from the performance work so far is that we probably need
client IO related messages to bypass the DispatchQueue bottleneck by
allowing the thread reading the message to call directly into the
Dispatcher.  wip-queueing is a very preliminary branch implementing
this behavior for OSD ops and subops (note, this branch does not work
yet!).  The main change is to add to the Dispatcher interface
ms_can_fast_dispatch and ms_fast_dispatch.  This allows the dispatcher
implementation to designate some messages as safe to dispatch in
parallel without queueing.
-Sam

On Sat, Nov 9, 2013 at 1:18 AM, Sage Weil <sage@xxxxxxxxxxx> wrote:
> The SimpleMessenger implementation of the Messenger interface has grown
> organically over many years and is one of the cruftier bits of code in
> Ceph.  The idea of building a fresh implementation has come up several
> times in the past, but is now attracting new interest due to a desire to
> support alternative transports to TCP (infiniband!) and a desire to
> improve performance for high-end ceph backends (flash).
>
> Here is a braindump that should hopefully help kickstart the process.
>
> See msg/Messenger.h for the abstract interface.
>
> Note that several bits of this are 'legacy': the send_message, mark_down,
> and other entity_addr_t-based calls are more or less deprecated (although
> a few callers remain).  New code uses get_connection() and the
> Connection*-based calls.  Any effort should focus on these exclusively
> (and on converting old code to use these as needed).
>
> The OSD client-side already uses the new Connection* interface.  The OSD
> <-> OSD communication uses the old calls, but is already wrapped by
> methods in OSDService and can be easily converted by adding a
> hash_map<int,ConnectionRef> to that class.  There are some mark_down()
> calls in OSD that probably need to be wrapped/moved along with that
> change.
>
> As for how the implementation should probably be structured:
>
> The SimpleMessenger code uses 2 (!) threads per conection, which
> is clearly not ideal.  The new approach should probably have a clearly
> defined state model for each connection and be event driven.
>
> Connections can follow a few different modes:
>
>    - client/server lossy:  client always connects to server.  on
>      transport error, we queue a notification on both ends and discard the
>      connection state, close the socket, etc.
>    - client/server lossless: client will transparently reconnect on
>      failure.  flow and ordering of messages is preserved.  server will
>      not initiate reconnect, but will preserve connection state (message
>      sequence numbers) with the expectation that the client will
>      reconnect and continue the bidirectional stream of messages.
>    - lossless peer: nodes can connect to each other.  ordered flow of
>      messages is preserved.
>
> For RADOS, client/server lossy is used between librados clients and OSDs.
> This half of the data path and also the simplest to reason about; I
> suggest starting here.
>
> The OSD to OSD communication is lossless peer.  This is more complicated
> because the connection can be initiated from either end but the stream of
> messages needs to unified into a single stream.  This is the root of all
> of the weird connect_seq logic in SimpleMessenger (that is possibly best
> ignored in favor of an alternative approach).  The OSDs will get unique
> addresses when they start, but peers may discover a new map and open
> connections to each other at slightly different times.. which means
> accepting connections and potentially processing some messages before we
> know whether the connection is valid or old.
>
> Again, because there is some nastiness there, I would probably ignore it
> for now and focus on the client/server lossy mode as a first step.  Once
> we have a nice model for modeling connection state and efficiently
> servicing the network io, we can extend it to the other modes.  FWIW it is
> easy to swap in a new Messenger implementation for just the client-facing
> side of the OSD (for example); see ceph_osd.cc.  Same goes for the client
> code (grep for SimpleMessenger in librados/*).
>
> The last bit is that we should make sure the transport component of this
> is well abstracted.  For simplicity of development, I would suggest making
> the very first prototype simply use TCP, but make sure the interface maps
> well onto the capabilities of the high-performance transports that are
> coming soon (whether its Accelio or verbs or rsockets or whatever seems
> likely to come down the pike anytime soon).  Obviously this is up to
> whoever is doing the work, though!  As long as it is possible to slot in
> other transports without reimplementing this whole module again the next
> time around, I will be happy.
>
> One other note: Sam and others have been doing some performance profiling
> recently that points toward bottlenecks in the message delivery threads.
> I'm not sure what the implications of those findings are (if any) for the
> design of this new code.  Can someone who knows more than I do chime in?
>
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html