Re: OOB message roll into Messenger interface

Sage Weil <sage@xxxxxxxxxxxx> · Tue, 6 Sep 2016 18:06:31 +0000 (UTC)

On Tue, 6 Sep 2016, Gregory Farnum wrote:
> On Tue, Sep 6, 2016 at 7:15 AM, Haomai Wang <haomai@xxxxxxxx> wrote:
> > On Tue, Sep 6, 2016 at 10:06 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> >> On Tue, 6 Sep 2016, Haomai Wang wrote:
> >>> On Tue, Sep 6, 2016 at 9:17 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> >>> > Hi Haomai!
> >>> >
> >>> > On Sun, 4 Sep 2016, Haomai Wang wrote:
> >>> >> Background:
> >>> >> Each OSD has two heartbeat messenger instances to check that the
> >>> >> front/back networks are available. This adds a lot of connection and
> >>> >> message overhead in a scale-out cluster. We could instead fold these
> >>> >> heartbeat exchanges into the public/cluster messengers and save a large
> >>> >> number of connections (resources).
> >>> >>
> >>> >> The heartbeat message should then be OOB but share the same thread/socket
> >>> >> as the normal message channel, so that it accurately reflects the state
> >>> >> of the real IO message path. Otherwise, the heartbeat channel's status
> >>> >> can't indicate the status of the real IO message channel: different
> >>> >> sockets use different send/recv buffers, so if real IO messages are
> >>> >> blocked, the OOB messages may still look healthy.
> >>> >>
> >>> >> Besides the OSD heartbeat, we have PING/PONG logic living in the
> >>> >> Objecter Ping, WatchNotify Ping, etc. They serve the same goal, so they
> >>> >> could share the heartbeat message.
> >>> >>
> >>> >> In a real rbd deployment, combining these ping/pong messages would avoid
> >>> >> thousands of messages, which saves a lot of resources.
> >>> >>
> >>> >> Once the heartbeat overhead is reduced, we can shorten the heartbeat
> >>> >> interval (increase the frequency), which helps a lot with the accuracy of
> >>> >> cluster failure detection!
> >>> >
> >>> > I'm very excited to see this move forward!
> >>> >
> >>> >> Design:
> >>> >>
> >>> >> As discussed in Raleigh, we could define these interfaces:
> >>> >>
> >>> >> int Connection::register_oob_message(identitfy_op, callback, interval);
> >>> >>
> >>> >> Users like the Objecter linger ping could register a "callback" that
> >>> >> generates the bufferlist to be carried by the heartbeat message.
> >>> >> "interval" indicates how often the user's oob message is sent.
> >>> >>
> >>> >> "identitfy_op" indicates who handles the oob info on the peer side,
> >>> >> e.g. "Ping", "OSDPing" or "LingerPing", as the current messages define.
> >>> >
> >>> > This looks convenient for the simpler callers, but I worry it won't work
> >>> > as well for OSDPing. There's a bunch of odd locking around the heartbeat
> >>> > info and the code already exists to do the heartbeat sends.  I'm not
> >>> > sure it will simplify to a simple interval.
> >>>
> >>> Hmm, I'm not sure what the odd locking refers to. We can register the
> >>> callback when adding a new peer and unregister it when removing the peer
> >>> from "heartbeat_peers".
> >>>
> >>> The main callback that constructs the message to send would be extracted
> >>> from this loop:
> >>>   for (map<int,HeartbeatInfo>::iterator i = heartbeat_peers.begin();
> >>>        i != heartbeat_peers.end();
> >>>        ++i) {
> >>>     int peer = i->first;
> >>>     i->second.last_tx = now;
> >>>     if (i->second.first_tx == utime_t())
> >>>       i->second.first_tx = now;
> >>>     dout(30) << "heartbeat sending ping to osd." << peer << dendl;
> >>>     i->second.con_back->send_message(new MOSDPing(monc->get_fsid(),
> >>>                                                   service.get_osdmap()->get_epoch(),
> >>>                                                   MOSDPing::PING,
> >>>                                                   now));
> >>>
> >>>     if (i->second.con_front)
> >>>       i->second.con_front->send_message(new MOSDPing(monc->get_fsid(),
> >>>                                                      service.get_osdmap()->get_epoch(),
> >>>                                                      MOSDPing::PING,
> >>>                                                      now));
> >>>   }
> >>>
> >>> Only the "fsid" and "osdmap epoch" are required, so I don't think it will
> >>> block. I think most of the locking/odd things live in the heartbeat
> >>> dispatch/handle path; the sending path is clear, I guess.
> >>
> >> Yeah, I guess that's fine.  I was worried about some dependency between
> >> who we ping and the osdmap epoch in the message (and races adding/removing
> >> heartbeat peers), but I think it doesn't matter.
> >>
> >> Even so, I think it would be good to expose the send_message_oob()
> >> interface, and do this in 2 stages so the two changes are decoupled.
> >> Unless there is some implementation reason why the oob message scheduling
> >> needs to be done inside the messenger?
> >
> > Agreed! We could remove the heartbeat messenger first!
> >
> >>
> >> sage
> >>
> >>> The advantage of registering a callback is that we can combine oob
> >>> messages from multiple layers into one.
> >>>
> >>> >
> >>> > An easier first step would be to just define a
> >>> > Connection::send_message_oob(Message*).  That would require almost no
> >>> > changes to the calling code, and avoid having to create the timing
> >>> > infrastructure inside AsyncMessenger...
> >>> >
> >>> > sage
> >>> >
> >>> >> void Dispatcher::ms_dispatch_oob(Message*)
> >>> >>
> >>> >> handles the oob message by parsing each oob part.
> >>> >>
> >>> >> So a lot of timer control on the user's side could be avoided via the
> >>> >> callback generator. When sending, the OOB message could be inserted at the
> >>> >> front of the send message queue, but we can't get any help from the kernel
> >>> >> oob flag since it's really useless..
> >>> >>
> >>> >> Any suggestions are welcome!
> 
> Let's keep in mind the challenges of out-of-band messaging over TCP/IP.
> 
> Namely, when we discussed this we couldn't figure out any way
> (including the TCP priority stuff, which doesn't work with the
> required semantics — even when it does function) to get traffic to
> actually go out-of-band. IB messaging systems actually have a
> "channels" concept that lets you do genuine OOB transmission that
> skips over queues and other data; TCP doesn't. In fact the best we
> came up with for doing this with Simple/AsyncMessenger was giving the
> Messenger duplicate sockets/queues/etc, which is hardly ideal.
> 
> So, maybe we can remove the heartbeat messenger by giving each
> Connection two sockets and queues. That might even work better for the
> AsyncMessenger than it does for SimpleMessenger?
> But any implementation that orders OSD heartbeat messages behind
> ordinary data traffic in kernel or router buffers is probably going to
> fail us. :(

Oh, good point.  I didn't read that paragraph carefully.  I think we 
should use a second socket connected to the same address for OOB messages.  
Or possibly push them over UDP... but we'd need to define retry semantics 
in that case.
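
To make "retry semantics" a bit more concrete, here is a rough sketch of
the bookkeeping a UDP-based OOB path might need: each ping gets a sequence
number, is resent at a fixed interval until it is acked, and is reported as
failed after a maximum number of attempts. OobRetryTracker and all of its
methods are hypothetical names for illustration only, not an existing
Messenger interface, and the fixed-interval policy is just an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical helper (not part of Ceph): tracks unacknowledged OOB pings
// sent over a lossy transport and decides when to resend or give up.
class OobRetryTracker {
public:
  OobRetryTracker(double retry_interval, int max_attempts)
    : retry_interval_(retry_interval), max_attempts_(max_attempts) {
    assert(retry_interval > 0 && max_attempts > 0);
  }

  // Queue a new ping for immediate sending; returns its sequence number.
  std::uint64_t queue(double now) {
    std::uint64_t seq = next_seq_++;
    pending_.emplace(seq, Pending{now, 0});
    return seq;
  }

  // The peer acknowledged this sequence number; stop retrying it.
  void ack(std::uint64_t seq) { pending_.erase(seq); }

  // Returns the sequence numbers that should be (re)sent at time `now`.
  // Pings that have used up all attempts without an ack are dropped and
  // appended to `failed` instead.
  std::vector<std::uint64_t> due(double now,
                                 std::vector<std::uint64_t>& failed) {
    std::vector<std::uint64_t> to_send;
    for (auto it = pending_.begin(); it != pending_.end();) {
      Pending& p = it->second;
      if (p.next_send > now) {
        ++it;
      } else if (p.attempts >= max_attempts_) {
        failed.push_back(it->first);
        it = pending_.erase(it);
      } else {
        ++p.attempts;
        p.next_send = now + retry_interval_;
        to_send.push_back(it->first);
        ++it;
      }
    }
    return to_send;
  }

  std::size_t pending_count() const { return pending_.size(); }

private:
  struct Pending {
    double next_send;  // earliest time of the next (re)send
    int attempts;      // number of sends so far
  };

  double retry_interval_;
  int max_attempts_;
  std::uint64_t next_seq_ = 1;
  std::map<std::uint64_t, Pending> pending_;
};
```

A caller would invoke due() from its existing tick, send whatever comes
back, and treat anything in `failed` the way a missed heartbeat is treated
today.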

sage