Re: OOB message roll into Messenger interface

Haomai Wang <haomai@xxxxxxxx> · Wed, 7 Sep 2016 10:46:11 +0800

On Wed, Sep 7, 2016 at 2:06 AM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> On Tue, 6 Sep 2016, Gregory Farnum wrote:
>> On Tue, Sep 6, 2016 at 7:15 AM, Haomai Wang <haomai@xxxxxxxx> wrote:
>> > On Tue, Sep 6, 2016 at 10:06 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>> >> On Tue, 6 Sep 2016, Haomai Wang wrote:
>> >>> On Tue, Sep 6, 2016 at 9:17 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>> >>> > Hi Haomai!
>> >>> >
>> >>> > On Sun, 4 Sep 2016, Haomai Wang wrote:
>> >>> >> Background:
>> >>> >> Each OSD has two heartbeat messenger instances to check that the
>> >>> >> front and back networks are available. This adds a lot of connection
>> >>> >> and message overhead in a scale-out cluster. We could instead fold
>> >>> >> these heartbeat exchanges into the public/cluster messengers and save
>> >>> >> a large number of connections (resources).
>> >>> >>
>> >>> >> The heartbeat message should then be OOB and share the same
>> >>> >> thread/socket as the normal message channel, so that it accurately
>> >>> >> reflects the state of the real IO message channel. Otherwise, the
>> >>> >> heartbeat channel's status can't indicate the real IO channel's
>> >>> >> status: different sockets use different send/recv buffers, so the
>> >>> >> OOB messages may look healthy while real IO messages are blocked.
>> >>> >>
>> >>> >> Besides the OSD heartbeat, we have logical PING/PONG exchanges in
>> >>> >> the Objecter ping, the WatchNotify ping, etc. For the same reason,
>> >>> >> they could share the heartbeat message.
>> >>> >>
>> >>> >> In a real RBD deployment, combining these ping/pong messages would
>> >>> >> avoid thousands of messages, which saves a lot of resources.
>> >>> >>
>> >>> >> Once the heartbeat overhead is reduced, we can shorten the heartbeat
>> >>> >> interval (increase the frequency), which greatly improves the
>> >>> >> accuracy of cluster failure detection!
>> >>> >
>> >>> > I'm very excited to see this move forward!
>> >>> >
>> >>> >> Design:
>> >>> >>
>> >>> >> As discussed in Raleigh, we could define these interfaces:
>> >>> >>
>> >>> >> int Connection::register_oob_message(identify_op, callback, interval);
>> >>> >>
>> >>> >> Users such as the Objecter linger ping could register a "callback" that
>> >>> >> generates the bufferlist to be carried by the heartbeat message.
>> >>> >> "interval" is the send interval for that user's OOB message.
>> >>> >>
>> >>> >> "identify_op" indicates who handles the OOB info on the peer side,
>> >>> >> e.g. "Ping", "OSDPing" or "LingerPing", matching the current message
>> >>> >> definitions.
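
To make the registration idea above concrete, here is a rough sketch of
how per-user callbacks with different intervals could be folded into one
heartbeat. Everything in it is hypothetical: OobRegistry, collect_due(),
the unregister call, and the use of std::string in place of bufferlist
are illustration only, not existing Ceph code, and it ignores locking.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins: Payload replaces bufferlist for this sketch.
using Clock = std::chrono::steady_clock;
using Payload = std::string;
using OobCallback = std::function<Payload()>;
using OobPart = std::pair<std::string, Payload>;  // (identify_op, data)

class OobRegistry {
 public:
  // Mirrors the proposed register_oob_message(identify_op, callback, interval).
  // Returns 0 on success, -1 if the op is already registered.
  int register_oob_message(const std::string& op, OobCallback cb,
                           std::chrono::milliseconds interval) {
    assert(interval.count() > 0);
    if (entries_.count(op)) return -1;
    Entry e;
    e.cb = std::move(cb);
    e.interval = interval;
    e.ever_sent = false;
    entries_[op] = std::move(e);
    return 0;
  }

  void unregister_oob_message(const std::string& op) { entries_.erase(op); }

  // Run every callback whose interval has elapsed at `now` and return the
  // generated parts, so one heartbeat message can carry all of them.
  std::vector<OobPart> collect_due(Clock::time_point now) {
    std::vector<OobPart> parts;
    for (auto& kv : entries_) {
      Entry& e = kv.second;
      if (!e.ever_sent || now - e.last_sent >= e.interval) {
        parts.emplace_back(kv.first, e.cb());
        e.last_sent = now;
        e.ever_sent = true;
      }
    }
    return parts;
  }

 private:
  struct Entry {
    OobCallback cb;
    std::chrono::milliseconds interval;
    Clock::time_point last_sent;
    bool ever_sent;
  };
  std::map<std::string, Entry> entries_;
};
```

With "OSDPing" registered at 1s and "LingerPing" at 5s, the first
collect_due() returns both parts, and a call 1.5s later returns only the
OSDPing part.
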
>> >>> >
>> >>> > This looks convenient for the simpler callers, but I worry it won't work
>> >>> > as well for OSDPing. There's a bunch of odd locking around the heartbeat
>> >>> > info and the code already exists to do the heartbeat sends.  I'm not
>> >>> > sure it will simplify to a simple interval.
>> >>>
>> >>> Hmm, I'm not sure what the odd locking refers to. We can register
>> >>> the callback when adding a new peer and unregister it when removing
>> >>> the peer from "heartbeat_peers".
>> >>>
>> >>> The main message-construction callback would be extracted from this loop:
>> >>>   for (map<int,HeartbeatInfo>::iterator i = heartbeat_peers.begin();
>> >>>        i != heartbeat_peers.end();
>> >>>        ++i) {
>> >>>     int peer = i->first;
>> >>>     i->second.last_tx = now;
>> >>>     if (i->second.first_tx == utime_t())
>> >>>       i->second.first_tx = now;
>> >>>     dout(30) << "heartbeat sending ping to osd." << peer << dendl;
>> >>>     i->second.con_back->send_message(
>> >>>         new MOSDPing(monc->get_fsid(),
>> >>>                      service.get_osdmap()->get_epoch(),
>> >>>                      MOSDPing::PING,
>> >>>                      now));
>> >>>
>> >>>     if (i->second.con_front)
>> >>>       i->second.con_front->send_message(
>> >>>           new MOSDPing(monc->get_fsid(),
>> >>>                        service.get_osdmap()->get_epoch(),
>> >>>                        MOSDPing::PING,
>> >>>                        now));
>> >>>   }
>> >>>
>> >>> Only the "fsid" and "osdmap epoch" are required, so I don't think it
>> >>> will block. I think most of the odd locking lives in the heartbeat
>> >>> dispatch/handle path; the sending path is straightforward, I guess.
>> >>
>> >> Yeah, I guess that's fine.  I was worried about some dependency between
>> >> who we ping and the osdmap epoch in the message (and races adding/removing
>> >> heartbeat peers), but I think it doesn't matter.
>> >>
>> >> Even so, I think it would be good to expose the send_message_oob()
>> >> interface, and do this in 2 stages so the two changes are decoupled.
>> >> Unless there is some implementation reason why the oob message scheduling
>> >> needs to be done inside the messenger?
>> >
>> > Agreed! We could remove the heartbeat messenger first!
>> >
>> >>
>> >> sage
>> >>
>> >>> The advantage of registering a callback is that we can combine OOB
>> >>> messages from multiple layers into one.
>> >>>
>> >>> >
>> >>> > An easier first step would be to just define a
>> >>> > Connection::send_message_oob(Message*).  That would require almost no
>> >>> > changes to the calling code, and avoid having to create the timing
>> >>> > infrastructure inside AsyncMessenger...
>> >>> >
>> >>> > sage
>> >>> >
>> >>> >> void Dispatcher::ms_dispatch_oob(Message*)
>> >>> >>
>> >>> >> This handles the OOB message by parsing each OOB part.
>> >>> >>
>> >>> >> The callback generator removes a lot of timer handling on the user's
>> >>> >> side. When sending, an OOB message could be inserted at the front of
>> >>> >> the send queue, but we get no help from the kernel's OOB flag, since
>> >>> >> it's really useless.
>> >>> >>
>> >>> >> Any suggestions are welcome!
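
A minimal sketch of the "insert at the front of the send queue" idea,
assuming a simple user-space deque. SendQueue, QueuedMessage and
pop_next() are made-up names for illustration; send_message_oob() only
borrows the name suggested in this thread. It keeps OOB messages in FIFO
order among themselves and places them ahead of queued normal messages.
It reorders the user-space queue only and does nothing about bytes
already sitting in kernel or router buffers.

```cpp
#include <cassert>
#include <deque>
#include <string>
#include <utility>

// Hypothetical queue entry; a real messenger would queue Message objects.
struct QueuedMessage {
  std::string name;
  bool oob;
};

class SendQueue {
 public:
  // Normal messages go to the back of the queue.
  void send_message(std::string name) {
    q_.push_back(QueuedMessage{std::move(name), false});
  }

  // OOB messages go ahead of all queued normal messages, but behind any
  // OOB messages that were queued earlier.
  void send_message_oob(std::string name) {
    auto it = q_.begin();
    while (it != q_.end() && it->oob) ++it;
    q_.insert(it, QueuedMessage{std::move(name), true});
  }

  bool empty() const { return q_.empty(); }

  // Remove and return the next message the writer would put on the wire.
  QueuedMessage pop_next() {
    assert(!q_.empty());
    QueuedMessage m = std::move(q_.front());
    q_.pop_front();
    return m;
  }

 private:
  std::deque<QueuedMessage> q_;
};
```

Queueing osd_op_1, osd_op_2 and then two pings through
send_message_oob() drains as ping_1, ping_2, osd_op_1, osd_op_2.
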
>>
>> Let's keep in mind the challenges of out-of-band messaging over TCP/IP.
>>
>> Namely, when we discussed this we couldn't figure out any way
>> (including the TCP priority stuff, which doesn't work with the
>> required semantics — even when it does function) to get traffic to
>> actually go out-of-band. IB messaging systems actually have a
>> "channels" concept that lets you do genuine OOB transmission that
>> skips over queues and other data; TCP doesn't. In fact the best we
>> came up with for doing this with Simple/AsyncMessenger was giving the
>> Messenger duplicate sockets/queues/etc, which is hardly ideal.
>>
>> So, maybe we can remove the heartbeat messenger by giving each
>> Connection two sockets and queues. That might even work better for the
>> AsyncMessenger than it does for SimpleMessenger?
>> But any implementation that orders OSD heartbeat messages behind
>> ordinary data traffic in kernel or router buffers is probably going to
>> fail us. :(
>
> Oh, good point.  I didn't read that paragraph carefully.  I think we
> should use a second socket connected to the same address for OOB messages.
> Or possibly push them over UDP... but we'd need to define retry semantics
> in that case.

If we use UDP, I think the UDP heartbeat interval should be shorter, and
the caller should delegate the send logic to the connection...
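
The retry semantics for UDP are not defined yet, so the following is only
one possible shape, under my own assumptions: the connection sends a
sequence-numbered ping every interval, counts a miss whenever the
previous ping was never acked, and reports the peer as failed after a
fixed number of consecutive misses. UdpHeartbeatTracker and all of its
methods are hypothetical names, and no real socket I/O is shown.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical per-peer bookkeeping a connection could own so that the
// caller does not have to drive retries itself.
class UdpHeartbeatTracker {
 public:
  explicit UdpHeartbeatTracker(int max_missed)
      : max_missed_(max_missed), missed_(0), seq_(0), awaiting_ack_(false) {
    assert(max_missed > 0);
  }

  // Called once per heartbeat interval. Returns the sequence number to
  // put into the next ping. If the previous ping is still unacked, it is
  // counted as lost; the new ping acts as the retry.
  std::uint64_t on_send_tick() {
    if (awaiting_ack_) ++missed_;
    awaiting_ack_ = true;
    return ++seq_;
  }

  // Called when a reply arrives. Replies to older pings are ignored.
  void on_ack(std::uint64_t seq) {
    if (seq != seq_) return;
    awaiting_ack_ = false;
    missed_ = 0;
  }

  // True once max_missed consecutive pings went unanswered.
  bool peer_failed() const { return missed_ >= max_missed_; }

 private:
  int max_missed_;
  int missed_;
  std::uint64_t seq_;
  bool awaiting_ack_;
};
```

With a shorter interval, max_missed would need to be tuned so that a
single dropped datagram does not mark the peer down.
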

>
> sage