On Tue, 6 Sep 2016, Gregory Farnum wrote: > On Tue, Sep 6, 2016 at 7:15 AM, Haomai Wang <haomai@xxxxxxxx> wrote: > > On Tue, Sep 6, 2016 at 10:06 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote: > >> On Tue, 6 Sep 2016, Haomai Wang wrote: > >>> On Tue, Sep 6, 2016 at 9:17 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote: > >>> > Hi Haomai! > >>> > > >>> > On Sun, 4 Sep 2016, Haomai Wang wrote: > >>> >> Background: > >>> >> Each osd has two heartbeat messenger instances to maintain front/back > >>> >> network available. It brings lots of connections and messages overhead > >>> >> in scale out cluster. Actually we can combine these heartbeat > >>> >> exchanges to public/cluster messengers to reduce tons of > >>> >> connections(resources). > >>> >> > >>> >> Then heartbeat message should be OOB and shared the same thread/socket > >>> >> with normal message channel. So it can exactly represent the heartbeat > >>> >> role for real IO message. Otherwise, heartbeat channel's status can't > >>> >> indicate the real IO message channel status. Because different socket > >>> >> uses different send buffer/recv buffer, if real io message blocked, > >>> >> oob message may be healthy. > >>> >> > >>> >> Besides OSD's heartbeat things, we have logic PING/PONG lived in > >>> >> Objecter Ping/WatchNotify Ping etc. For the same goal, they could > >>> >> share the heartbeat message. > >>> >> > >>> >> In a real rbd use case env, if we combines these ping/pong messages, > >>> >> thousands of messages could be avoided which means lots of resources. > >>> >> > >>> >> As we reduce the heartbeat overhead, we can reduce heartbeat interval > >>> >> and increase frequency which help a lot to the accurate of cluster > >>> >> failure detection! > >>> > > >>> > I'm very excited to see this move forward! > >>> > > >>> >> Design: > >>> >> > >>> >> As discussed in Raleigh, we could defines these interfaces: > >>> >> > >>> >> int Connection::register_oob_message(identitfy_op, callback, interval); > >>> >> > >>> >> Users like Objecter linger ping could register a "callback" which > >>> >> generate bufferlist used to be carried by heartbeat message. > >>> >> "interval" indicate the user's oob message's send interval. > >>> >> > >>> >> "identitfy_op" indicates who can handle the oob info in peer side. > >>> >> Like "Ping", "OSDPing" or "LingerPing" as the current message define. > >>> > > >>> > This looks convenient for the simpler callers, but I worry it won't work > >>> > as well for OSDPing. There's a bunch of odd locking around the heartbeat > >>> > info and the code already exists to do the the heartbeat sends. I'm not > >>> > sure it will simplify to a simple interval. > >>> > >>> Hmm, I'm not sure what's the odd locking thing refer to. As we can > >>> register callback when adding new peer and unregister callback when > >>> removing peer from "heartbeat_peers". > >>> > >>> The main send message construct callback extract from this loop: > >>> for (map<int,HeartbeatInfo>::iterator i = heartbeat_peers.begin(); > >>> i != heartbeat_peers.end(); > >>> ++i) { > >>> int peer = i->first; > >>> i->second.last_tx = now; > >>> if (i->second.first_tx == utime_t()) > >>> i->second.first_tx = now; > >>> dout(30) << "heartbeat sending ping to osd." << peer << dendl; > >>> i->second.con_back->send_message(new MOSDPing(monc->get_fsid(), > >>> service.get_osdmap()->get_epoch(), > >>> MOSDPing::PING, > >>> now)); > >>> > >>> if (i->second.con_front) > >>> i->second.con_front->send_message(new MOSDPing(monc->get_fsid(), > >>> service.get_osdmap()->get_epoch(), > >>> MOSDPing::PING, > >>> now)); > >>> } > >>> > >>> Only "fsid", "osdmap epoch" are required, I don't think it will block. > >>> Then I think lots of locking/odding things exists on heartbeat > >>> dispatch/handle process. sending process is clear I guess. > >> > >> Yeah, I guess that's fine. I was worried about some dependency between > >> who we ping and the osdmap epoch in the message (and races adding/removing > >> heartbeat peers), but I think it doesn't matter. > >> > >> Even so, I think it would be good to expose the send_message_oob() > >> interface, and do this in 2 stages so the two changes are decoupled. > >> Unless there is some implementation reason why the oob message scheduling > >> needs to be done inside the messenger? > > > > Agreed! we could remove heartbeat messenger firstly! > > > >> > >> sage > >> > >>> The advantage to register callback is we can combine multi layers oob > >>> messages to one. > >>> > >>> > > >>> > An easier first step would be to just define a > >>> > Connection::send_message_oob(Message*). That would require almost no > >>> > changes to the calling code, and avoid having to create the timing > >>> > infrastructure inside AsyncMessenger... > >>> > > >>> > sage > >>> > > >>> >> void Dispatcher::ms_dispatch_oob(Message*) > >>> >> > >>> >> handle the oob message with parsing each oob part. > >>> >> > >>> >> So lots of timer control in user's side could be avoided via callback > >>> >> generator. When sending, OOB message could insert the front of send > >>> >> message queue but we can't get any help from kernel oob flag since > >>> >> it's really useless.. > >>> >> > >>> >> Any suggestion is welcomed! > > Let's keep in mind the challenges of out-of-band messaging over TCP/IP. > > Namely, when we discussed this we couldn't figure out any way > (including the TCP priority stuff, which doesn't work with the > required semantics — even when it does function) to get traffic to > actually go out-of-band. IB messaging systems actually have a > "channels" concept that lets you do genuine OOB transmission that > skips over queues and other data; TCP doesn't. In fact the best we > came up with for doing this with Simple/AsyncMessenger was giving the > Messenger duplicate sockets/queues/etc, which is hardly ideal. > > So, maybe we can remove the heartbeat messenger by giving each > Connection two sockets and queues. That might even work better for the > AsyncMessenger than it does for SimpleMessenger? > But any implementation that orders OSD heartbeat messages behind > ordinary data traffic in kernel or router buffers is probably going to > fail us. :( Oh, good point. I didn't read that paragraph carefully. I think we should use a second socket connected to the same address for OOB messages. Or possibly push them over UDP... but we'd need to define retry semantics in that case. sage