On Wed, Sep 7, 2016 at 2:06 AM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> On Tue, 6 Sep 2016, Gregory Farnum wrote:
>> On Tue, Sep 6, 2016 at 7:15 AM, Haomai Wang <haomai@xxxxxxxx> wrote:
>> > On Tue, Sep 6, 2016 at 10:06 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>> >> On Tue, 6 Sep 2016, Haomai Wang wrote:
>> >>> On Tue, Sep 6, 2016 at 9:17 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>> >>> > Hi Haomai!
>> >>> >
>> >>> > On Sun, 4 Sep 2016, Haomai Wang wrote:
>> >>> >> Background:
>> >>> >> Each OSD has two heartbeat messenger instances to check that the
>> >>> >> front/back networks are available. This brings a lot of connection
>> >>> >> and message overhead in a scale-out cluster. We could instead fold
>> >>> >> these heartbeat exchanges into the public/cluster messengers to
>> >>> >> save a large number of connections (resources).
>> >>> >>
>> >>> >> The heartbeat message should then be OOB and share the same
>> >>> >> thread/socket as the normal message channel, so that it accurately
>> >>> >> stands in for the real IO messages. Otherwise, the heartbeat
>> >>> >> channel's status can't indicate the real IO message channel's
>> >>> >> status: different sockets use different send/recv buffers, so if
>> >>> >> a real IO message is blocked, the OOB message may still look
>> >>> >> healthy.
>> >>> >>
>> >>> >> Besides the OSD heartbeats, we have PING/PONG logic living in
>> >>> >> Objecter Ping/WatchNotify Ping etc. For the same goal, they could
>> >>> >> share the heartbeat message.
>> >>> >>
>> >>> >> In a real rbd use case, if we combine these ping/pong messages,
>> >>> >> thousands of messages could be avoided, which means a lot of
>> >>> >> resources.
>> >>> >>
>> >>> >> As we reduce the heartbeat overhead, we can shorten the heartbeat
>> >>> >> interval and increase the frequency, which helps a lot with the
>> >>> >> accuracy of cluster failure detection!
>> >>> >
>> >>> > I'm very excited to see this move forward!
>> >>> >
>> >>> >> Design:
>> >>> >>
>> >>> >> As discussed in Raleigh, we could define these interfaces:
>> >>> >>
>> >>> >> int Connection::register_oob_message(identitfy_op, callback, interval);
>> >>> >>
>> >>> >> Users like the Objecter linger ping could register a "callback" that
>> >>> >> generates the bufferlist to be carried by the heartbeat message.
>> >>> >> "interval" indicates the send interval of the user's OOB message.
>> >>> >>
>> >>> >> "identitfy_op" indicates who can handle the OOB info on the peer
>> >>> >> side, like "Ping", "OSDPing" or "LingerPing" in the current message
>> >>> >> definitions.
>> >>> >
>> >>> > This looks convenient for the simpler callers, but I worry it won't work
>> >>> > as well for OSDPing. There's a bunch of odd locking around the heartbeat
>> >>> > info and the code already exists to do the heartbeat sends. I'm not
>> >>> > sure it will simplify to a simple interval.
>> >>>
>> >>> Hmm, I'm not sure what the odd locking refers to. We can register the
>> >>> callback when adding a new peer and unregister it when removing the
>> >>> peer from "heartbeat_peers".
>> >>>
>> >>> The main message-construction callback would be extracted from this loop:
>> >>>   for (map<int,HeartbeatInfo>::iterator i = heartbeat_peers.begin();
>> >>>        i != heartbeat_peers.end();
>> >>>        ++i) {
>> >>>     int peer = i->first;
>> >>>     i->second.last_tx = now;
>> >>>     if (i->second.first_tx == utime_t())
>> >>>       i->second.first_tx = now;
>> >>>     dout(30) << "heartbeat sending ping to osd." << peer << dendl;
>> >>>     i->second.con_back->send_message(new MOSDPing(monc->get_fsid(),
>> >>>                                      service.get_osdmap()->get_epoch(),
>> >>>                                      MOSDPing::PING,
>> >>>                                      now));
>> >>>
>> >>>     if (i->second.con_front)
>> >>>       i->second.con_front->send_message(new MOSDPing(monc->get_fsid(),
>> >>>                                         service.get_osdmap()->get_epoch(),
>> >>>                                         MOSDPing::PING,
>> >>>                                         now));
>> >>>   }
>> >>>
>> >>> Only "fsid" and "osdmap epoch" are required, so I don't think it will block.
>> >>> Then I think most of the locking oddities are in the heartbeat
>> >>> dispatch/handle path. The sending path is clear, I guess.
>> >>
>> >> Yeah, I guess that's fine. I was worried about some dependency between
>> >> who we ping and the osdmap epoch in the message (and races adding/removing
>> >> heartbeat peers), but I think it doesn't matter.
>> >>
>> >> Even so, I think it would be good to expose the send_message_oob()
>> >> interface, and do this in 2 stages so the two changes are decoupled.
>> >> Unless there is some implementation reason why the oob message scheduling
>> >> needs to be done inside the messenger?
>> >
>> > Agreed! We could remove the heartbeat messenger first!
>> >
>> >>
>> >> sage
>> >>
>> >>> The advantage of registering callbacks is that we can combine OOB
>> >>> messages from multiple layers into one.
>> >>>
>> >>> >
>> >>> > An easier first step would be to just define a
>> >>> > Connection::send_message_oob(Message*). That would require almost no
>> >>> > changes to the calling code, and avoid having to create the timing
>> >>> > infrastructure inside AsyncMessenger...
>> >>> >
>> >>> > sage
>> >>> >
>> >>> >> void Dispatcher::ms_dispatch_oob(Message*)
>> >>> >>
>> >>> >> handles the OOB message by parsing each OOB part.
>> >>> >>
>> >>> >> So a lot of timer control on the user's side could be avoided via the
>> >>> >> callback generator. When sending, an OOB message could be inserted at
>> >>> >> the front of the send message queue, but we can't get any help from
>> >>> >> the kernel OOB flag since it's really useless..
>> >>> >>
>> >>> >> Any suggestion is welcomed!
>>
>> Let's keep in mind the challenges of out-of-band messaging over TCP/IP.
>>
>> Namely, when we discussed this we couldn't figure out any way
>> (including the TCP priority stuff, which doesn't work with the
>> required semantics, even when it does function) to get traffic to
>> actually go out-of-band.
>> IB messaging systems actually have a
>> "channels" concept that lets you do genuine OOB transmission that
>> skips over queues and other data; TCP doesn't. In fact the best we
>> came up with for doing this with Simple/AsyncMessenger was giving the
>> Messenger duplicate sockets/queues/etc, which is hardly ideal.
>>
>> So, maybe we can remove the heartbeat messenger by giving each
>> Connection two sockets and queues. That might even work better for the
>> AsyncMessenger than it does for SimpleMessenger?
>> But any implementation that orders OSD heartbeat messages behind
>> ordinary data traffic in kernel or router buffers is probably going to
>> fail us. :(
>
> Oh, good point. I didn't read that paragraph carefully. I think we
> should use a second socket connected to the same address for OOB messages.
> Or possibly push them over UDP... but we'd need to define retry semantics
> in that case.

If we use UDP, I think the UDP heartbeat interval should be shorter, and
the caller should delegate the send logic to the connection...

>
> sage
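
To make the callback-registration idea from the quoted design concrete, here
is a minimal, self-contained sketch of a connection-side registry that owns
the send scheduling. Almost everything in it is hypothetical: OobRegistry,
unregister_oob_message(), collect_due(), the std::string payload (standing in
for bufferlist) and the caller-supplied millisecond clock are assumptions for
illustration, not Ceph APIs. Only the register_oob_message name and its
(op, callback, interval) shape come from the proposal.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the proposed Connection::register_oob_message().
// None of these types exist in Ceph; they only illustrate the idea.
class OobRegistry {
public:
  using Payload = std::string;                 // assumption: stands in for bufferlist
  using Generator = std::function<Payload()>;  // builds one OOB part on demand
  using Part = std::pair<std::string, Payload>;

  // Register a generator under an op name ("OSDPing", "LingerPing", ...)
  // with its own send interval in milliseconds.
  void register_oob_message(const std::string& op, Generator gen,
                            uint64_t interval_ms) {
    entries_[op] = Entry{std::move(gen), interval_ms, 0, false};
  }

  // Drop a generator, e.g. when a heartbeat peer is removed.
  void unregister_oob_message(const std::string& op) { entries_.erase(op); }

  // Gather every part that is due at now_ms so that the caller can ship
  // them together as one combined heartbeat message.
  std::vector<Part> collect_due(uint64_t now_ms) {
    std::vector<Part> parts;
    for (auto& kv : entries_) {
      Entry& e = kv.second;
      // A part is due if it was never sent or its interval has elapsed.
      if (!e.sent_once || now_ms - e.last_tx_ms >= e.interval_ms) {
        parts.emplace_back(kv.first, e.gen());
        e.last_tx_ms = now_ms;
        e.sent_once = true;
      }
    }
    return parts;
  }

private:
  struct Entry {
    Generator gen;
    uint64_t interval_ms;
    uint64_t last_tx_ms;
    bool sent_once;
  };
  std::map<std::string, Entry> entries_;
};
```

A caller would register one generator per layer, for example "OSDPing" at a
short interval and "LingerPing" at a longer one, and the connection would call
collect_due(now) on each tick and send whatever comes back as a single
message. The sketch says nothing about whether that message travels on the
data socket, on a second socket, or over UDP; that question is still open in
the thread.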