On Tue, Sep 6, 2016 at 9:17 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> Hi Haomai!
>
> On Sun, 4 Sep 2016, Haomai Wang wrote:
>> Background:
>> Each osd has two heartbeat messenger instances to maintain front/back
>> network available. It brings lots of connections and messages overhead
>> in scale out cluster. Actually we can combine these heartbeat
>> exchanges to public/cluster messengers to reduce tons of
>> connections(resources).
>>
>> Then heartbeat message should be OOB and shared the same thread/socket
>> with normal message channel. So it can exactly represent the heartbeat
>> role for real IO message. Otherwise, heartbeat channel's status can't
>> indicate the real IO message channel status. Because different socket
>> uses different send buffer/recv buffer, if real io message blocked,
>> oob message may be healthy.
>>
>> Besides OSD's heartbeat things, we have logic PING/PONG lived in
>> Objecter Ping/WatchNotify Ping etc. For the same goal, they could
>> share the heartbeat message.
>>
>> In a real rbd use case env, if we combines these ping/pong messages,
>> thousands of messages could be avoided which means lots of resources.
>>
>> As we reduce the heartbeat overhead, we can reduce heartbeat interval
>> and increase frequency which help a lot to the accurate of cluster
>> failure detection!
>
> I'm very excited to see this move forward!
>
>> Design:
>>
>> As discussed in Raleigh, we could defines these interfaces:
>>
>> int Connection::register_oob_message(identitfy_op, callback, interval);
>>
>> Users like Objecter linger ping could register a "callback" which
>> generate bufferlist used to be carried by heartbeat message.
>> "interval" indicate the user's oob message's send interval.
>>
>> "identitfy_op" indicates who can handle the oob info in peer side.
>> Like "Ping", "OSDPing" or "LingerPing" as the current message define.
>
> This looks convenient for the simpler callers, but I worry it won't work
> as well for OSDPing.
> There's a bunch of odd locking around the heartbeat
> info and the code already exists to do the heartbeat sends. I'm not
> sure it will simplify to a simple interval.

Hmm, I'm not sure what the odd locking refers to. We can register the
callback when adding a new peer and unregister it when removing the peer
from "heartbeat_peers". The callback that constructs the outgoing message
would be extracted from this loop:

  for (map<int,HeartbeatInfo>::iterator i = heartbeat_peers.begin();
       i != heartbeat_peers.end();
       ++i) {
    int peer = i->first;
    i->second.last_tx = now;
    if (i->second.first_tx == utime_t())
      i->second.first_tx = now;
    dout(30) << "heartbeat sending ping to osd." << peer << dendl;
    i->second.con_back->send_message(new MOSDPing(monc->get_fsid(),
                                                  service.get_osdmap()->get_epoch(),
                                                  MOSDPing::PING,
                                                  now));
    if (i->second.con_front)
      i->second.con_front->send_message(new MOSDPing(monc->get_fsid(),
                                                     service.get_osdmap()->get_epoch(),
                                                     MOSDPing::PING,
                                                     now));
  }

Only the "fsid" and the "osdmap epoch" are required, so I don't think it
will block. I think most of the locking and odd cases live in the
heartbeat dispatch/handle path; the sending path is clear, I guess.

The advantage of registering a callback is that we can combine the OOB
messages of multiple layers into one.

>
> An easier first step would be to just define a
> Connection::send_message_oob(Message*). That would require almost no
> changes to the calling code, and avoid having to create the timing
> infrastructure inside AsyncMessenger...
>
> sage
>
>> void Dispatcher::ms_dispatch_oob(Message*)
>>
>> handle the oob message with parsing each oob part.
>>
>> So lots of timer control in user's side could be avoided via callback
>> generator. When sending, OOB message could insert the front of send
>> message queue but we can't get any help from kernel oob flag since
>> it's really useless..
>>
>> Any suggestion is welcomed!
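
To make the register/unregister callback idea above a bit more concrete,
here is a rough, standalone sketch of how per-connection OOB callbacks with
different intervals might be merged into a single heartbeat payload. It is
only an illustration under assumptions: OobRegistry, OobEntry, Payload,
register_oob, unregister_oob and collect_due are hypothetical names, not
existing Ceph interfaces, and std::string stands in for bufferlist.

```cpp
// Hypothetical sketch only: none of these types or functions exist in Ceph.
#include <cassert>
#include <chrono>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Clock = std::chrono::steady_clock;
using Payload = std::string;  // stand-in for a Ceph bufferlist (assumption)

// One registered OOB producer, keyed by an op name such as "OSDPing".
struct OobEntry {
  std::function<Payload()> make_payload;  // invoked each time the entry is due
  std::chrono::milliseconds interval;     // requested send interval
  Clock::time_point next_due;             // earliest time of the next send
};

class OobRegistry {
 public:
  // Register a producer that is due immediately.
  // Returns false if the op name is already registered.
  bool register_oob(const std::string& op, std::function<Payload()> cb,
                    std::chrono::milliseconds interval,
                    Clock::time_point now) {
    assert(interval.count() > 0);
    return entries_.emplace(op, OobEntry{std::move(cb), interval, now}).second;
  }

  // Remove a producer, e.g. when a peer leaves heartbeat_peers.
  bool unregister_oob(const std::string& op) {
    return entries_.erase(op) > 0;
  }

  // Gather the payloads of every due entry so that they can be carried by
  // one combined heartbeat message, then reschedule those entries.
  std::vector<std::pair<std::string, Payload>> collect_due(
      Clock::time_point now) {
    std::vector<std::pair<std::string, Payload>> parts;
    for (auto& kv : entries_) {
      OobEntry& e = kv.second;
      if (now < e.next_due) continue;  // not due yet
      parts.emplace_back(kv.first, e.make_payload());
      e.next_due = now + e.interval;
    }
    return parts;
  }

 private:
  std::map<std::string, OobEntry> entries_;
};
```

In this sketch the messenger would call collect_due() from its own timer
and encode the returned (op, payload) pairs into one message; the receiving
side would hand each part to whoever registered for that op. The locking
that a real OSD would need around heartbeat state is left out on purpose.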