On Tue, Sep 6, 2016 at 10:06 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote: > On Tue, 6 Sep 2016, Haomai Wang wrote: >> On Tue, Sep 6, 2016 at 9:17 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote: >> > Hi Haomai! >> > >> > On Sun, 4 Sep 2016, Haomai Wang wrote: >> >> Background: >> >> Each osd has two heartbeat messenger instances to maintain front/back >> >> network available. It brings lots of connections and messages overhead >> >> in scale out cluster. Actually we can combine these heartbeat >> >> exchanges to public/cluster messengers to reduce tons of >> >> connections(resources). >> >> >> >> Then heartbeat message should be OOB and shared the same thread/socket >> >> with normal message channel. So it can exactly represent the heartbeat >> >> role for real IO message. Otherwise, heartbeat channel's status can't >> >> indicate the real IO message channel status. Because different socket >> >> uses different send buffer/recv buffer, if real io message blocked, >> >> oob message may be healthy. >> >> >> >> Besides OSD's heartbeat things, we have logic PING/PONG lived in >> >> Objecter Ping/WatchNotify Ping etc. For the same goal, they could >> >> share the heartbeat message. >> >> >> >> In a real rbd use case env, if we combines these ping/pong messages, >> >> thousands of messages could be avoided which means lots of resources. >> >> >> >> As we reduce the heartbeat overhead, we can reduce heartbeat interval >> >> and increase frequency which help a lot to the accurate of cluster >> >> failure detection! >> > >> > I'm very excited to see this move forward! >> > >> >> Design: >> >> >> >> As discussed in Raleigh, we could defines these interfaces: >> >> >> >> int Connection::register_oob_message(identitfy_op, callback, interval); >> >> >> >> Users like Objecter linger ping could register a "callback" which >> >> generate bufferlist used to be carried by heartbeat message. >> >> "interval" indicate the user's oob message's send interval. >> >> >> >> "identitfy_op" indicates who can handle the oob info in peer side. >> >> Like "Ping", "OSDPing" or "LingerPing" as the current message define. >> > >> > This looks convenient for the simpler callers, but I worry it won't work >> > as well for OSDPing. There's a bunch of odd locking around the heartbeat >> > info and the code already exists to do the the heartbeat sends. I'm not >> > sure it will simplify to a simple interval. >> >> Hmm, I'm not sure what's the odd locking thing refer to. As we can >> register callback when adding new peer and unregister callback when >> removing peer from "heartbeat_peers". >> >> The main send message construct callback extract from this loop: >> for (map<int,HeartbeatInfo>::iterator i = heartbeat_peers.begin(); >> i != heartbeat_peers.end(); >> ++i) { >> int peer = i->first; >> i->second.last_tx = now; >> if (i->second.first_tx == utime_t()) >> i->second.first_tx = now; >> dout(30) << "heartbeat sending ping to osd." << peer << dendl; >> i->second.con_back->send_message(new MOSDPing(monc->get_fsid(), >> service.get_osdmap()->get_epoch(), >> MOSDPing::PING, >> now)); >> >> if (i->second.con_front) >> i->second.con_front->send_message(new MOSDPing(monc->get_fsid(), >> service.get_osdmap()->get_epoch(), >> MOSDPing::PING, >> now)); >> } >> >> Only "fsid", "osdmap epoch" are required, I don't think it will block. >> Then I think lots of locking/odding things exists on heartbeat >> dispatch/handle process. sending process is clear I guess. > > Yeah, I guess that's fine. I was worried about some dependency between > who we ping and the osdmap epoch in the message (and races adding/removing > heartbeat peers), but I think it doesn't matter. > > Even so, I think it would be good to expose the send_message_oob() > interface, and do this in 2 stages so the two changes are decoupled. > Unless there is some implementation reason why the oob message scheduling > needs to be done inside the messenger? Agreed! we could remove heartbeat messenger firstly! > > sage > >> The advantage to register callback is we can combine multi layers oob >> messages to one. >> >> > >> > An easier first step would be to just define a >> > Connection::send_message_oob(Message*). That would require almost no >> > changes to the calling code, and avoid having to create the timing >> > infrastructure inside AsyncMessenger... >> > >> > sage >> > >> >> void Dispatcher::ms_dispatch_oob(Message*) >> >> >> >> handle the oob message with parsing each oob part. >> >> >> >> So lots of timer control in user's side could be avoided via callback >> >> generator. When sending, OOB message could insert the front of send >> >> message queue but we can't get any help from kernel oob flag since >> >> it's really useless.. >> >> >> >> Any suggestion is welcomed! >> >> -- >> >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in >> >> the body of a message to majordomo@xxxxxxxxxxxxxxx >> >> More majordomo info at http://vger.kernel.org/majordomo-info.html >> >> >> >> >> -- >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in >> the body of a message to majordomo@xxxxxxxxxxxxxxx >> More majordomo info at http://vger.kernel.org/majordomo-info.html >> >> -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html