Re: OOB message roll into Messenger interface


AFAICT, sending over UDP means lossy delivery.

Will that be acceptable in general?

Matt

----- Original Message -----
> From: "Haomai Wang" <haomai@xxxxxxxx>
> To: "Sage Weil" <sage@xxxxxxxxxxxx>
> Cc: "Gregory Farnum" <gfarnum@xxxxxxxxxx>, ceph-devel@xxxxxxxxxxxxxxx
> Sent: Tuesday, September 6, 2016 10:46:11 PM
> Subject: Re: OOB message roll into Messenger interface
> 
> On Wed, Sep 7, 2016 at 2:06 AM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> > On Tue, 6 Sep 2016, Gregory Farnum wrote:
> >> On Tue, Sep 6, 2016 at 7:15 AM, Haomai Wang <haomai@xxxxxxxx> wrote:
> >> > On Tue, Sep 6, 2016 at 10:06 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> >> >> On Tue, 6 Sep 2016, Haomai Wang wrote:
> >> >>> On Tue, Sep 6, 2016 at 9:17 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> >> >>> > Hi Haomai!
> >> >>> >
> >> >>> > On Sun, 4 Sep 2016, Haomai Wang wrote:
> >> >>> >> Background:
> >> >>> >> Each OSD has two heartbeat messenger instances to monitor the
> >> >>> >> front/back networks. That brings a lot of connection and message
> >> >>> >> overhead in a scaled-out cluster. We could fold these heartbeat
> >> >>> >> exchanges into the public/cluster messengers and save tons of
> >> >>> >> connections (resources).
> >> >>> >>
> >> >>> >> The heartbeat messages should then be OOB and share the same
> >> >>> >> thread/socket with the normal message channel, so they exactly
> >> >>> >> represent the heartbeat role for real IO messages. Otherwise the
> >> >>> >> heartbeat channel's status can't indicate the real IO channel's
> >> >>> >> status: different sockets use different send/recv buffers, so
> >> >>> >> the real IO channel may be blocked while the OOB channel still
> >> >>> >> looks healthy.
> >> >>> >>
> >> >>> >> Besides the OSD heartbeats, we have logical PING/PONG traffic in
> >> >>> >> the Objecter ping, WatchNotify ping, etc. For the same goal, they
> >> >>> >> could share the heartbeat message.
> >> >>> >>
> >> >>> >> In a real rbd use case, combining these ping/pong messages would
> >> >>> >> avoid thousands of messages, which means a lot of resources.
> >> >>> >>
> >> >>> >> With the heartbeat overhead reduced, we can shorten the heartbeat
> >> >>> >> interval and increase its frequency, which helps a lot with the
> >> >>> >> accuracy of cluster failure detection!
> >> >>> >
> >> >>> > I'm very excited to see this move forward!
> >> >>> >
> >> >>> >> Design:
> >> >>> >>
> >> >>> >> As discussed in Raleigh, we could define these interfaces:
> >> >>> >>
> >> >>> >> int Connection::register_oob_message(identify_op, callback,
> >> >>> >> interval);
> >> >>> >>
> >> >>> >> Users like the Objecter linger ping could register a "callback"
> >> >>> >> that generates the bufferlist carried by the heartbeat message.
> >> >>> >> "interval" indicates the user's OOB message send interval.
> >> >>> >>
> >> >>> >> "identify_op" indicates who handles the OOB info on the peer
> >> >>> >> side, e.g. "Ping", "OSDPing" or "LingerPing" as in the current
> >> >>> >> message definitions.
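[To make the proposal concrete, here is a minimal, compilable sketch of what the registration side might look like. The names register_oob_message and identify_op follow the thread's proposal; everything else -- std::string standing in for ceph::bufferlist, the registry layout, the combine step -- is an illustrative assumption, not the actual Ceph implementation.]

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Stand-in for ceph::bufferlist, for illustration only.
using bufferlist = std::string;

struct OobEntry {
  std::function<bufferlist()> generator;  // builds this layer's payload
  double interval;                        // desired send interval (seconds)
};

class Connection {
  // identify_op -> registered OOB payload generator
  std::map<std::string, OobEntry> oob_registry;
public:
  // Register a payload generator; the messenger's timer would invoke it
  // every `interval` seconds and fold the result into one heartbeat message.
  int register_oob_message(const std::string& identify_op,
                           std::function<bufferlist()> cb,
                           double interval) {
    oob_registry[identify_op] = OobEntry{std::move(cb), interval};
    return 0;
  }

  // What a messenger tick might do: combine all registered OOB parts into
  // a single outgoing payload, so multiple layers' pings share one message.
  std::vector<std::pair<std::string, bufferlist>> build_oob_payload() {
    std::vector<std::pair<std::string, bufferlist>> parts;
    for (auto& [op, entry] : oob_registry)
      parts.emplace_back(op, entry.generator());
    return parts;
  }
};
```

The peer side would hand each (identify_op, payload) pair to the matching handler via something like ms_dispatch_oob(), as described below in the thread.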
> >> >>> >
> >> >>> > This looks convenient for the simpler callers, but I worry it
> >> >>> > won't work as well for OSDPing. There's a bunch of odd locking
> >> >>> > around the heartbeat info, and the code to do the heartbeat sends
> >> >>> > already exists. I'm not sure it will simplify to a simple
> >> >>> > interval.
> >> >>>
> >> >>> Hmm, I'm not sure what the odd locking refers to. We can register
> >> >>> a callback when adding a new peer and unregister it when removing
> >> >>> the peer from "heartbeat_peers".
> >> >>>
> >> >>> The main send-message-construction callback would be extracted
> >> >>> from this loop:
> >> >>>
> >> >>>   for (map<int,HeartbeatInfo>::iterator i = heartbeat_peers.begin();
> >> >>>        i != heartbeat_peers.end();
> >> >>>        ++i) {
> >> >>>     int peer = i->first;
> >> >>>     i->second.last_tx = now;
> >> >>>     if (i->second.first_tx == utime_t())
> >> >>>       i->second.first_tx = now;
> >> >>>     dout(30) << "heartbeat sending ping to osd." << peer << dendl;
> >> >>>     i->second.con_back->send_message(
> >> >>>       new MOSDPing(monc->get_fsid(),
> >> >>>                    service.get_osdmap()->get_epoch(),
> >> >>>                    MOSDPing::PING, now));
> >> >>>     if (i->second.con_front)
> >> >>>       i->second.con_front->send_message(
> >> >>>         new MOSDPing(monc->get_fsid(),
> >> >>>                      service.get_osdmap()->get_epoch(),
> >> >>>                      MOSDPing::PING, now));
> >> >>>   }
> >> >>>
> >> >>> Only "fsid" and the osdmap epoch are required, so I don't think it
> >> >>> will block. Most of the locking oddities are in the heartbeat
> >> >>> dispatch/handle path; the sending path is clear, I think.
> >> >>
> >> >> Yeah, I guess that's fine.  I was worried about some dependency
> >> >> between who we ping and the osdmap epoch in the message (and races
> >> >> adding/removing heartbeat peers), but I think it doesn't matter.
> >> >>
> >> >> Even so, I think it would be good to expose the send_message_oob()
> >> >> interface, and do this in 2 stages so the two changes are decoupled.
> >> >> Unless there is some implementation reason why the oob message
> >> >> scheduling needs to be done inside the messenger?
> >> >
> >> > Agreed! We could remove the heartbeat messenger first!
> >> >
> >> >>
> >> >> sage
> >> >>
> >> >>> The advantage of registering callbacks is that we can combine
> >> >>> multiple layers' OOB messages into one.
> >> >>>
> >> >>> >
> >> >>> > An easier first step would be to just define a
> >> >>> > Connection::send_message_oob(Message*).  That would require almost
> >> >>> > no
> >> >>> > changes to the calling code, and avoid having to create the timing
> >> >>> > infrastructure inside AsyncMessenger...
> >> >>> >
> >> >>> > sage
> >> >>> >
> >> >>> >> void Dispatcher::ms_dispatch_oob(Message*)
> >> >>> >>
> >> >>> >> handles the OOB message by parsing each OOB part.
> >> >>> >>
> >> >>> >> So a lot of timer management on the user's side can be avoided
> >> >>> >> via the callback generator. When sending, the OOB message could
> >> >>> >> be inserted at the front of the send queue, but we can't get any
> >> >>> >> help from the kernel OOB flag, since it's really useless for
> >> >>> >> this.
> >> >>> >>
> >> >>> >> Any suggestion is welcomed!
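[The "front of the send queue" idea above can be sketched in a few lines. This is a toy model, not messenger code: strings stand in for Message*, and note that it only reorders the userspace queue -- bytes already handed to the kernel's socket buffer still go out first, which is exactly the limitation Greg raises below.]

```cpp
#include <cassert>
#include <deque>
#include <string>

// Toy send queue: normal messages append, OOB messages jump to the front.
// This mirrors the proposal's "insert at the front of the send queue";
// it does nothing about data already sitting in the kernel buffer.
class SendQueue {
  std::deque<std::string> q;
public:
  void send_message(const std::string& m)     { q.push_back(m); }
  void send_message_oob(const std::string& m) { q.push_front(m); }
  std::string pop() { std::string m = q.front(); q.pop_front(); return m; }
  bool empty() const { return q.empty(); }
};
```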
> >>
> >> Let's keep in mind the challenges of out-of-band messaging over TCP/IP.
> >>
> >> Namely, when we discussed this we couldn't figure out any way
> >> (including the TCP priority stuff, which doesn't work with the
> >> required semantics — even when it does function) to get traffic to
> >> actually go out-of-band. IB messaging systems actually have a
> >> "channels" concept that lets you do genuine OOB transmission that
> >> skips over queues and other data; TCP doesn't. In fact the best we
> >> came up with for doing this with Simple/AsyncMessenger was giving the
> >> Messenger duplicate sockets/queues/etc, which is hardly ideal.
> >>
> >> So, maybe we can remove the heartbeat messenger by giving each
> >> Connection two sockets and queues. That might even work better for the
> >> AsyncMessenger than it does for SimpleMessenger?
> >> But any implementation that orders OSD heartbeat messages behind
> >> ordinary data traffic in kernel or router buffers is probably going to
> >> fail us. :(
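[Greg's two-sockets-per-Connection idea can be modeled in a few lines. Purely illustrative: each Socket's deque stands in for a kernel send buffer, and the class and method names are assumptions. The point is simply that an OOB message on its own socket never queues behind the data backlog, unlike front-insertion on a single shared socket.]

```cpp
#include <cassert>
#include <deque>
#include <string>

// Simulated socket: the deque plays the role of the kernel send buffer.
class Socket {
  std::deque<std::string> buf;
public:
  void enqueue(const std::string& m) { buf.push_back(m); }
  std::string front() const { return buf.front(); }
  size_t queued() const { return buf.size(); }
};

// One Connection, two sockets: data traffic and heartbeats are isolated,
// so a stalled/backed-up data socket cannot delay an OOB heartbeat.
class DualSocketConnection {
  Socket data, oob;
public:
  void send_message(const std::string& m)     { data.enqueue(m); }
  void send_message_oob(const std::string& m) { oob.enqueue(m); }
  std::string next_oob() const { return oob.front(); }
  size_t data_backlog() const { return data.queued(); }
};
```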
> >
> > Oh, good point.  I didn't read that paragraph carefully.  I think we
> > should use a second socket connected to the same address for OOB messages.
> > Or possibly push them over UDP... but we'd need to define retry semantics
> > in that case.
> 
> If UDP, I think the UDP heartbeat interval should be shorter, and the
> caller should delegate the send logic to the connection...
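[For the UDP option, the retry semantics Sage mentions might look roughly like this: sequence-numbered pings, with the peer declared unhealthy only after several consecutive pings go unanswered. The class name, tick model, and threshold are assumptions for illustration, not a proposed Ceph API.]

```cpp
#include <cassert>
#include <cstdint>

// Sketch of UDP heartbeat retry semantics: datagrams may be dropped
// silently, so track sequence numbers and tolerate up to `max_missed`
// consecutive unacknowledged pings before declaring the peer unhealthy.
class UdpHeartbeat {
  uint64_t next_seq = 0;    // last ping sequence number sent
  uint64_t last_acked = 0;  // highest sequence number echoed back
  unsigned max_missed;
public:
  explicit UdpHeartbeat(unsigned max_missed) : max_missed(max_missed) {}
  uint64_t send_ping() { return ++next_seq; }  // fire-and-forget datagram
  void on_pong(uint64_t seq) {                 // pong echoes our seq
    if (seq > last_acked) last_acked = seq;
  }
  bool peer_healthy() const {
    return next_seq - last_acked < max_missed; // tolerate a few drops
  }
};
```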
> 
> >
> > sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

-- 
Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel.  734-707-0660
fax.  734-769-8938
cel.  734-216-5309