Re: periodically delays when one of mons dies

ruslan usifov <ruslan.usifov@xxxxxxxxx> · Fri, 23 Mar 2012 12:57:47 +0400

2012/3/22 Greg Farnum <gregory.farnum@xxxxxxxxxxxxx>:
> On Wednesday, March 21, 2012 at 8:30 AM, ruslan usifov wrote:
>> Hello
>>
>> I'm new to ceph, and perhaps misunderstand some things.
>>
>> I have a test configuration with 3 VMware machines (I am testing RBD). My setup
>> consists of:
>>
>> 3: mons
>> 3: osd
>>
>>
>> When I kill one mon (to simulate a failure), from time to time (periodically) I get
>> small delays when working with the RBD device; perhaps this happens when the client
>> tries the failed mon
>
> That's probably the case -- the tools generally pick a random monitor from the list and time out after 15 seconds if it's not responding. If you know a monitor is down you can specify one of the others to connect to with the -m option.
>
>> , is it possible to switch off this failed mon until it is fully
>> restored?
>
> You could take it out of the daemon's config file, but there's no way for new daemons to avoid trying to talk to down monitors which are in their config (unless you explicitly specify the mon to connect to, as I said above) -- the monitor is the first part of the system they talk to, so there's not a way for Ceph itself to propagate information about down mons.
>
>> perhaps pacemaker will help in this case + a failover IP + something
>> like a proxy which knows about the mon configuration (I tried haproxy but without
>> success: first of all, haproxy doesn't know which mons are alive or failed, so
>> delays will still happen; also, ceph doesn't allow this scheme, i.e. the ceph client
>> checks the IP address it connects to against what the mon sends back, and in the
>> scheme with a proxy these values don't coincide)
>
> Not quite sure what you're saying here...
> -Greg
>

Sorry for my bad English.

I mean that if, through pacemaker, we set up a fault-tolerant monitor
(a monitor that works all the time, even when a failure occurs), we prevent
the delays. In my previous post I described how I tried to solve this
problem (but without success); now I will try to describe my view of it
more clearly:

Let's imagine that we have something that always knows the mon
cluster configuration and behaves like a mon (let's call it a mon proxy),
that this thing can simply migrate from node to node on failure, and
that all daemons in our cluster (osd, mds), including the client utilities,
connect to this fault-tolerant thing (so we don't have delays).

To implement this solution I tried to use haproxy
(http://haproxy.1wt.eu/), but ceph checks the IP address of the mon to which it
tries to connect (in src/msg/SimpleMessenger.cc, line 1125):

  ldout(msgr->cct,20) << "connect read peer addr " << paddr << " on socket " << sd << dendl;
  if (peer_addr != paddr) {
    if (paddr.is_blank_ip() &&
	peer_addr.get_port() == paddr.get_port() &&
	peer_addr.get_nonce() == paddr.get_nonce()) {
      ldout(msgr->cct,0) << "connect claims to be "
	      << paddr << " not " << peer_addr << " - presumably this is the same node!" << dendl;
    } else {
      ldout(msgr->cct,0) << "connect claims to be "
	      << paddr << " not " << peer_addr << " - wrong node!" << dendl;
      goto fail;
    }
  }
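
To show why a proxy trips this check, here is a simplified standalone
sketch of the same comparison. It is not the real Ceph code: the Addr
struct, the PeerCheck enum and check_peer() are names invented only for
illustration, and an empty string stands in for a "blank" IP. As I read
the excerpt, peer_addr is the address the client dialed and paddr is the
address the remote end reports for itself.

```cpp
#include <cstdint>
#include <string>

// Hypothetical stand-in for Ceph's address type (not the real entity_addr_t).
struct Addr {
  std::string ip;        // empty string models a "blank" IP
  std::uint16_t port;
  std::uint32_t nonce;

  bool is_blank_ip() const { return ip.empty(); }
  bool operator==(const Addr& o) const {
    return ip == o.ip && port == o.port && nonce == o.nonce;
  }
  bool operator!=(const Addr& o) const { return !(*this == o); }
};

enum class PeerCheck { Match, SameNodeBlankIp, WrongNode };

// dialed  : address the client connected to (peer_addr in the excerpt)
// claimed : address the remote end reports for itself (paddr in the excerpt)
PeerCheck check_peer(const Addr& dialed, const Addr& claimed) {
  if (dialed == claimed)
    return PeerCheck::Match;
  // A blank claimed IP is tolerated when port and nonce still agree.
  if (claimed.is_blank_ip() &&
      dialed.port == claimed.port &&
      dialed.nonce == claimed.nonce)
    return PeerCheck::SameNodeBlankIp;
  // Any other mismatch corresponds to the "wrong node!" branch (goto fail).
  return PeerCheck::WrongNode;
}
```

With made-up addresses: if the client dials the proxy at 192.168.0.100:6789
and the mon behind it reports 192.168.0.11:6789, then check_peer() returns
PeerCheck::WrongNode, which matches the failure I see with haproxy.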

And I get an error :-(((  Also, haproxy prevents the delays only partially :-(