Re: Issues going from 1 to 3 mons


 



Anyone? I’ve tried with 0.67-rc2 and it fails/hangs. My monitor directory is now 200GB and I’m worried I’ll lose this cluster with only one monitor.

 

Every time I go from 1 to 2 monitors, the monitors hang and stop responding. I assume it loses quorum.

 

Here are the steps I take:

 

rm -rf /var/lib/ceph/mon/ceph-3

sudo mkdir /var/lib/ceph/mon/ceph-3

ceph auth get mon. -o /tmp/auth

ceph mon getmap -o /tmp/map

sudo ceph-mon -i 3 --mkfs --monmap /tmp/map --keyring /tmp/auth

ceph mon add 3 10.198.141.203:6789

ceph-mon -i 3 --public-addr 10.198.141.203:6789
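
For reference, after starting the new daemon I’ve been sanity-checking the mon state roughly like this (a sketch on my part; the socket path assumes the default /var/run/ceph location and the mon id 3 from the steps above):

# ask the new mon directly what state it thinks it is in
sudo ceph --admin-daemon /var/run/ceph/ceph-mon.3.asok mon_status

# quorum/cluster view as seen from the existing mon
ceph quorum_status
ceph -s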

 

 

 

From: Jeppesen, Nelson
Sent: Sunday, July 28, 2013 10:32 AM
To: 'Wolfgang Hennerbichler'
Cc: ceph-users@xxxxxxxxxxxxxx
Subject: RE: [ceph-users] Issues going from 1 to 3 mons

 

I’m still having issues growing from one mon to two mons with 0.61.7.

 

I see the following on the existing monitor when adding:

 

2013-07-28 10:22:28.898898 mon.0 [INF] pgmap v2799768: 59584 pgs: 59584 active+clean; 15864 MB data, 216 GB used, 40750 GB / 40967 GB avail; 73132B/s rd, 0B/s wr, 11op/s

2013-07-28 10:22:30.503057 7f9d67ba5700  0 monclient: hunting for new mon
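
When that happens, the existing mon still answers on its admin socket (a rough check on my part, assuming the default socket path and that the surviving monitor is mon.0 as in the log above):

sudo ceph --admin-daemon /var/run/ceph/ceph-mon.0.asok mon_status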

 

On the 2nd monitor (the one I just added) I see the following when starting:

 

2013-07-28 10:22:39.829381 7f58c1da7700  1 mon.1@0(synchronizing sync( requester state start )) e12 sync_obtain_latest_monmap

2013-07-28 10:22:39.829471 7f58c1da7700  1 mon.1@0(synchronizing sync( requester state start )) e12 sync_obtain_latest_monmap obtained monmap e12

2013-07-28 10:23:09.891284 7f58c25a8700  1 mon.1@0(synchronizing sync( requester state chunks )) e12 sync_timeout mon.1 10.198.141.202:6789/0

2013-07-28 10:23:09.927736 7f58c25a8700  1 mon.1@0(synchronizing sync( requester state chunks )) e12 sync_requester_abort no longer a sync requester

2013-07-28 10:23:39.823436 7f58c25a8700  0 mon.1@0(probing).data_health(0) update_stats avail 94% total 936186880 used 3460376 avail 885170920

2013-07-28 10:24:39.823608 7f58c25a8700  0 mon.1@0(probing).data_health(0) update_stats avail 94% total 936186880 used 3460376 avail 885170920

2013-07-28 10:25:39.823774 7f58c25a8700  0 mon.1@0(probing).data_health(0) update_stats avail 94% total 936186880 used 3460376 avail 885170920

2013-07-28 10:26:39.823960 7f58c25a8700  0 mon.1@0(probing).data_health(0) update_stats avail 94% total 936186880 used 3460376 avail 885170920

2013-07-28 10:27:39.824125 7f58c25a8700  0 mon.1@0(probing).data_health(0) update_stats avail 94% total 936186880 used 3460376 avail 885170920
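
To get more detail out of the stuck daemon I’ve been restarting it in the foreground with higher debug levels (the exact levels are just my guess at something useful):

# run the new mon in the foreground, logging to stderr, with verbose mon and messenger output
ceph-mon -i 3 -d --public-addr 10.198.141.203:6789 --debug-mon 20 --debug-ms 1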

 

From: Wolfgang Hennerbichler [mailto:wolfgang.hennerbichler@xxxxxxxxxxxxxxxx]
Sent: Wednesday, July 10, 2013 3:30 AM
To: Jeppesen, Nelson
Cc: ceph-users@xxxxxxxxxxxxxx
Subject: Re: Issues going from 1 to 3 mons

 

Sorry, no updates on my side. My wife just had our second baby and I'm busy with reality (changing nappies and stuff).

-- 

Sent from my mobile device


On 09.07.2013, at 22:18, "Jeppesen, Nelson" <Nelson.Jeppesen@xxxxxxxxxx> wrote:

Any updates on this? My production cluster has been running on one monitor for a while and I’m a little nervous.
 
Can I expect a fix in 0.61.5? Thank you.
 
 
> (Re-adding the list for future reference)
> 
> Wolfgang, from your log file:
> 
> 2013-06-25 14:58:39.739392 7fa329698780 -1 common/config.cc: In
> function 'void md_config_t::set_val_or_die(const char*, const
> char*)' thread 7fa329698780 time 2013-06-25 14:58:39.738501
> common/config.cc: 621: FAILED assert(ret == 0)
> 
>  ceph version 0.61.4 (1669132fcfc27d0c0b5e5bb93ade59d147e23404)
>  1: /usr/bin/ceph-mon() [0x660736]
>  2: /usr/bin/ceph-mon() [0x699d66]
>  3: (pick_addresses(CephContext*)+0x93) [0x69a1a3]
>  4: (main()+0x1e3f) [0x48256f]
>  5: (__libc_start_main()+0xed) [0x7fa3278f576d]
>  6: /usr/bin/ceph-mon() [0x4848bd]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
> needed to interpret this.
> 
> This was initially reported on ticket #5205.  Sage fixed it last
> night, for ticket #5195.  Gary reports it fixed using Sage's patch,
> and said fix was backported to the cuttlefish branch.
> 
> It's worth mentioning that the cuttlefish branch also contains a
> couple of commits that should boost monitor performance and avoid
> leveldb hangups.
> 
> We'd advise looking into #5195 (http://tracker.ceph.com/issues/5195)
> for more info.  Let us know if you decide to try the cuttlefish
> branch (on the monitors) and whether it fixes the issue for you.
> Thanks!
> 
>   -Joao

 

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

