Re: problem returning mon back to cluster

Harald Staub <harald.staub@xxxxxxxxx> · Mon, 14 Oct 2019 13:40:19 +0200

Probably the same problem here. When I try to add another MON, "ceph health" 
becomes mostly unresponsive. One of the existing ceph-mon processes uses 
100% CPU for several minutes. I tried it on 2 test clusters (14.2.4, 3 
MONs, 5 storage nodes with around 2 HDD OSDs each). To avoid errors like 
"lease timeout", I temporarily increased "mon lease" from 5 to 50 seconds.
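
For reference, the temporary lease change could be scripted roughly as follows. This is only a sketch: it assumes the centralized "ceph config set" / "ceph config rm" commands and the option name mon_lease, and the helper names are made up for illustration. The functions only build argument lists and default to a dry run, so nothing touches a cluster unless you ask for it. Whether running monitors pick up the new value without a restart is something to verify on a test cluster first.

```python
import subprocess

DEFAULT_MON_LEASE = 5   # seconds, the original value mentioned above
RAISED_MON_LEASE = 50   # seconds, the temporary value mentioned above


def set_mon_lease_cmd(seconds):
    """Build the command that overrides mon_lease for all monitors."""
    if seconds <= 0:
        raise ValueError("mon_lease must be positive")
    return ["ceph", "config", "set", "mon", "mon_lease", str(seconds)]


def reset_mon_lease_cmd():
    """Build the command that removes the override again."""
    return ["ceph", "config", "rm", "mon", "mon_lease"]


def run(cmd, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    print(" ".join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run(set_mon_lease_cmd(RAISED_MON_LEASE))
    # ... add the new MON here and wait for it to join the quorum ...
    run(reset_mon_lease_cmd())
```

Removing the override at the end, rather than setting the value back to 5 by hand, lets the cluster fall back to whatever default it ships with.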

Not sure how bad it is from a customer PoV. But being without "ceph 
health" for several minutes is a problem in itself, at a time when there 
is an increased risk of losing the quorum ...
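
To put a number on how long "ceph health" stays unresponsive, a small probe loop like the one below might help. It is a sketch under assumptions: the function name, the 10 second timeout and the 5 second interval are arbitrary choices for illustration, and it simply treats a timeout or a non-zero exit code as "no answer".

```python
import subprocess
import time


def probe(cmd=("ceph", "health"), timeout=10.0):
    """Run cmd once and return (ok, elapsed_seconds).

    ok is False if the command timed out, exited non-zero,
    or could not be started at all.
    """
    start = time.monotonic()
    try:
        result = subprocess.run(
            list(cmd),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
        ok = result.returncode == 0
    except (subprocess.TimeoutExpired, OSError):
        ok = False
    return ok, time.monotonic() - start


if __name__ == "__main__":
    # Probe every 5 seconds for roughly 10 minutes and log each result.
    for _ in range(120):
        ok, elapsed = probe()
        status = "ok" if ok else "NO ANSWER"
        print(f"{time.strftime('%H:%M:%S')} {status} after {elapsed:.1f}s")
        time.sleep(5)
```

Started shortly before the new MON is added, the output gives a rough timeline of when the cluster stopped answering and when it came back.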

 Harry

On 13.10.19 20:26, Nikola Ciprich wrote:
dear ceph users and developers,

on one of our production clusters, we got into a pretty unpleasant situation.

After rebooting one of the nodes, when trying to start the monitor, the whole cluster
seems to hang, including IO, ceph -s etc. When this mon is stopped again,
everything seems to continue. Trying to spawn a new monitor leads to the same problem
(even on a different node).

I had to give up after minutes of outage, since it's unacceptable. I think we had this
problem once in the past on this cluster, but after some (much shorter) time, the monitor
joined and it has worked fine since then.

All cluster nodes are CentOS 7 machines, I have 3 monitors (so 2 are now running), and I'm
using ceph 13.2.6.

Network connection seems to be fine.

Has anyone seen a similar problem? I'd be very grateful for tips on how to debug and solve this.

For those interested, here's a log from one of the running monitors with debug_mon set to 10/10:

https://storage.lbox.cz/public/d258d0
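
For anyone wanting to collect a comparable log: a debug setting such as 10/10 is a pair, log level first and in-memory level second. The sketch below validates such a value and builds a command that sets it through the monitor's local admin socket ("ceph daemon ... config set"), which is assumed here because it does not depend on the quorum answering. The monitor ID "a" and the helper names are placeholders, not taken from the cluster above.

```python
def parse_debug_level(value):
    """Split a "LOG/MEMORY" debug setting such as "10/10" into two ints."""
    parts = value.split("/")
    if len(parts) != 2:
        raise ValueError("expected the form LOG/MEMORY, e.g. 10/10")
    log_level, memory_level = int(parts[0]), int(parts[1])
    if log_level < 0 or memory_level < 0:
        raise ValueError("debug levels must not be negative")
    return log_level, memory_level


def debug_mon_cmd(mon_id, value):
    """Build the admin-socket command that sets debug_mon on one monitor.

    Must be run on the host where that monitor's daemon is running.
    """
    log_level, memory_level = parse_debug_level(value)
    return [
        "ceph", "daemon", f"mon.{mon_id}",
        "config", "set", "debug_mon", f"{log_level}/{memory_level}",
    ]


if __name__ == "__main__":
    # "a" is a hypothetical monitor ID; substitute the real one.
    print(" ".join(debug_mon_cmd("a", "10/10")))
```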

If I can provide more info, please let me know.

with best regards

nikola ciprich

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com