On Tue, 4 May 2021 at 16:10, Rainer Krienke <krienke@xxxxxxxxxxxxxx> wrote:
> Hello,
> I am playing around with a test Ceph 14.2.20 cluster. The cluster
> consists of 4 VMs, and each VM has 2 OSDs. The first three VMs vceph1,
> vceph2 and vceph3 are monitors; vceph1 is also the mgr.
> What I did was quite simple. The cluster is in the state HEALTHY:
>
> vceph2: systemctl stop ceph-osd@2
> # let ceph repair until ceph -s reports the cluster is healthy again
> vceph2: systemctl start ceph-osd@2  # at 15:39:15, for the logs
> # cluster reports in ceph -s that 8 OSDs are up and in, then
> # starts rebalancing onto osd.2
> vceph2: ceph -s  # hangs forever, also if executed on vceph3 or vceph4
> # mon on vceph1 eats 100% CPU permanently, the other mons ~0% CPU
>
> vceph1: systemctl stop ceph-mon@vceph1   # takes ~30 sec to terminate
> vceph1: systemctl start ceph-mon@vceph1  # everything is OK again
>
> I posted the mon log to: https://cloud.uni-koblenz.de/s/t8tWjWFAobZb5Hy
>
> Strangely enough, if I set "debug mon 20" before starting the experiment,
> this bug does not show up. I also tried the very same procedure on the
> same cluster updated to 15.2.11, but I was unable to reproduce this bug
> in that Ceph version.

I might have run into the same issue recently, except not on a test cluster
but on a live system, also running 14.2.20 like you. We have (for other
reasons) some flapping OSDs, and repairs/backfills take a lot of time.
While we might have had slightly less memory on the mons than we should
have, they didn't OOM or anything, but we found ourselves in the situation
where one mon would eat 100% CPU, not log anything of value at all, and the
two others would be all but idle. Restarting the mon that was stuck at 100%
CPU finally allowed us to get back into the rest of the recovery.

--
May the most significant bit of your life be positive.