Re: 14.2.20: Strange monitor problem eating 100% CPU

Dan van der Ster <dan@xxxxxxxxxxxxxx> · Tue, 4 May 2021 16:28:40 +0200

On Tue, May 4, 2021 at 4:21 PM Janne Johansson <icepic.dz@xxxxxxxxx> wrote:
>
> Den tis 4 maj 2021 kl 16:10 skrev Rainer Krienke <krienke@xxxxxxxxxxxxxx>:
> > Hello,
> > I am playing around with a test ceph 14.2.20 cluster. The cluster
> > consists of 4 VMs, each VM has 2 OSDs. The first three VMs vceph1,
> > vceph2 and vceph3 are monitors. vceph1 is also mgr.
> > What I did was quite simple. The cluster is in the state HEALTHY:
> > vceph2: systemctl stop ceph-osd@2
> > # let ceph repair until ceph -s reports cluster is healthy again
> > vceph2: systemctl start ceph-osd@2  # @ 15:39:15, for the logs
> > # cluster reports in ceph -s that 8 OSDs are up and in, then
> > # starts rebalancing osd.2
> > vceph2: ceph -s   # hangs forever, also when executed on vceph3 or 4
> > # mon on vceph1 eats 100% CPU permanently, the other mons ~0 %CPU
> >
> > vceph1: systemctl stop ceph-mon@vceph1 # wait ~30 sec to terminate
> > vceph1: systemctl start ceph-mon@vceph1 # Everything is OK again
> >
> > I posted the mon-log to: https://cloud.uni-koblenz.de/s/t8tWjWFAobZb5Hy
> >
> > Strangely enough, if I set "debug mon 20" before starting the experiment,
> > this bug does not show up. I also tried the very same procedure on the
> > same cluster updated to 15.2.11, but I was unable to reproduce this bug
> > in that Ceph version.
>
> I might have run into the same issue recently, except not in a test
> but on a live system, also running 14.2.20 like you. We have (for other
> reasons) some flapping OSDs, and repairs/backfills take a lot of time.
> While we might have had slightly less memory on the mons than we should
> have, they didn't OOM or anything, but we found ourselves in a situation
> where one mon would eat 100% CPU and not log anything of value at all,
> while the two others would be all but idling.
>
> Restarting the mon that was using 100% CPU would finally allow us to get
> on with the rest of the recovery.

Same question as above -- does your mgr log negative progress at debug level 4?

BTW, if you find that this is indeed what's blocking your mons, you
can work around it by running `ceph progress off` until the fixes are
released.
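
To check the mgr log for this, something like the following could help. It is
only a rough sketch: the function name `find_negative_progress` is made up for
illustration, and it assumes the relevant log lines contain the word
"progress" together with a negative number. The exact wording of those lines
is not shown in this thread.

```shell
#!/bin/sh
# Sketch only. Assumption (not confirmed in this thread): the mgr log lines of
# interest contain the word "progress" and a negative number on the same line.

# Print lines from a mgr log that mention "progress" and contain a negative
# number. Dashes inside dates (2021-05-04) or names (ceph-mgr) are skipped
# because the minus sign must not directly follow a digit, letter, '_', '.',
# or '-'.
# $1: path to a mgr log file
find_negative_progress() {
    grep -i 'progress' "$1" | grep -E '(^|[^0-9A-Za-z_.-])-[0-9]+(\.[0-9]+)?'
}
```

Usage would be along the lines of
`find_negative_progress /var/log/ceph/ceph-mgr.vceph1.log`, where the path is
an assumed default and may differ on your install. If the mgr is logging below
level 4, the lines will presumably not be there until the level is raised, for
example with `ceph config set mgr debug_mgr 4/5`. No output means no match,
not proof that the progress module is uninvolved.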

-- Dan