We just had the same problem again after a power outage that took out
62% of our cluster and three out of five MONs. Once everything was back
up, the MONs started lagging and piling up slow ops while the MON store
grew to double-digit gigabytes. It was so bad that I couldn't even list
the in-flight ops any more, because ceph daemon mon.XXX ops did not
return at all.
Like last time, after I restarted all five MONs, the store size
decreased and everything went back to normal. I also had to restart the
MGRs and MDSs afterwards. This is starting to look like a bug to me.
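For anyone who wants to check this on their own cluster, something along
these lines should do (the mon ID and store path are placeholders, adjust
them for your deployment):

    # on-disk size of a MON's RocksDB store (default path)
    du -sh /var/lib/ceph/mon/ceph-<id>/store.db

    # ask a MON to compact its store manually
    ceph tell mon.<id> compact

Watching the store.db size while the slow ops pile up makes the growth
pretty obvious.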
Janek
On 26/02/2021 15:24, Janek Bevendorff wrote:
Since the full cluster restart and disabling logging to syslog, it
hasn't been a problem any more (for now).
Unfortunately, just disabling clog_to_monitors didn't have the desired
effect when I tried it yesterday, but I still believe it is somehow
related. I could not find any specific reason for yesterday's incident
in the logs besides a few more RocksDB status and compaction messages
than usual, and those look more like a symptom than a cause.
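For reference, something like this should turn off the cluster log going
to syslog centrally (untested as written here, and assuming it was the
mon cluster log that ended up in syslog; option name per the Ceph docs):

    ceph config set mon mon_cluster_log_to_syslog false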
On 26/02/2021 13:05, Mykola Golub wrote:
On Thu, Feb 25, 2021 at 08:58:01PM +0100, Janek Bevendorff wrote:
On the first MON, the command doesn’t even return, but I was able to
get a dump from the one I restarted most recently. The oldest ops
look like this:
{
    "description": "log(1000 entries from seq 17876238 at 2021-02-25T15:13:20.306487+0100)",
    "initiated_at": "2021-02-25T20:40:34.698932+0100",
    "age": 183.762551121,
    "duration": 183.762599201,
The mon stores cluster log messages in the mon db. You mentioned
problems with osds flooding it with log messages, so that looks related.
If you still observe the db growth, you may try temporarily disabling
clog_to_monitors, i.e. set for all osds:
clog_to_monitors = false
Then see if the store stops growing and if it helps with the slow ops
(it might make sense to restart the mons if some look stuck). You can
apply the config option on the fly (without restarting the osds, e.g.
with injectargs), but when re-enabling it you will have to restart the
osds to avoid crashes due to this bug [1].
[1] https://tracker.ceph.com/issues/48946
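A sketch of how that could look (the osd.* wildcard and exact syntax may
need adjusting for your release):

    # inject the option into all running OSDs without restarting them
    ceph tell osd.* injectargs '--clog_to_monitors=false'

    # or persist it in the central config (Nautilus and later)
    ceph config set osd clog_to_monitors false

Remember that per the bug above, flipping it back to true requires
restarting the OSDs.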