Re: MON slow ops and growing MON store

Daniel Poelzleithner <poelzi@xxxxxxxxxx> · Mon, 10 Jan 2022 17:40:34 +0100

Hi,

> Like last time, after I restarted all five MONs, the store size
> decreased and everything went back to normal. I also had to restart MGRs
> and MDSs afterwards. This starts looking like a bug to me.

In our case, we had real database corruption in RocksDB that
caused the version counters to mismatch the actual data.
I wrote a repair routine for such cases, and it worked here:

https://github.com/ceph/ceph/pull/44511

kind regards
 Daniel

Janek

On 26/02/2021 15:24, Janek Bevendorff wrote:
Since the full cluster restart and disabling logging to syslog, it's 
not a problem any more (for now).

Unfortunately, just disabling clog_to_monitors didn't have the desired
effect when I tried it yesterday. But I also believe that it is
somehow related. I could not find any specific reason for yesterday's
incident in the logs besides a few more RocksDB status and compaction
messages than usual, but those are more of a symptom.

On 26/02/2021 13:05, Mykola Golub wrote:
On Thu, Feb 25, 2021 at 08:58:01PM +0100, Janek Bevendorff wrote:

On the first MON, the command doesn’t even return, but I was able to
get a dump from the one I restarted most recently. The oldest ops
look like this:

         {
             "description": "log(1000 entries from seq 17876238 at 2021-02-25T15:13:20.306487+0100)",
             "initiated_at": "2021-02-25T20:40:34.698932+0100",
             "age": 183.762551121,
             "duration": 183.762599201,
The mon stores cluster log messages in the mon db. You mentioned
problems with osds flooding the log with messages. This looks related.

If you still observe the db growth, you may try temporarily disabling
clog_to_monitors, i.e. set for all osds:

  clog_to_monitors = false

Then see whether it stops growing and whether that helps with the slow
ops (it might make sense to restart the mons if some appear to be
stuck). You can apply the config option on the fly (without restarting
the osds, e.g. with injectargs), but when re-enabling it you will
have to restart the osds to avoid crashes due to this bug [1].

[1] https://tracker.ceph.com/issues/48946
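
For illustration, the on-the-fly change with injectargs could look
roughly like the sketch below. The helper names (run,
disable_clog_to_monitors, mon_store_size) and the DRY_RUN switch are
made up for this example, and the store path passed to mon_store_size
is an assumption that depends on how the mons are deployed. Check the
commands against your Ceph release before running them for real.

```shell
#!/bin/sh
# Sketch only. Helper names and DRY_RUN are hypothetical, not part of Ceph.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Inject clog_to_monitors=false into all running osds without a restart.
# Re-enabling it later needs an osd restart because of the bug in [1].
disable_clog_to_monitors() {
    run ceph tell 'osd.*' injectargs '--clog_to_monitors=false'
}

# Report the on-disk size of a mon store to see whether it still grows.
# $1 is the store directory; its location is deployment-specific.
mon_store_size() {
    run du -sh "$1"
}
```

Running "DRY_RUN=1 disable_clog_to_monitors" only prints the ceph
command, which is a cheap way to review it first. Calling
mon_store_size on the mon's store.db directory before and after the
change shows whether the growth has stopped.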

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx