We just had the same problem again after a power outage that took out
62% of our cluster and three out of five MONs. Once everything was back
up, the MONs started lagging and piling up slow ops while the MON store
grew to double-digit gigabytes. It was so bad that I couldn't even list
the in-flight ops any more, because ceph daemon mon.XXX ops did not
return at all.
Like last time, after I restarted all five MONs, the store size
decreased and everything went back to normal. I also had to restart the
MGRs and MDSs afterwards. This is starting to look like a bug to me.
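For anyone who wants to check this on their own cluster, something along
these lines should do (the mon ID and store path are placeholders, adjust
them for your deployment):

    # on-disk size of a MON's RocksDB store (default path)
    du -sh /var/lib/ceph/mon/ceph-<id>/store.db

    # ask a MON to compact its store manually
    ceph tell mon.<id> compact

Watching the store.db size while the slow ops pile up makes the growth
pretty obvious.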
Janek
On 26/02/2021 15:24, Janek Bevendorff wrote:
Since the full cluster restart and disabling logging to syslog, it
hasn't been a problem any more (for now).
Unfortunately, just disabling clog_to_monitors didn't have the desired
effect when I tried it yesterday, but I still believe it is somehow
related. I could not find any specific reason for yesterday's incident
in the logs besides a few more RocksDB status and compaction messages
than usual, and those look more like a symptom than a cause.
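For reference, something like this should turn off the cluster log going
to syslog centrally (untested as written here, and assuming it was the
mon cluster log that ended up in syslog; option name per the Ceph docs):

    ceph config set mon mon_cluster_log_to_syslog false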
On 26/02/2021 13:05, Mykola Golub wrote:
On Thu, Feb 25, 2021 at 08:58:01PM +0100, Janek Bevendorff wrote:
On the first MON, the command doesn’t even return, but I was able to
get a dump from the one I restarted most recently. The oldest ops
look like this:
{
    "description": "log(1000 entries from seq 17876238 at 2021-02-25T15:13:20.306487+0100)",
    "initiated_at": "2021-02-25T20:40:34.698932+0100",
    "age": 183.762551121,
    "duration": 183.762599201,
The mon stores cluster log messages in the mon db. You mentioned
problems with osds flooding it with log messages, so that looks related.
If you still observe the db growth, you may try temporarily disabling
clog_to_monitors, i.e. set for all osds:
clog_to_monitors = false
Then see if the store stops growing and if it helps with the slow ops
(it might make sense to restart the mons if some look stuck). You can
apply the config option on the fly (without restarting the osds, e.g.
with injectargs), but when re-enabling it you will have to restart the
osds to avoid crashes due to this bug [1].
[1] https://tracker.ceph.com/issues/48946
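A sketch of how that could look (the osd.* wildcard and exact syntax may
need adjusting for your release):

    # inject the option into all running OSDs without restarting them
    ceph tell osd.* injectargs '--clog_to_monitors=false'

    # or persist it in the central config (Nautilus and later)
    ceph config set osd clog_to_monitors false

Remember that per the bug above, flipping it back to true requires
restarting the OSDs.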