Hi,
I've always had some MGR stability issues with daemons crashing at
random times, but since the upgrade to 14.2.8 they regularly stop
responding after some time until I restart them (which I have to do at
least once a day).
I noticed right after the upgrade that the prometheus module was
entirely unresponsive and ceph fs status took about half a minute to
return. Once all the cluster chatter had settled and the PGs had been
rebalanced (auto-scale was messing with PGs after the upgrade), it
became usable again, but everything's still slower than before.
Prometheus takes several seconds to list metrics, ceph fs status takes
about 1-2 seconds.
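In case anyone wants to compare numbers, here is a minimal sketch of how
the Prometheus latency could be measured from a client. The host name is
a placeholder, and the port 9283 is the prometheus module's default as
far as I know, so treat both as assumptions and adjust them to your
setup. The helper names are made up for this example.

```python
import time
import urllib.request


def time_call(fn, repeats=3):
    """Return the wall-clock duration in seconds of each call to fn()."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def fetch_metrics(url, timeout=30):
    """Fetch the metrics page once and return the body size in bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return len(resp.read())


# Example use (hypothetical host, assumed default port):
#   url = "http://mgr-host.example:9283/metrics"
#   for seconds in time_call(lambda: fetch_metrics(url)):
#       print(f"{seconds:.2f} s")
```

The timing helper takes any callable, so the same loop could wrap a
subprocess call to ceph fs status instead of the HTTP request.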
However, after some time, the MGRs stop responding and are kicked from
the list of standbys. With log level 5, all they write to the log files
is this:
2020-03-11 09:30:40.539 7f8f88984700 4 mgr[prometheus] ::ffff:xxx.xxx.xxx.xxx - - [11/Mar/2020:09:30:40] "GET /metrics HTTP/1.1" 200 - "" "Prometheus/2.15.2"
2020-03-11 09:30:41.371 7f8f9ee62700 4 mgr send_beacon standby
2020-03-11 09:30:43.392 7f8f9ee62700 4 mgr send_beacon standby
2020-03-11 09:30:45.412 7f8f9ee62700 4 mgr send_beacon standby
2020-03-11 09:30:47.436 7f8f9ee62700 4 mgr send_beacon standby
2020-03-11 09:30:49.460 7f8f9ee62700 4 mgr send_beacon standby
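To make the beacon spacing in this excerpt explicit, the timestamps can
be diffed with a few lines of Python. This is only a sketch over the
lines pasted above; the function names are made up, and it assumes each
line starts with the date and time in the format shown.

```python
from datetime import datetime

LOG_LINES = """\
2020-03-11 09:30:41.371 7f8f9ee62700 4 mgr send_beacon standby
2020-03-11 09:30:43.392 7f8f9ee62700 4 mgr send_beacon standby
2020-03-11 09:30:45.412 7f8f9ee62700 4 mgr send_beacon standby
2020-03-11 09:30:47.436 7f8f9ee62700 4 mgr send_beacon standby
2020-03-11 09:30:49.460 7f8f9ee62700 4 mgr send_beacon standby
""".splitlines()


def beacon_times(lines):
    """Return the timestamps of all lines that mention send_beacon."""
    times = []
    for line in lines:
        if "send_beacon" not in line:
            continue
        # The first two whitespace-separated fields are date and time.
        date, clock = line.split()[:2]
        times.append(datetime.strptime(f"{date} {clock}", "%Y-%m-%d %H:%M:%S.%f"))
    return times


def beacon_gaps(lines):
    """Return the gaps in seconds between consecutive beacons."""
    times = beacon_times(lines)
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]


gaps = beacon_gaps(LOG_LINES)
print([round(g, 3) for g in gaps])  # [2.021, 2.02, 2.024, 2.024]
```

So the standby is still logging a beacon about every two seconds. The
log alone does not show whether the monitors actually receive them.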
I have seen another email on this list complaining about slow ceph fs
status; I believe the two issues are connected.
Besides the standard always-on modules I have enabled the prometheus,
dashboard, and telemetry modules.
Best
Janek