Janek,

This error should already have pointed you in the right direction:
"possible clock skew"

The date/times on your nodes are probably too far apart.
Make sure all your nodes are time-synced using NTP.

Kind regards,
Caspar

On Wed, 11 Mar 2020 at 09:47, Janek Bevendorff
<janek.bevendorff@xxxxxxxxxxxxx> wrote:

> Additional information: I just found this in the logs of one failed MGR:
>
> 2020-03-11 09:32:55.265 7f59dcb94700 -1 monclient: _check_auth_rotating
> possible clock skew, rotating keys expired way too early (before
> 2020-03-11 08:32:55.268325)
>
> It's the same message that used to appear previously when MGRs crashed,
> so perhaps the overall issue is still the same, just massively accelerated.
>
>
> On 11/03/2020 09:43, Janek Bevendorff wrote:
> > Hi,
> >
> > I've always had some MGR stability issues with daemons crashing at
> > random times, but since the upgrade to 14.2.8 they regularly stop
> > responding after some time until I restart them (which I have to do at
> > least once a day).
> >
> > I noticed right after the upgrade that the prometheus module was
> > entirely unresponsive and ceph fs status took about half a minute to
> > return. Once all the cluster chatter had settled and the PGs had been
> > rebalanced (auto-scale was messing with PGs after the upgrade), it
> > became usable again, but everything's still slower than before.
> > Prometheus takes several seconds to list metrics, ceph fs status takes
> > about 1-2 seconds.
> >
> > However, after some time, MGRs stop responding and are kicked from the
> > list of standbys. With log level 5, all they write to the log
> > files is this:
> >
> > 2020-03-11 09:30:40.539 7f8f88984700 4 mgr[prometheus]
> > ::ffff:xxx.xxx.xxx.xxx - - [11/Mar/2020:09:30:40] "GET /metrics
> > HTTP/1.1" 200 - "" "Prometheus/2.15.2"
> > 2020-03-11 09:30:41.371 7f8f9ee62700 4 mgr send_beacon standby
> > 2020-03-11 09:30:43.392 7f8f9ee62700 4 mgr send_beacon standby
> > 2020-03-11 09:30:45.412 7f8f9ee62700 4 mgr send_beacon standby
> > 2020-03-11 09:30:47.436 7f8f9ee62700 4 mgr send_beacon standby
> > 2020-03-11 09:30:49.460 7f8f9ee62700 4 mgr send_beacon standby
> >
> > I have seen another email on this list complaining about slow ceph fs
> > status; I believe this issue is connected.
> >
> > Besides the standard always-on modules I have enabled the prometheus,
> > dashboard, and telemetry modules.
> >
> > Best
> > Janek

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
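
As a rough aid for the "possible clock skew" message quoted in this thread, here is a minimal Python sketch that extracts the two timestamps from such a log line and reports the gap between them. The regular expression is an assumption derived only from the single sample line in this thread, and `skew_gap_seconds` is a hypothetical helper name, not part of Ceph. It only quantifies the gap reported in the log; it does not replace checking that NTP is actually running and synced on every node.

```python
import re
from datetime import datetime

# The log line quoted in this thread, joined onto a single line.
LOG_LINE = (
    "2020-03-11 09:32:55.265 7f59dcb94700 -1 monclient: _check_auth_rotating "
    "possible clock skew, rotating keys expired way too early "
    "(before 2020-03-11 08:32:55.268325)"
)

# Assumed pattern, based only on the one sample line above.
PATTERN = re.compile(
    r"^(?P<logged>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) .*"
    r"possible clock skew.*"
    r"\(before (?P<cutoff>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\)"
)

TIME_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def skew_gap_seconds(line):
    """Return the seconds between the log timestamp and the 'before'
    timestamp, or None if the line is not a clock skew message."""
    match = PATTERN.search(line)
    if match is None:
        return None
    logged = datetime.strptime(match.group("logged"), TIME_FORMAT)
    cutoff = datetime.strptime(match.group("cutoff"), TIME_FORMAT)
    return (logged - cutoff).total_seconds()


if __name__ == "__main__":
    print(skew_gap_seconds(LOG_LINE))
```

For the line from the failed MGR, the two timestamps are just under one hour apart (about 3599.997 seconds). Running the same check over the logs of each MGR host may help show whether the gap is the same everywhere.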