Re: MGRs failing once per day and generally slow response times

Caspar Smit <casparsmit@xxxxxxxxxxx> · Thu, 12 Mar 2020 12:02:04 +0100

Janek,

This error should already have pointed you in the right direction:

"possible clock skew"

The clocks on your nodes are probably too far apart.
Make sure all your nodes are time-synced using NTP.
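
If you want a quick sanity check before digging into the NTP setup, something
like the rough Python sketch below could flag nodes whose clocks deviate from
the rest. It assumes you have already collected one epoch timestamp per node at
(roughly) the same instant; the host names and sample values are made up, and
the 0.05 s threshold is my assumption based on the documented default of
mon_clock_drift_allowed, so adjust it to whatever your cluster uses.

```python
from statistics import median

# Assumption: threshold mirrors the documented default of Ceph's
# mon_clock_drift_allowed (0.05 seconds). Adjust for your cluster.
DRIFT_ALLOWED = 0.05


def skewed_nodes(samples, allowed=DRIFT_ALLOWED):
    """Return {host: offset_seconds} for hosts whose clock deviates from
    the median of all sampled clocks by more than `allowed` seconds."""
    reference = median(samples.values())
    return {
        host: round(ts - reference, 3)
        for host, ts in samples.items()
        if abs(ts - reference) > allowed
    }


# Hypothetical samples: epoch seconds reported by each node at the same moment.
samples = {
    "node-a": 1583919175.265,
    "node-b": 1583919175.270,
    "node-c": 1583919178.900,
}

for host, offset in skewed_nodes(samples).items():
    print(f"{host} is off by {offset:+.3f} s relative to the median clock")
```

Using the median as the reference means a single badly drifting node does not
shift the baseline for the others. This is only a spot check and not a
replacement for a working NTP or chrony configuration.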

Kind regards,
Caspar

On Wed, 11 Mar 2020 at 09:47, Janek Bevendorff <
janek.bevendorff@xxxxxxxxxxxxx> wrote:

> Additional information: I just found this in the logs of one failed MGR:
>
> 2020-03-11 09:32:55.265 7f59dcb94700 -1 monclient: _check_auth_rotating
> possible clock skew, rotating keys expired way too early (before
> 2020-03-11 08:32:55.268325)
>
> It's the same message that used to appear previously when MGRs crashed,
> so perhaps the overall issue is still the same, just massively accelerated.
>
>
> On 11/03/2020 09:43, Janek Bevendorff wrote:
> > Hi,
> >
> > I've always had some MGR stability issues with daemons crashing at
> > random times, but since the upgrade to 14.2.8 they regularly stop
> > responding after some time until I restart them (which I have to do at
> > least once a day).
> >
> > I noticed right after the upgrade that the prometheus module was
> > entirely unresponsive and ceph fs status took about half a minute to
> > return. Once all the cluster chatter had settled and the PGs had been
> > rebalanced (auto-scale was messing with PGs after the upgrade), it
> > became usable again, but everything's still slower than before.
> > Prometheus takes several seconds to list metrics, ceph fs status takes
> > about 1-2 seconds.
> >
> > However, after some time, MGRs stop responding and are kicked from the
> > list of standbys. With log level 5 all they are writing to the log
> > files is this:
> >
> > 2020-03-11 09:30:40.539 7f8f88984700  4 mgr[prometheus]
> > ::ffff:xxx.xxx.xxx.xxx - - [11/Mar/2020:09:30:40] "GET /metrics
> > HTTP/1.1" 200 - "" "Prometheus/2.15.2"
> > 2020-03-11 09:30:41.371 7f8f9ee62700  4 mgr send_beacon standby
> > 2020-03-11 09:30:43.392 7f8f9ee62700  4 mgr send_beacon standby
> > 2020-03-11 09:30:45.412 7f8f9ee62700  4 mgr send_beacon standby
> > 2020-03-11 09:30:47.436 7f8f9ee62700  4 mgr send_beacon standby
> > 2020-03-11 09:30:49.460 7f8f9ee62700  4 mgr send_beacon standby
> >
> > I have seen another email on this list complaining about slow ceph fs
> > status; I believe this issue is connected.
> >
> > Besides the standard always-on modules I have enabled the prometheus,
> > dashboard, and telemetry modules.
> >
> > Best
> > Janek
> > _______________________________________________
> > ceph-users mailing list -- ceph-users@xxxxxxx
> > To unsubscribe send an email to ceph-users-leave@xxxxxxx