Re: MGRs failing once per day and generally slow response times

Janek Bevendorff <janek.bevendorff@xxxxxxxxxxxxx> · Mon, 16 Mar 2020 09:35:14 +0100

Over the weekend, all five MGRs failed, which means we have no more 
Prometheus monitoring data. We are obviously monitoring the MGR status 
as well, so we can detect the failure, but it's still a pretty serious 
issue. Any ideas as to why this might happen?

On 13/03/2020 16:56, Janek Bevendorff wrote:
Indeed. I just had another MGR go bye-bye. I don't think host clock 
skew is the problem.

On 13/03/2020 15:29, Anthony D'Atri wrote:
Chrony does converge faster, but I doubt this will solve your problem 
if you don’t have quality peers. Or if it’s not really a time problem.

On Mar 13, 2020, at 6:44 AM, Janek Bevendorff 
<janek.bevendorff@xxxxxxxxxxxxx> wrote:

I replaced ntpd with chronyd and will let you know if it changes 
anything. Thanks.

On 13/03/2020 06:25, Konstantin Shalygin wrote:
On 3/13/20 12:57 AM, Janek Bevendorff wrote:
NTPd is running, all the nodes have the same time to the second. I 
don't think that is the problem.

As always in such cases, try switching from ntpd to chronyd, the 
default time daemon on EL7.

k
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx

--
Bauhaus-Universität Weimar
Bauhausstr. 9a, Room 308
99423 Weimar, Germany

Phone: +49 (0)3643 - 58 3577