Sorry for nagging, but is there a solution to this? Routinely restarting
my MGRs every few hours isn't how I want to spend my time (although I
guess I could schedule a cron job for that).

On 16/03/2020 09:35, Janek Bevendorff wrote:
> Over the weekend, all five MGRs failed, which means we have no more
> Prometheus monitoring data. We are obviously monitoring the MGR status
> as well, so we can detect the failure, but it's still a pretty serious
> issue. Any ideas as to why this might happen?
>
>
> On 13/03/2020 16:56, Janek Bevendorff wrote:
>> Indeed. I just had another MGR go bye-bye. I don't think host clock
>> skew is the problem.
>>
>>
>> On 13/03/2020 15:29, Anthony D'Atri wrote:
>>> Chrony does converge faster, but I doubt this will solve your
>>> problem if you don’t have quality peers. Or if it’s not really a
>>> time problem.
>>>
>>>> On Mar 13, 2020, at 6:44 AM, Janek Bevendorff
>>>> <janek.bevendorff@xxxxxxxxxxxxx> wrote:
>>>>
>>>> I replaced ntpd with chronyd and will let you know if it changes
>>>> anything. Thanks.
>>>>
>>>>
>>>>> On 13/03/2020 06:25, Konstantin Shalygin wrote:
>>>>>> On 3/13/20 12:57 AM, Janek Bevendorff wrote:
>>>>>> NTPd is running, all the nodes have the same time to the second.
>>>>>> I don't think that is the problem.
>>>>> As always in such cases - try to switch your ntpd to default EL7
>>>>> daemon - chronyd.
>>>>>
>>>>>
>>>>>
>>>>> k
>>>> _______________________________________________
>>>> ceph-users mailing list -- ceph-users@xxxxxxx
>>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx