Re: MGRs failing once per day and generally slow response times

Janek Bevendorff <janek.bevendorff@xxxxxxxxxxxxx> · Thu, 19 Mar 2020 17:37:22 +0100

Sorry for nagging, but is there a solution to this? Routinely restarting
my MGRs every few hours isn't how I want to spend my time (although I
guess I could schedule a cron job for that).
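
For reference, a cron-driven stopgap along those lines might look roughly like the sketch below. It is only an illustration under assumptions: that `ceph mgr stat` prints JSON containing `available` and `active_name` fields, and that the local MGR runs as a systemd unit whose name is passed as an argument. Whether such a check catches the failure mode described in this thread is not established here.

```python
#!/usr/bin/env python3
"""Stopgap MGR watchdog sketch, intended to be run periodically from cron."""
import json
import subprocess
import sys


def mgr_needs_restart(stat_json: str) -> bool:
    """Return True if the given `ceph mgr stat` output shows no usable active MGR."""
    try:
        stat = json.loads(stat_json)
    except json.JSONDecodeError:
        return True  # unparsable output is treated as a failure
    if not isinstance(stat, dict):
        return True
    # Assumption: the JSON carries "available" (bool) and "active_name" (str).
    return not (stat.get("available") and stat.get("active_name"))


def main(unit: str) -> int:
    """Query the MGR state and restart the given systemd unit if it looks down."""
    try:
        result = subprocess.run(
            ["ceph", "mgr", "stat"], capture_output=True, text=True, timeout=30
        )
        healthy = result.returncode == 0 and not mgr_needs_restart(result.stdout)
    except subprocess.TimeoutExpired:
        healthy = False  # a hanging CLI call is treated as a failure
    if healthy:
        return 0
    # Assumption: the MGR on this host is managed by systemd under `unit`.
    return subprocess.run(["systemctl", "restart", unit]).returncode


if __name__ == "__main__" and len(sys.argv) == 2:
    # The unit name must be given explicitly, e.g. ceph-mgr@<id>.service.
    sys.exit(main(sys.argv[1]))
```

A crontab entry such as `*/5 * * * * /usr/local/bin/mgr_watchdog.py ceph-mgr@<id>.service` (script path and unit name are hypothetical) would run the check every five minutes. This only masks the symptom; it does not explain why the MGRs fail.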

On 16/03/2020 09:35, Janek Bevendorff wrote:
> Over the weekend, all five MGRs failed, which means we have no more
> Prometheus monitoring data. We are obviously monitoring the MGR status
> as well, so we can detect the failure, but it's still a pretty serious
> issue. Any ideas as to why this might happen?
>
>
> On 13/03/2020 16:56, Janek Bevendorff wrote:
>> Indeed. I just had another MGR go bye-bye. I don't think host clock
>> skew is the problem.
>>
>>
>> On 13/03/2020 15:29, Anthony D'Atri wrote:
>>> Chrony does converge faster, but I doubt this will solve your
>>> problem if you don’t have quality peers. Or if it’s not really a
>>> time problem.
>>>
>>>> On Mar 13, 2020, at 6:44 AM, Janek Bevendorff
>>>> <janek.bevendorff@xxxxxxxxxxxxx> wrote:
>>>>
>>>> I replaced ntpd with chronyd and will let you know if it changes
>>>> anything. Thanks.
>>>>
>>>>
>>>>> On 13/03/2020 06:25, Konstantin Shalygin wrote:
>>>>>> On 3/13/20 12:57 AM, Janek Bevendorff wrote:
>>>>>> NTPd is running, all the nodes have the same time to the second.
>>>>>> I don't think that is the problem.
>>>>> As always in such cases, try switching from ntpd to the default EL7
>>>>> daemon, chronyd.
>>>>>
>>>>>
>>>>>
>>>>> k
>>>> _______________________________________________
>>>> ceph-users mailing list -- ceph-users@xxxxxxx
>>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>