Re: 14.2.22 dashboard periodically dies and didn't failover

But in your case the failover to another mgr succeeds, am I correct? So the dashboard is always up for you? I'm not sure why it doesn't work for me; maybe I really need to disable the module :/

Istvan Szabo
Senior Infrastructure Engineer
---------------------------------------------------
Agoda Services Co., Ltd.
e: istvan.szabo@xxxxxxxxx
---------------------------------------------------

-----Original Message-----
From: Peter Lieven <pl@xxxxxxx> 
Sent: Thursday, January 13, 2022 3:13 PM
To: Szabo, Istvan (Agoda) <Istvan.Szabo@xxxxxxxxx>; Ceph Users <ceph-users@xxxxxxx>
Cc: Daniel Tönnissen <dt@xxxxxxx>; Marco Horch <horch@xxxxxxx>
Subject: Re:  14.2.22 dashboard periodically dies and didn't failover


On 13.01.22 at 09:07, Szabo, Istvan (Agoda) wrote:
> Yes, it's enabled, just died again, this is in the log now:


We suspect that it has something to do with this backport, which was merged in 14.2.22:


https://tracker.ceph.com/issues/48713


It was intended to fix the clock skew issue, but in fact we had never seen that issue before 14.2.22.

Maybe it broke things.
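
To check whether the clock skew warning shows up shortly before each mgr termination, something like the following could be run against the mgr log. This is only a minimal sketch in Python: the function names are made up for illustration, and the sample lines are copied from the log excerpt quoted further down in this thread.

```python
import re
from datetime import datetime

# Sample lines copied from the log quoted in this thread.
# In practice, read the text of the ceph-mgr log file instead.
SAMPLE_LOG = """\
> 2022-01-13 13:15:59.330 7fe7e085e700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2022-01-13 12:15:59.330970)
> 2022-01-13 13:16:05.706 7fe7e2862700 -1 mgr handle_signal *** Got signal Terminated ***
> 2022-01-13 13:16:05.868 7f28ccc7fe40  0 set uid:gid to 167:167 (ceph:ceph)
"""

TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) ")


def scan(text):
    """Return (skew_warnings, terminations) as two lists of datetimes."""
    skew, terminated = [], []
    for raw in text.splitlines():
        line = raw.lstrip("> ")  # tolerate mail quoting prefixes
        match = TIMESTAMP.match(line)
        if not match:
            continue
        when = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
        if "_check_auth_rotating possible clock skew" in line:
            skew.append(when)
        elif "Got signal Terminated" in line:
            terminated.append(when)
    return skew, terminated


def gaps_before_termination(text):
    """Seconds between each termination and the last skew warning before it."""
    skew, terminated = scan(text)
    gaps = []
    for stop in terminated:
        earlier = [w for w in skew if w <= stop]
        if earlier:
            gaps.append((stop - max(earlier)).total_seconds())
    return gaps


if __name__ == "__main__":
    print(gaps_before_termination(SAMPLE_LOG))
```

A short gap on every restart would only be a hint that the two events are related, not proof of it.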


Our workaround (which has worked so far) is to disable the prometheus mgr module and use the DigitalOcean Ceph Exporter instead.

https://github.com/digitalocean/ceph_exporter
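
For reference, disabling the module comes down to the standard `ceph mgr module` commands. Below is a minimal sketch in Python that only prints them unless told otherwise; the wrapper function and its `dry_run` parameter are hypothetical, and setting up ceph_exporter itself is not covered here.

```python
import subprocess

# Commands for the workaround. "prometheus" is the mgr module named in this thread.
COMMANDS = [
    ["ceph", "mgr", "module", "ls"],                     # show enabled modules first
    ["ceph", "mgr", "module", "disable", "prometheus"],  # then disable the exporter module
]


def apply_workaround(dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    shown = []
    for cmd in COMMANDS:
        text = " ".join(cmd)
        shown.append(text)
        print(text)
        if not dry_run:
            # Needs a host with the ceph CLI and an admin keyring.
            subprocess.run(cmd, check=True)
    return shown


if __name__ == "__main__":
    apply_workaround(dry_run=True)
```

Running it with `dry_run=True` is safe anywhere, since nothing is executed.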


Best,

Peter


>
> 2022-01-13 13:15:59.330 7fe7e085e700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2022-01-13 12:15:59.330970)
> 2022-01-13 13:16:05.706 7fe7e2862700 -1 received  signal: Terminated from /usr/lib/systemd/systemd --switched-root --system --deserialize 22  (PID: 1) UID: 0
> 2022-01-13 13:16:05.706 7fe7e2862700 -1 mgr handle_signal *** Got signal Terminated ***
> 2022-01-13 13:16:05.868 7f28ccc7fe40  0 set uid:gid to 167:167 (ceph:ceph)
> 2022-01-13 13:16:05.868 7f28ccc7fe40  0 ceph version 14.2.22 (ca74598065096e6fcbd8433c8779a2be0c889351) nautilus (stable), process ceph-mgr, pid 1634471
> 2022-01-13 13:16:05.868 7f28ccc7fe40  0 pidfile_write: ignore empty --pid-file
> 2022-01-13 13:16:05.908 7f28ccc7fe40  1 mgr[py] Loading python module 'alerts'
> 2022-01-13 13:16:05.946 7f28ccc7fe40  1 mgr[py] Loading python module 'ansible'
> 2022-01-13 13:16:06.023 7f28ccc7fe40  1 mgr[py] Loading python module 'balancer'
> 2022-01-13 13:16:06.038 7f28ccc7fe40  1 mgr[py] Loading python module 'crash'
> 2022-01-13 13:16:06.063 7f28ccc7fe40  1 mgr[py] Loading python module 'dashboard'
> 2022-01-13 13:16:06.243 7f28ccc7fe40  1 mgr[py] Loading python module 'deepsea'
> 2022-01-13 13:16:06.319 7f28ccc7fe40  1 mgr[py] Loading python module 'devicehealth'
> 2022-01-13 13:16:06.336 7f28ccc7fe40  1 mgr[py] Loading python module 'influx'
> 2022-01-13 13:16:06.350 7f28ccc7fe40  1 mgr[py] Loading python module 'insights'
> 2022-01-13 13:16:06.364 7f28ccc7fe40  1 mgr[py] Loading python module 'iostat'
> 2022-01-13 13:16:06.378 7f28ccc7fe40  1 mgr[py] Loading python module 'localpool'
> 2022-01-13 13:16:06.391 7f28ccc7fe40  1 mgr[py] Loading python module 'orchestrator_cli'
> 2022-01-13 13:16:06.424 7f28ccc7fe40  1 mgr[py] Loading python module 'pg_autoscaler'
> 2022-01-13 13:16:06.468 7f28ccc7fe40  1 mgr[py] Loading python module 'progress'
> 2022-01-13 13:16:06.499 7f28ccc7fe40  1 mgr[py] Loading python module 'prometheus'
> 2022-01-13 13:16:06.598 7f28ccc7fe40  1 mgr[py] Loading python module 'rbd_support'
> 2022-01-13 13:16:06.634 7f28ccc7fe40  1 mgr[py] Loading python module 'restful'
> 2022-01-13 13:16:06.780 7f28ccc7fe40  1 mgr[py] Loading python module 'selftest'
> 2022-01-13 13:16:06.794 7f28ccc7fe40  1 mgr[py] Loading python module 'status'
> 2022-01-13 13:16:06.820 7f28ccc7fe40  1 mgr[py] Loading python module 'telegraf'
> 2022-01-13 13:16:06.843 7f28ccc7fe40  1 mgr[py] Loading python module 'telemetry'
> 2022-01-13 13:16:06.980 7f28ccc7fe40  1 mgr[py] Loading python module 'test_orchestrator'
> 2022-01-13 13:16:07.022 7f28ccc7fe40  1 mgr[py] Loading python module 'volumes'
> 2022-01-13 13:16:07.073 7f28ccc7fe40  1 mgr[py] Loading python module 'zabbix'
> 2022-01-13 13:16:07.091 7f28b9201700  1 mgr load Constructed class from module: dashboard
> 2022-01-13 13:16:07.091 7f28b9201700  1 mgr load Constructed class from module: prometheus
> 2022-01-13 13:16:07.092 7f28b8a00700  0 ms_deliver_dispatch: unhandled message 0x55af1ac5e800 mon_map magic: 0 v1 from mon.0 v2:10.121.58.220:3300/0
> 2022-01-13 13:16:07.093 7f28b8a00700  0 client.0 ms_handle_reset on v2:10.121.58.222:6800/1141825
> 2022-01-13 13:31:08.099 7f28b8a00700  0 client.0 ms_handle_reset on v2:10.121.58.222:6800/1141825
> 2022-01-13 13:46:08.104 7f28b8a00700  0 client.0 ms_handle_reset on v2:10.121.58.222:6800/1141825
> 2022-01-13 14:01:08.113 7f28b8a00700  0 client.0 ms_handle_reset on v2:10.121.58.222:6800/1141825
> 2022-01-13 14:16:08.119 7f28b8a00700  0 client.0 ms_handle_reset on v2:10.121.58.222:6800/1141825
> 2022-01-13 14:31:08.125 7f28b8a00700  0 client.0 ms_handle_reset on v2:10.121.58.222:6800/1141825
> 2022-01-13 14:46:08.132 7f28b8a00700  0 client.0 ms_handle_reset on v2:10.121.58.222:6800/1141825
> 2022-01-13 15:01:08.136 7f28b8a00700  0 client.0 ms_handle_reset on v2:10.121.58.222:6800/1141825
>
> Istvan Szabo
> Senior Infrastructure Engineer
> ---------------------------------------------------
> Agoda Services Co., Ltd.
> e: istvan.szabo@xxxxxxxxx
> ---------------------------------------------------
>
> -----Original Message-----
> From: Peter Lieven <pl@xxxxxxx>
> Sent: Thursday, January 13, 2022 2:54 PM
> To: Szabo, Istvan (Agoda) <Istvan.Szabo@xxxxxxxxx>; Ceph Users <ceph-users@xxxxxxx>
> Subject: Re:  14.2.22 dashboard periodically dies and didn't failover
>
>
> On 13.01.22 at 08:37, Szabo, Istvan (Agoda) wrote:
>> Hi,
>>
>> I can see a lot of messages regarding the rotating keys, but I'm not sure this is the root cause.
>>
>> 2022-01-13 03:21:57.156 7fe7e085e700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2022-01-13 02:21:57.156836)
>> 2022-01-13 03:22:01.484 7fe7e2862700 -1 received  signal: Hangup from killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw rbd-mirror  (PID: 1572574) UID: 0
>>
>> I have 3 mons with 3 mgrs, and the dashboard is installed on all mgrs.
>>
>> When the mgr dies on the first node, it doesn't fail over to the other 2; only a service restart resolves the issue.
>>
>> Any idea?
>
> We have seen a similar issue starting with 14.2.22, although our situation is slightly different: the mgr gets stuck and the cluster elects another mgr as primary, but the original primary does not recover. The process is stuck. I have a (large) backtrace if anyone is interested.
>
> For us it seems that the prometheus exporter module is the cause. Do you have it enabled?
>
>
> Peter
>
>
>

