Thore, Thank you for your reply. Unless the issue was specifically with a ceph telemetry server or their subnet we had no network issues at that time, at least none was reported by monitoring or customers. It is very weird unless the telemetry module may have a bug of some kind and hangs on its own. But that is the first time it happened and the cluster has been up for a year. On Fri, Feb 21, 2020 at 3:49 AM Thore Krüss <thore@xxxxxxxxxx> wrote: > On Fri, Feb 21, 2020 at 05:28:12AM -0000, alexander.v.litvak@xxxxxxxxx > wrote: > > This evening I was awakened by an error message > > > > cluster: > > id: 9b4468b7-5bf2-4964-8aec-4b2f4bee87ad > > health: HEALTH_ERR > > Module 'telemetry' has failed: ('Connection aborted.', > error(101, 'Network is unreachable')) > > > > services: > > > > I have not seen any other problems with anything else on the cluster. I > disabled and enabled the telemetry module and health returned to OK > status. Any ideas on what could cause the issue? As far as I understand, > telemetry is a module that sends messages to an external ceph server > outside of the network. > > Maybe an uplink issue? We had similar behaviour as we had some trouble > with a > core router. > > You have been able to disable and enable the module? This failed for me > with the > reason that the module had failed (Nautilus). Restarting all mgrs did help. > > Still - I'm not sure why this is considered to be a health_err. > > Best regards > Thore > _______________________________________________ > ceph-users mailing list -- ceph-users@xxxxxxx > To unsubscribe send an email to ceph-users-leave@xxxxxxx > _______________________________________________ ceph-users mailing list -- ceph-users@xxxxxxx To unsubscribe send an email to ceph-users-leave@xxxxxxx