Re: Module 'telemetry' has experienced an error

Sasha Litvak <alexander.v.litvak@xxxxxxxxx> · Fri, 21 Feb 2020 08:27:00 -0600

Thore,

Thank you for your reply.
Unless the issue was specifically with the Ceph telemetry server or its
subnet, we had no network issues at that time; at least none were reported by
monitoring or customers.  It is very odd, unless the telemetry module has a
bug of some kind and hangs on its own.  But this is the first time it has
happened, and the cluster has been up for a year.
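
If it happens again, a quick TCP probe from the active mgr host might help
tell a real routing problem apart from a module bug. What follows is only a
minimal sketch, not anything taken from the telemetry module itself. It
assumes the module reports to telemetry.ceph.com on port 443 (please check
the URL configured on your own cluster), and the helper names probe() and
describe() are made up for illustration. The 101 in the health message is
the Linux errno value for ENETUNREACH, which is what the sketch checks for.

```python
import errno
import socket


def describe(exc):
    """Turn an OSError from a failed connect into a short description."""
    # gaierror and timeout are both OSError subclasses, so test them first.
    if isinstance(exc, socket.gaierror):
        return "name resolution failed"
    if isinstance(exc, socket.timeout):
        return "timed out"
    if exc.errno == errno.ENETUNREACH:
        # Same condition as error(101, 'Network is unreachable') on Linux.
        return "network unreachable"
    if exc.errno == errno.ECONNREFUSED:
        return "connection refused"
    return "other error: {}".format(exc)


def probe(host, port=443, timeout=5.0):
    """Try a plain TCP connect and report 'ok' or why it failed."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except OSError as exc:
        return describe(exc)


if __name__ == "__main__":
    # Assumed endpoint; replace with the URL host your cluster actually uses.
    print(probe("telemetry.ceph.com"))
```

Running this from cron on the mgr hosts and logging the result would at
least show whether the host itself lost its route at the time of the error.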

On Fri, Feb 21, 2020 at 3:49 AM Thore Krüss <thore@xxxxxxxxxx> wrote:

> On Fri, Feb 21, 2020 at 05:28:12AM -0000, alexander.v.litvak@xxxxxxxxx
> wrote:
> > This evening I was awakened by an error message
> >
> >  cluster:
> >     id:     9b4468b7-5bf2-4964-8aec-4b2f4bee87ad
> >     health: HEALTH_ERR
> >             Module 'telemetry' has failed: ('Connection aborted.', error(101, 'Network is unreachable'))
> >
> >   services:
> >
> > I have not seen any other problems with anything else on the cluster.  I
> > disabled and enabled the telemetry module and health returned to OK
> > status.  Any ideas on what could cause the issue?  As far as I understand,
> > telemetry is a module that sends messages to an external ceph server
> > outside of the network.
>
> Maybe an uplink issue? We saw similar behaviour when we had some trouble
> with a core router.
>
> You were able to disable and enable the module? This failed for me
> (Nautilus), with the reason given that the module had failed. Restarting
> all mgrs did help.
>
> Still - I'm not sure why this is considered to be a health_err.
>
> Best regards
> Thore
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>