ceph-mgr crashing due to cephx issues

Wido den Hollander <wido@xxxxxxxx> · Fri, 23 Mar 2018 10:26:14 +0100

Hi,

On a few clusters I've seen this happen randomly, and I haven't been able
to reproduce it or trace where it comes from.

Luminous clusters ranging from 12.2.1 to 12.2.4 have issues where MGRs
go down with these messages in their logs:

Mar 23 09:18:22 mon01 ceph-mgr[2324150]: 2018-03-23 09:18:22.451311
7fb9e8ac7700 -1 monclient: _check_auth_rotating possible clock skew,
rotating keys expired way too early (before 2018-03-23 08:18:22.451287)
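
To see how far the reported cutoff lies from the time the message was
logged, a small sketch like the one below could pull both timestamps out of
such a line and subtract them. The regular expression and the function name
rotating_key_gap are illustrative assumptions based only on the single line
quoted above, not on a documented log format.

```python
import re
from datetime import datetime

# Sample line copied from the log above, with the mail line wrapping removed.
LINE = (
    "Mar 23 09:18:22 mon01 ceph-mgr[2324150]: 2018-03-23 09:18:22.451311 "
    "7fb9e8ac7700 -1 monclient: _check_auth_rotating possible clock skew, "
    "rotating keys expired way too early (before 2018-03-23 08:18:22.451287)"
)

# Timestamp layout as it appears in the sample line (assumption: other
# occurrences of the message use the same layout).
TS = r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+"
PATTERN = re.compile(
    r"(?P<logged>" + TS + r") \S+ +-1 monclient: _check_auth_rotating .*"
    r"\(before (?P<cutoff>" + TS + r")\)"
)


def rotating_key_gap(line):
    """Return (logged, cutoff, logged - cutoff), or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    fmt = "%Y-%m-%d %H:%M:%S.%f"
    logged = datetime.strptime(match.group("logged"), fmt)
    cutoff = datetime.strptime(match.group("cutoff"), fmt)
    return logged, cutoff, logged - cutoff


if __name__ == "__main__":
    logged, cutoff, gap = rotating_key_gap(LINE)
    print("logged:", logged)
    print("cutoff:", cutoff)
    print("gap   :", gap)
```

For the line above, the cutoff is one hour (plus 24 microseconds) before the
time the message was logged.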

The first thing you check is time, but in all cases where I've seen
this happen the time is in sync on all systems. The clusters report
HEALTH_OK and nothing is going on.

As this happens randomly, I have no idea where to start debugging it,
nor do I have any clue how it might happen.

Starting the mgr again resolves the issue. It then keeps functioning
fine, but might go down again after 24 to 48 hours.

The clusters where I've seen this happen were running CentOS 7 or Ubuntu
16.04. I can't pinpoint it to a specific distro or version.

Searching, I found a tracker issue and a mailing list post with the same
messages, but neither of them is recent.

- http://tracker.ceph.com/issues/17170
- http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-October/021707.html

Any ideas on where to start debugging this?

Wido