Re: ceph-mgr crashing due to cephx issues

Wido den Hollander <wido@xxxxxxxx> · Mon, 26 Mar 2018 11:10:00 +0200

On 03/23/2018 10:26 AM, Wido den Hollander wrote:
> Hi,
> 
> On a few clusters I've seen this happen randomly, and I haven't been able
> to reproduce it or trace where it comes from.
> 
> Luminous clusters ranging from 12.2.1 to 12.2.4 have issues where MGRs
> go down with these messages in their logs:
> 
> Mar 23 09:18:22 mon01 ceph-mgr[2324150]: 2018-03-23 09:18:22.451311
> 7fb9e8ac7700 -1 monclient: _check_auth_rotating possible clock skew,
> rotating keys expired way too early (before 2018-03-23 08:18:22.451287)
> 
> The first thing you check is time. But in all cases where I've seen
> this happen, the time is in sync on all systems. The clusters report
> HEALTH_OK and nothing else is going on.
> 
> As this happens randomly, I have no idea where to start debugging it,
> nor do I have any clue how this might happen.
> 
> Starting the mgr afterwards resolves the issues. It keeps functioning
> fine and might go down again after 24 to 48 hours.
> 
> The clusters where I've seen this happen were running CentOS 7 or Ubuntu
> 16.04. I can't pinpoint it to a specific distro or version.
> 
> Searching, I found a tracker issue and a mailing list thread with the
> same messages, but neither of them is recent.
> 
> - http://tracker.ceph.com/issues/17170
> - http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-October/021707.html
> 
> Any ideas on where to start debugging this?

I saw this happen again on a cluster today. Created a ticket for this:
http://tracker.ceph.com/issues/23460
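
In case it helps anyone collecting data points for that ticket, here is a
rough sketch of how the warning could be pulled out of a log and the gap
between the log timestamp and the "before" cutoff computed. This is not
anything from Ceph itself: the helper name rotating_key_gap and the regular
expression are my own assumptions, written against the single log line
quoted above, and other releases may format the message differently.

```python
import re
from datetime import datetime

# Hypothetical helper, not part of Ceph. The pattern is an assumption based
# only on the one monclient warning quoted in this thread.
PATTERN = re.compile(
    r"(?P<logged>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"\S+\s+-1 monclient: _check_auth_rotating possible clock skew, "
    r"rotating keys expired way too early "
    r"\(before (?P<cutoff>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\)"
)
TIME_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def rotating_key_gap(text):
    """Return (logged, cutoff, gap_seconds) for a matching warning, else None."""
    # The message may be wrapped over several lines (as it is in this mail),
    # so collapse all whitespace runs to single spaces before matching.
    match = PATTERN.search(" ".join(text.split()))
    if match is None:
        return None
    logged = datetime.strptime(match.group("logged"), TIME_FORMAT)
    cutoff = datetime.strptime(match.group("cutoff"), TIME_FORMAT)
    return logged, cutoff, (logged - cutoff).total_seconds()
```

For the line quoted above, the gap comes out at one hour (3600 seconds, plus
a few microseconds). Running the same extraction over the logs of each
affected mgr and comparing the timestamps against the mon logs might show
whether the crashes line up with anything, but that is only a suggestion for
gathering data, not a diagnosis.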

Wido

> 
> Wido
> 