On 03/23/2018 10:26 AM, Wido den Hollander wrote:
> Hi,
>
> On a few clusters I've seen this happen randomly and I haven't been able
> to reproduce it or trace back where it came from.
>
> Luminous clusters ranging from 12.2.1 to 12.2.4 have issues where MGRs
> go down with these messages in their logs:
>
> Mar 23 09:18:22 mon01 ceph-mgr[2324150]: 2018-03-23 09:18:22.451311
> 7fb9e8ac7700 -1 monclient: _check_auth_rotating possible clock skew,
> rotating keys expired way too early (before 2018-03-23 08:18:22.451287)
>
> The first thing you check is time. But in all cases where I've seen
> this happen the time is in sync on all systems. Health of the clusters
> is HEALTH_OK and nothing is going on.
>
> As this happens randomly I have no idea where to start debugging it,
> nor do I have any clue how this might happen.
>
> Starting the mgr afterwards resolves the issue. It keeps functioning
> fine and might go down again after 24 to 48 hours.
>
> The clusters where I've seen this happen were running CentOS 7 or Ubuntu
> 16.04. I can't pinpoint it to a specific distro or version.
>
> Searching, I found some tracker issues with the same messages, but none
> of them were recent.
>
> - http://tracker.ceph.com/issues/17170
> -
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-October/021707.html
>
> Any ideas on where to start debugging this?

I saw this happen again on a cluster today.

Created a ticket for this: http://tracker.ceph.com/issues/23460

Wido

>
> Wido
>
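For anyone collecting occurrences of this message across clusters, a minimal Python sketch that pulls the two timestamps out of the log line quoted above and reports the gap between them may help when comparing cases. The regular expression is an assumption derived from that single sample line, and the names `PATTERN`, `SAMPLE` and `parse_skew_line` are hypothetical; nothing here is part of Ceph itself.

```python
import re
from datetime import datetime

# Assumption: the pattern is modelled on the one sample line in this thread
# (with the mail line wrapping undone); other Ceph versions may format it differently.
PATTERN = re.compile(
    r"(?P<logged>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"\S+\s+-1 monclient: _check_auth_rotating possible clock skew, "
    r"rotating keys expired way too early "
    r"\(before (?P<cutoff>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6})\)"
)
FMT = "%Y-%m-%d %H:%M:%S.%f"

# The log line from the report, joined back into a single line.
SAMPLE = (
    "Mar 23 09:18:22 mon01 ceph-mgr[2324150]: 2018-03-23 09:18:22.451311 "
    "7fb9e8ac7700 -1 monclient: _check_auth_rotating possible clock skew, "
    "rotating keys expired way too early (before 2018-03-23 08:18:22.451287)"
)


def parse_skew_line(line):
    """Return (logged, cutoff, logged - cutoff), or None if the line does not match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    logged = datetime.strptime(m.group("logged"), FMT)
    cutoff = datetime.strptime(m.group("cutoff"), FMT)
    return logged, cutoff, logged - cutoff


if __name__ == "__main__":
    logged, cutoff, gap = parse_skew_line(SAMPLE)
    print(gap)  # prints 1:00:00.000024
```

For the sample line, the cutoff in the message sits almost exactly one hour before the time the line was logged. Running the same parse over the logs of other affected mgrs would show whether that gap is constant across occurrences.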