Hi,
yesterday my cluster reported slow requests for several minutes, and after I restarted the OSDs that were reporting them, it got stuck with peering PGs. The whole
cluster became unresponsive and IO stopped.
I also noticed that the problem was with cephx - all OSDs were reporting the same error (even with the same secret_id number):
cephx: verify_authorizer could not get service secret for service osd secret_id=14086
...... conn(0x559e15a50000 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg: got bad authorizer
auth: could not find secret_id=14086
My questions are:
Why did that happen?
Can I prevent the cluster from stopping like this (with cephx enabled)?
How often are keys rotated/expired, and can I check for problems with that somehow?
I'm running NTP on the nodes (including the ceph monitors), so time should not be the issue. I noticed that some monitor nodes have no timezone set,
but I hope the MONs use UTC when distributing keys to clients. Or can a different timezone between a MON and an OSD cause this problem?
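In case it helps, is something like the following enough to rule clock issues out? Just a sketch of what I had in mind - mon.foo is a placeholder for one of my monitor IDs, and I'm not certain these are exactly the right checks:

# on each node: confirm NTP sync state and the configured timezone
timedatectl status

# ask the monitors whether they see skew among themselves
ceph time-sync-status

# the threshold above which the cluster warns about mon clock drift
ceph daemon mon.foo config get mon_clock_drift_allowed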
I "fixed" the problem by restarting monitors.
This is the second time it has happened in the last 3 months, so I'm reporting it as an issue that can occur.
I also noticed in all OSD logs:
2019-04-25 10:06:55.652239 7faf00096700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before
2019-04-25 09:06:55.652222)
approximately 7 hours before the problem occurred. I can see that it is related to the issue. But why 7 hours? Is there some timeout or grace
period during which old keys can still be used before they are invalidated?
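Is auth_service_ticket_ttl (and auth_mon_ticket_ttl) the knob that controls this? I was planning to check the values like this - just a sketch, osd.0 and mon.foo are placeholders and I'm not certain these are the right options:

# what TTL the daemons think the rotating service tickets have
ceph daemon osd.0 config get auth_service_ticket_ttl
ceph daemon mon.foo config get auth_mon_ticket_ttl

# look for earlier occurrences of the rotating-key warning on the OSD hosts
grep _check_auth_rotating /var/log/ceph/ceph-osd.*.log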
Thank you
With regards
Jan Pekar
--
============
Ing. Jan Pekař
jan.pekar@xxxxxxxxx
----
Imatic | Jagellonská 14 | Praha 3 | 130 00
http://www.imatic.cz
============