Hi,
yesterday my cluster reported slow requests for several minutes, and after I restarted the OSDs that were reporting them, it got stuck with peering PGs. The whole
cluster became unresponsive and IO stopped.
I also noticed that the problem was with cephx - all OSDs were reporting the same error (even with the same secret_id number):
cephx: verify_authorizer could not get service secret for service osd secret_id=14086
...... conn(0x559e15a50000 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg: got bad authorizer
auth: could not find secret_id=14086
My questions are:
Why did that happen?
Can I prevent the cluster from stopping like this (with cephx enabled)?
How often are keys rotated/expired, and can I check for problems with that somehow?
I'm running NTP on the nodes (including the ceph monitors), so time should not be the issue. I noticed that some monitor nodes have no timezone set,
but I hope the MONs use UTC when distributing keys to clients. Or can a different timezone between a MON and an OSD cause this problem?
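In case it helps, is something like the following enough to rule clock issues out? Just a sketch of what I had in mind - mon.foo is a placeholder for one of my monitor IDs, and I'm not certain these are exactly the right checks:

# on each node: confirm NTP sync state and the configured timezone
timedatectl status

# ask the monitors whether they see skew among themselves
ceph time-sync-status

# the threshold above which the cluster warns about mon clock drift
ceph daemon mon.foo config get mon_clock_drift_allowed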
I "fixed" the problem by restarting monitors.
This is the second time it has happened in the last 3 months, so I'm reporting it as an issue that can occur.
I also noticed in all OSD logs:
2019-04-25 10:06:55.652239 7faf00096700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before
2019-04-25 09:06:55.652222)
approximately 7 hours before the problem occurred. I can see that it is related to the issue. But why 7 hours? Is there some timeout or grace
period during which old keys can still be used before they are invalidated?
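Is auth_service_ticket_ttl (and auth_mon_ticket_ttl) the knob that controls this? I was planning to check the values like this - just a sketch, osd.0 and mon.foo are placeholders and I'm not certain these are the right options:

# what TTL the daemons think the rotating service tickets have
ceph daemon osd.0 config get auth_service_ticket_ttl
ceph daemon mon.foo config get auth_mon_ticket_ttl

# look for earlier occurrences of the rotating-key warning on the OSD hosts
grep _check_auth_rotating /var/log/ceph/ceph-osd.*.log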
Thank you
With regards
Jan Pekar
--
============
Ing. Jan Pekař
jan.pekar@xxxxxxxxx
----
Imatic | Jagellonská 14 | Praha 3 | 130 00
http://www.imatic.cz
============