Today the problem reappeared.
Restarting the mons helps, but it does not solve the underlying issue.
Is there any way to debug this? Can I dump these keys from the MON, the OSDs, or other components? Can I debug the key exchange?
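For example, would raising the auth-related debug levels be the right approach? Something like this (command and option names from memory, osd.0 is just an example id, and ceph daemon has to run on the OSD's host):

   # raise cephx / monclient logging on one OSD to see the ticket exchange in its log
   ceph tell osd.0 injectargs '--debug_auth 20 --debug_monc 20 --debug_ms 1'
   # show the rotating service-key lifetime the daemon is actually using (default 3600 s)
   ceph daemon osd.0 config get auth_service_ticket_ttl
   # check clock agreement between the monitors
   ceph time-sync-status

Or is there a way to dump the rotating secrets / secret_ids themselves?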
Thank you
On 27/04/2019 10.56, Jan Pekař - Imatic wrote:
On 26/04/2019 21.50, Gregory Farnum wrote:
On Fri, Apr 26, 2019 at 10:55 AM Jan Pekař - Imatic <jan.pekar@xxxxxxxxx> wrote:
Hi,
yesterday my cluster reported slow requests for several minutes, and after I restarted the OSDs that were reporting them, it got stuck with
peering PGs. The whole cluster stopped responding and IO stopped.
I also noticed that the problem was with cephx - all OSDs were reporting the same error (even the same secret_id number):
cephx: verify_authorizer could not get service secret for service osd secret_id=14086
...... conn(0x559e15a50000 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg: got bad authorizer
auth: could not find secret_id=14086
My questions are:
Why did that happen?
Can I prevent the cluster from stopping (with cephx enabled)?
How quickly do the keys rotate/expire, and can I check for problems with that somehow?
I'm running NTP on all nodes (including the ceph monitors), so time should not be the issue. I noticed that some monitor nodes have no timezone set,
but I hope the MONs use UTC when distributing keys to clients. Or can a different timezone between MON and OSD cause the problem?
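(For completeness, this is roughly how I check the clocks on the nodes - example commands:

   # compare local time, timezone and NTP sync state on each node
   timedatectl
   # check NTP peers and offsets
   ntpq -p

so the clocks themselves appear to be in sync.)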
Hmm yeah, it's probably not using UTC. (Despite it being good
practice, it's actually not an easy default to adhere to.) cephx
requires synchronized clocks and probably the same timezone (though I
can't swear to that.)
I "fixed" the problem by restarting monitors.
It happened for the second time during last 3 months, so I'm reporting it as issue, that can happen.
I also noticed in all OSDs logs
2019-04-25 10:06:55.652239 7faf00096700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before
2019-04-25 09:06:55.652222)
approximately 7 hours before the problem occurred. I can see that it is related to the issue. But why 7 hours? Is there some timeout or grace
period for using old keys before they are invalidated?
7 hours shouldn't be directly related. IIRC by default a new rotating
key is issued every hour; the monitor gives out the current and next key on
request, and daemons accept keys within a half-hour offset of what
they believe the current time to be. Something like that.
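If you want to see what your cluster is actually running with, something like this should show the relevant knobs (a sketch, option names and defaults from memory; mon.a is just an example id):

   # lifetime of the rotating service keys handed to OSDs/MDSs (default 3600 s = 1 hour)
   ceph daemon mon.a config get auth_service_ticket_ttl
   # lifetime of client<->mon tickets (default 43200 s = 12 hours)
   ceph daemon mon.a config get auth_mon_ticket_ttl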
-Greg
If it works the way you describe, UTC is not the problem. I have been running the cluster with this configuration for over a year and there have
been only 2-3 incidents of this kind.
I changed the timezone, restarted the services, and will wait...
I forgot to mention that I'm running luminous.
JP
Thank you
With regards
Jan Pekar
--
============
Ing. Jan Pekař
jan.pekar@xxxxxxxxx
----
Imatic | Jagellonská 14 | Praha 3 | 130 00
http://www.imatic.cz
============
--
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com