On 2020-08-24 20:35, Mathijs Smit wrote:
> Hi everyone,
>
> I have a serious problem: my entire Ceph cluster is currently unable to provide service. Yesterday I added 10 OSDs, 2 per node. The rebalance started and took some I/O, but seemed to be doing its work. This morning the cluster was still processing the rebalance and taking so much I/O that nearly all OSDs were marked with "slow ops", and from there everything went wrong. To free up as much I/O as possible for the rebalance I stopped all the clients and waited for the rebalance to finish. After it finished, the cluster remained extremely slow and unusable. While trying to debug I restarted several services and nodes to find the problem. Now the cluster has entered a state where multiple OSDs remain slow, various OSDs show a "BADAUTHORIZER" message, and the mgr on all nodes also reports "verify_authorizer" issues.
>
> I verified the clocks on all servers; they are synced to the same NTP service and look fine.

Do the ceph monitors warn about a clock skew? Do your monitors log anything?

> Please, please, please advise; 13 straight hours of debugging got me nowhere.

Can you paste "ceph -s"?

There are similar reports from users on this list where a misbehaving monitor was the cause (PG stuck peering - OSD cephx: verify_authorizer key problem), and an issue in Ceph Nautilus which has since been fixed (BADAUTHORIZER in Nautilus). Messenger v2 was also disabled. There have been fixes in both the messenger v1 and v2 code.

Do you have any specific (non-default) ceph auth settings? Can you check whether the keys of the (new) OSD daemons match the keys stored in Ceph (ceph auth list, ceph auth export client.$id)? A rough way to compare them is sketched below.

Gr. Stefan
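
For example, on one of the nodes hosting the new OSDs, something along these lines should show whether the daemon's keyring matches what the cluster has on record, and whether the mons see a clock problem. This is only a sketch: osd.12 is a made-up id and the keyring path assumes the default /var/lib/ceph layout, so adjust both to your deployment.

  # key the cluster has on record for the OSD (osd.12 is an example id)
  ceph auth get osd.12

  # key the OSD daemon itself is using (default data-dir path; adjust if you deploy differently)
  cat /var/lib/ceph/osd/ceph-12/keyring

  # monitors report clock trouble explicitly
  ceph health detail | grep -i clock
  ceph time-sync-status

If the two keys differ for any of the new OSDs, that would explain the BADAUTHORIZER messages.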