Upgrade from Luminous to Nautilus: now one MDS reports "could not get service secret"

Robert LeBlanc <robert@xxxxxxxxxxxxx> · Sun, 28 Mar 2021 12:17:07 -0600

We just upgraded our cluster from Luminous to Nautilus, and after a few
days one of our MDS servers started logging the following:

2021-03-28 18:06:32.304 7f57c37ff700  5 mds.beacon.sun-gcs01-mds02 Sending beacon up:standby seq 16
2021-03-28 18:06:32.304 7f57c37ff700 20 mds.beacon.sun-gcs01-mds02 sender thread waiting interval 4s
2021-03-28 18:06:32.308 7f57c8809700  5 mds.beacon.sun-gcs01-mds02 received beacon reply up:standby seq 16 rtt 0.00400001
2021-03-28 18:06:36.308 7f57c37ff700  5 mds.beacon.sun-gcs01-mds02 Sending beacon up:standby seq 17
2021-03-28 18:06:36.308 7f57c37ff700 20 mds.beacon.sun-gcs01-mds02 sender thread waiting interval 4s
2021-03-28 18:06:36.308 7f57c8809700  5 mds.beacon.sun-gcs01-mds02 received beacon reply up:standby seq 17 rtt 0
2021-03-28 18:06:37.788 7f57c900a700  0 auth: could not find secret_id=34586
2021-03-28 18:06:37.788 7f57c900a700  0 cephx: verify_authorizer could not get service secret for service mds secret_id=34586
2021-03-28 18:06:37.788 7f57c6004700  5 mds.sun-gcs01-mds02 ms_handle_reset on v2:10.65.101.13:46566/0
2021-03-28 18:06:40.308 7f57c37ff700  5 mds.beacon.sun-gcs01-mds02 Sending beacon up:standby seq 18
2021-03-28 18:06:40.308 7f57c37ff700 20 mds.beacon.sun-gcs01-mds02 sender thread waiting interval 4s
2021-03-28 18:06:40.308 7f57c8809700  5 mds.beacon.sun-gcs01-mds02 received beacon reply up:standby seq 18 rtt 0
2021-03-28 18:06:44.304 7f57c37ff700  5 mds.beacon.sun-gcs01-mds02 Sending beacon up:standby seq 19
2021-03-28 18:06:44.304 7f57c37ff700 20 mds.beacon.sun-gcs01-mds02 sender thread waiting interval 4s
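To pull the cephx failures out of a longer MDS log, a small filter like the one below may help. It is a minimal sketch that assumes one log entry per line, as the daemon writes them (entries wrapped by a mail client must be rejoined first), and it pairs each "could not find secret_id" line with the next ms_handle_reset line. That pairing is a heuristic based only on the ordering in this excerpt, not a documented Ceph behavior; the function name summarize_auth_failures is hypothetical.

```python
import re

# Every entry in the excerpt starts with: timestamp, thread id, debug level,
# e.g. "2021-03-28 18:06:37.788 7f57c900a700  0 ".
PREFIX = r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) \S+\s+\d+ "

# "auth: could not find secret_id=34586"
MISSING_SECRET = re.compile(PREFIX + r"auth: could not find secret_id=(?P<sid>\d+)")
# "mds.<name> ms_handle_reset on v2:10.65.101.13:46566/0"
HANDLE_RESET = re.compile(PREFIX + r"mds\.\S+ ms_handle_reset on (?P<peer>\S+)")


def summarize_auth_failures(lines):
    """Return (timestamp, secret_id, peer) for each missing-secret event.

    peer comes from the next ms_handle_reset line (heuristic pairing);
    it is None if no reset follows before the next failure or end of input.
    """
    events = []
    pending = None
    for line in lines:
        m = MISSING_SECRET.match(line)
        if m:
            if pending is not None:
                events.append(pending)  # earlier failure never saw a reset
            pending = (m["ts"], int(m["sid"]), None)
            continue
        m = HANDLE_RESET.match(line)
        if m and pending is not None:
            events.append((pending[0], pending[1], m["peer"]))
            pending = None
    if pending is not None:
        events.append(pending)
    return events


# Entries copied from the excerpt, rejoined onto single lines.
sample = [
    "2021-03-28 18:06:36.308 7f57c8809700  5 mds.beacon.sun-gcs01-mds02 received beacon reply up:standby seq 17 rtt 0",
    "2021-03-28 18:06:37.788 7f57c900a700  0 auth: could not find secret_id=34586",
    "2021-03-28 18:06:37.788 7f57c900a700  0 cephx: verify_authorizer could not get service secret for service mds secret_id=34586",
    "2021-03-28 18:06:37.788 7f57c6004700  5 mds.sun-gcs01-mds02 ms_handle_reset on v2:10.65.101.13:46566/0",
]

for ts, sid, peer in summarize_auth_failures(sample):
    print(ts, sid, peer)
```

Run against a full log file (for example `summarize_auth_failures(open(path))`), this shows whether the failing secret_id stays fixed or advances over time, and which peer addresses keep presenting it.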

I've tried removing the /var/lib/ceph/mds/ directory and fetching the
key again. I've also removed the key and generated a new one, and I've
checked the clocks between all the nodes. As far as I can tell,
everything is in order.
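Because cephx tickets are time-limited, clock agreement is worth checking numerically rather than by eye. The sketch below assumes you already have each node's offset (in seconds) against one shared reference, for example from the NTP or chrony client on each host; the function clock_spread and the node names other than sun-gcs01-mds02 are hypothetical. The 0.05 s default tolerance is borrowed from what we understand to be the default of Ceph's `mon_clock_drift_allowed` (a monitor-to-monitor check), so treat it as an assumption and adjust it.

```python
def clock_spread(offsets, tolerance=0.05):
    """Summarize how far apart node clocks are.

    offsets: node name -> clock offset in seconds, measured against one
    shared reference. Returns (fastest node, slowest node, spread in
    seconds, whether the spread is within tolerance).
    """
    fastest = max(offsets, key=offsets.get)
    slowest = min(offsets, key=offsets.get)
    spread = offsets[fastest] - offsets[slowest]
    return fastest, slowest, spread, spread <= tolerance


# Hypothetical readings; only sun-gcs01-mds02 is a host name from this post.
readings = {"mon-a": 0.001, "mon-b": -0.002, "sun-gcs01-mds02": 0.010}
print(clock_spread(readings))
```

Using max minus min gives the worst pairwise difference directly, so there is no need to compare every pair of nodes.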

We did have an issue where the monitor cluster fell over and would not
boot. We reduced the monitors to a single monitor, disabled cephx,
pulled it off the network, and restarted the service a few times, which
allowed it to come up. We then expanded back to three mons and
re-enabled cephx, and everything had been fine until this. No other
services seem to be affected, and the MDS even appears to work normally
despite these messages. We would like to figure out how to resolve this.
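One way to read the two cephx lines in the log is as a lookup miss: the client's ticket names the secret_id of the rotating service secret it was sealed with, and the MDS has no secret under that id. The toy model below illustrates only that lookup. The fixed window of three consecutive ids (think previous/current/next), the class RotatingSecretCache, and the function check_ticket are all assumptions for illustration, not Ceph's actual implementation.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class RotatingSecretCache:
    """Toy model of the rotating service secrets a daemon holds.

    Assumption for illustration only: the daemon keeps `window` consecutive
    secret_ids and knows nothing about any other id.
    """

    window: int = 3
    secrets: dict[int, str] = field(default_factory=dict)

    def install(self, first_id: int) -> None:
        # Replace the cache with `window` consecutive ids starting at
        # first_id; the key material is just a placeholder string.
        self.secrets = {
            sid: f"placeholder-key-{sid}"
            for sid in range(first_id, first_id + self.window)
        }

    def get_service_secret(self, secret_id: int) -> str | None:
        return self.secrets.get(secret_id)


def check_ticket(cache: RotatingSecretCache, service: str, secret_id: int) -> str:
    """Hypothetical stand-in for the authorizer check: a ticket sealed with
    a secret_id the daemon does not hold cannot be verified."""
    if cache.get_service_secret(secret_id) is None:
        return (f"could not get service secret for service {service} "
                f"secret_id={secret_id}")
    return "ok"


cache = RotatingSecretCache()
cache.install(34583)  # holds 34583, 34584, 34585
print(check_ticket(cache, "mds", 34586))  # id outside the window: failure
cache.install(34585)  # holds 34585, 34586, 34587
print(check_ticket(cache, "mds", 34586))  # id now held: ok
```

In this model the failure depends only on whether the daemon's set and the client's ticket agree on the id, which is why a single daemon can log it while everything else looks healthy.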

Thank you,
Robert LeBlanc

----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx