Hi all,

my hopes are down again. The MDS might look busy, but I'm not sure it's doing anything interesting. I now see a lot of these in the log (heartbeat messages stripped):

2025-01-11T12:35:50.712+0100 7ff888375700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2025-01-11T11:35:50.713867+0100)
2025-01-11T12:35:51.712+0100 7ff888375700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2025-01-11T11:35:51.714027+0100)
2025-01-11T12:35:52.712+0100 7ff888375700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2025-01-11T11:35:52.714335+0100)
2025-01-11T12:35:53.084+0100 7ff88cb7e700 0 auth: could not find secret_id=51092
2025-01-11T12:35:53.084+0100 7ff88cb7e700 0 cephx: verify_authorizer could not get service secret for service mds secret_id=51092
2025-01-11T12:35:53.353+0100 7ff88cb7e700 0 auth: could not find secret_id=51092
2025-01-11T12:35:53.353+0100 7ff88cb7e700 0 cephx: verify_authorizer could not get service secret for service mds secret_id=51092
2025-01-11T12:35:53.536+0100 7ff88cb7e700 0 auth: could not find secret_id=51092
2025-01-11T12:35:53.536+0100 7ff88cb7e700 0 cephx: verify_authorizer could not get service secret for service mds secret_id=51092
2025-01-11T12:35:53.573+0100 7ff88cb7e700 0 auth: could not find secret_id=51092
2025-01-11T12:35:53.573+0100 7ff88cb7e700 0 cephx: verify_authorizer could not get service secret for service mds secret_id=51092

Looks like the auth key for the MDS expired and cannot be renewed. Is there a grace period for that as well? (A sketch of the checks I plan to run is appended at the very end of this mail.)

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Frank Schilder <frans@xxxxxx>
Sent: Saturday, January 11, 2025 11:41 AM
To: Dan van der Ster
Cc: Bailey Allison; ceph-users@xxxxxxx
Subject: Re: Help needed, ceph fs down due to large stray dir

Hi all,

new update: after getting some sleep I see that, since the final MDS restart, the MDS is doing something! It is still unresponsive, but it shows a CPU load of between 150 and 200%, and I really, really hope that this is the trimming of stray items. I will try to find out if I can get perf to work inside the container.

For now, to facilitate troubleshooting, I will add a swap disk to every MDS host, just to be on the safe side if stuff fails over.

Just to get my hopes back: can someone (from the dev team) let me know whether it is expected that an MDS is unresponsive during stray evaluation?

Thanks and best regards!
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
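
PS: for reference, here is a minimal sketch of the checks I plan to run on the MDS host. The daemon name and the swap device are placeholders for our setup, the TTL default is from memory, and I'm not sure the admin socket will still answer while the MDS is this busy (assuming I can reach it from inside the container):

  # NTP/chrony state on the MDS host itself
  chronyc tracking
  # clock skew as seen between the monitors
  ceph time-sync-status

  # how long rotating cephx service tickets are valid (default should be 3600s)
  ceph config get mds auth_service_ticket_ttl

  # are the stray counters actually moving? run twice and compare
  ceph daemon mds.<name> perf dump mds_cache | grep -i stray

  # emergency swap on a spare disk, in case a failover blows up MDS memory
  mkswap /dev/sdX && swapon /dev/sdX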