Hi Eugen, thanks for your reply! It's a long shot, but worth trying. I will give it a go.

Since you are following: I also observed cephx timeouts. I'm considering increasing the TTL
for auth tickets. Do you think auth_service_ticket_ttl (default 3600) is the right parameter?
If so, can I just change it up and down, or can this have side effects, like all daemons
losing their tickets when the TTL is decreased from a very large value? I mainly need the
auth ticket for the MDS to remain valid for a long time.

I will wait a bit in case you manage to reply. I would like to adjust both values in the
next attempt. Otherwise, I will just increase the beacon down grace.

Thanks a lot again and have a nice Sunday!
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Eugen Block <eblock@xxxxxx>
Sent: Saturday, January 11, 2025 7:59 PM
To: ceph-users@xxxxxxx
Subject: Re: Help needed, ceph fs down due to large stray dir

Hi Frank,

not sure if this has already been mentioned, but this one has a 60 second timeout:

mds_beacon_mon_down_grace

ceph config help mds_beacon_mon_down_grace
mds_beacon_mon_down_grace - tolerance in seconds for missed MDS beacons to monitors
  (secs, advanced)
  Default: 60
  Can update at runtime: true
  Services: [mon]

Maybe bumping that up could help here?

Zitat von Frank Schilder <frans@xxxxxx>:

> And another small piece of information:
>
> Needed to do another restart. This time I managed to capture the
> approximate length of the period for which the MDS is up and
> responsive after loading the cache (it reports stats). It's pretty
> much exactly 60 seconds. This smells like a timeout. Is there any
> MDS/ceph-fs related timeout with a 60s default somewhere?
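For reference, a minimal sketch of how both options could be inspected and adjusted at runtime via the config database. This assumes a cephadm-style cluster with the centralized config store; the values below are examples for illustration, not recommendations:

```shell
# Sketch only - run against a live cluster. Check the current values first:
ceph config get mon auth_service_ticket_ttl
ceph config get mon mds_beacon_mon_down_grace

# Example: bump the auth ticket TTL (default 3600 s) and the beacon
# down grace (default 60 s). Both can be changed at runtime:
ceph config set mon auth_service_ticket_ttl 14400
ceph config set mon mds_beacon_mon_down_grace 300

# Revert to defaults later by removing the overrides:
# ceph config rm mon auth_service_ticket_ttl
# ceph config rm mon mds_beacon_mon_down_grace
```

Since `ceph config rm` restores the defaults, a temporary bump during recovery is easy to undo once the MDS is stable again.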
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Frank Schilder <frans@xxxxxx>
> Sent: Saturday, January 11, 2025 12:46 PM
> To: Dan van der Ster
> Cc: Bailey Allison; ceph-users@xxxxxxx
> Subject: Re: Help needed, ceph fs down due to large stray dir
>
> Hi all,
>
> my hopes are down again. The MDS might look busy, but I'm not sure
> it's doing anything interesting. I now see a lot of these in the log
> (stripped the heartbeat messages):
>
> 2025-01-11T12:35:50.712+0100 7ff888375700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2025-01-11T11:35:50.713867+0100)
> 2025-01-11T12:35:51.712+0100 7ff888375700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2025-01-11T11:35:51.714027+0100)
> 2025-01-11T12:35:52.712+0100 7ff888375700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2025-01-11T11:35:52.714335+0100)
> 2025-01-11T12:35:53.084+0100 7ff88cb7e700  0 auth: could not find secret_id=51092
> 2025-01-11T12:35:53.084+0100 7ff88cb7e700  0 cephx: verify_authorizer could not get service secret for service mds secret_id=51092
> 2025-01-11T12:35:53.353+0100 7ff88cb7e700  0 auth: could not find secret_id=51092
> 2025-01-11T12:35:53.353+0100 7ff88cb7e700  0 cephx: verify_authorizer could not get service secret for service mds secret_id=51092
> 2025-01-11T12:35:53.536+0100 7ff88cb7e700  0 auth: could not find secret_id=51092
> 2025-01-11T12:35:53.536+0100 7ff88cb7e700  0 cephx: verify_authorizer could not get service secret for service mds secret_id=51092
> 2025-01-11T12:35:53.573+0100 7ff88cb7e700  0 auth: could not find secret_id=51092
> 2025-01-11T12:35:53.573+0100 7ff88cb7e700  0 cephx: verify_authorizer could not get service secret for service mds secret_id=51092
>
> Looks like the auth key
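To see when these cephx errors started and how often they recur, a quick grep over the MDS log can help. The log path below is an assumption for a classic package-based deployment; containerized MDS daemons typically log to journald instead:

```shell
# Assumed log path - adjust for your deployment (cephadm containers
# log via journald: journalctl -u ceph-<fsid>@mds.<name>).
LOG=/var/log/ceph/ceph-mds.mds1.log

if [ -r "$LOG" ]; then
    # How many times each cephx error pattern occurs:
    grep -c '_check_auth_rotating possible clock skew' "$LOG"
    grep -c 'could not find secret_id' "$LOG"
    # Timestamp of the first occurrence, to correlate with MDS startup:
    grep 'could not find secret_id' "$LOG" | head -n 1
fi
```

Comparing the first occurrence against the MDS start time shows whether the errors appear right after the ~60 second responsive window or only later.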
> for the MDS expired and cannot be renewed.
> Is there a grace period for that as well?
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Frank Schilder <frans@xxxxxx>
> Sent: Saturday, January 11, 2025 11:41 AM
> To: Dan van der Ster
> Cc: Bailey Allison; ceph-users@xxxxxxx
> Subject: Re: Help needed, ceph fs down due to large stray dir
>
> Hi all,
>
> new update: after sleeping after the final MDS restart, the MDS is
> doing something! It is still unresponsive, but it does show CPU load
> of between 150-200%, and I really hope that this is the
> trimming of stray items.
>
> I will try to find out if I can get perf to work inside the container.
> For now, to facilitate troubleshooting, I will add a swap disk to
> every MDS host, just to be on the safe side if stuff fails over.
>
> Just to get my hopes back: can someone (from the dev team) let me
> know if it is expected that an MDS is unresponsive during stray
> evaluation?
>
> Thanks and best regards!
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
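As a footnote on the swap idea: if a spare disk is not at hand, a swap file is a quick alternative. A rough sketch, assuming a host with enough free space on a local filesystem; size and path are examples only, and this needs root:

```shell
# Sketch: add emergency swap on an MDS host (run as root).
# Note: fallocate-backed swap files are not supported on all
# filesystems; fall back to dd if swapon complains about holes.
fallocate -l 64G /swapfile   # or: dd if=/dev/zero of=/swapfile bs=1M count=65536
chmod 600 /swapfile          # swap files must not be world-readable
mkswap /swapfile
swapon /swapfile
swapon --show                # verify the new swap space is active
```

`swapoff /swapfile && rm /swapfile` undoes it once the MDS cache pressure is over, so nothing permanent changes on the host.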