Hi Eugen,

thanks and yes, let's try one thing at a time. I will report back.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Eugen Block <eblock@xxxxxx>
Sent: Saturday, January 11, 2025 10:39 PM
To: Frank Schilder
Cc: ceph-users@xxxxxxx
Subject: Re: Re: Help needed, ceph fs down due to large stray dir

Personally, I would only try one change at a time and wait for a result. Otherwise it can get difficult to tell what exactly helped and what didn't.

I have never played with auth_service_ticket_ttl yet, so I can only refer to the docs here:

> When the Ceph Storage Cluster sends a ticket for authentication to a
> Ceph client, the Ceph Storage Cluster assigns that ticket a Time To
> Live (TTL).

I wouldn't consider the MDS daemons clients here, so I doubt this will help. But then again, I have never tried to change it...

I hope you'll get it back up soon so you can enjoy at least some portion of the weekend as well. :-)

Quoting Frank Schilder <frans@xxxxxx>:

> Hi Eugen,
>
> thanks for your reply! It's a long shot, but worth trying. I will
> give it a go.
>
> Since you are following: I also observed cephx timeouts. I'm
> considering increasing the TTL for auth tickets. Do you think
> auth_service_ticket_ttl (default 3600) is the right parameter? If
> so, can I just change it up and down, or can this have side effects
> like all daemons losing their tickets when the TTL is decreased
> from a very large value?
>
> I mainly need the auth ticket for the MDS to remain valid for a long time.
>
> I will wait a bit in case you manage to reply. I would like to
> adjust both values in the next attempt. Otherwise, I will just
> increase the beacon down grace.
>
> Thanks a lot again and have a nice Sunday!
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Eugen Block <eblock@xxxxxx>
> Sent: Saturday, January 11, 2025 7:59 PM
> To: ceph-users@xxxxxxx
> Subject: Re: Help needed, ceph fs down due to large stray dir
>
> Hi Frank,
>
> not sure if this has already been mentioned, but this one has a
> 60-second timeout:
>
> mds_beacon_mon_down_grace
>
> ceph config help mds_beacon_mon_down_grace
> mds_beacon_mon_down_grace - tolerance in seconds for missed MDS
> beacons to monitors
>   (secs, advanced)
>   Default: 60
>   Can update at runtime: true
>   Services: [mon]
>
> Maybe bumping that up could help here?
>
> Quoting Frank Schilder <frans@xxxxxx>:
>
>> And another small piece of information:
>>
>> I needed to do another restart. This time I managed to capture the
>> approximate length of the period for which the MDS is up and
>> responsive after loading the cache (it reports stats). It's pretty
>> much exactly 60 seconds. This smells like a timeout. Is there any
>> MDS/ceph-fs related timeout with a 60s default somewhere?
>>
>> Best regards,
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
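For reference, a minimal sketch of how the beacon down grace discussed above could be inspected and raised at runtime via the centralized config database. The 600-second value and the mon.<id> placeholder are illustrative assumptions, not settings taken from this thread:

    # current value in the monitors' config database
    ceph config get mon mds_beacon_mon_down_grace

    # raise the tolerance for missed MDS beacons; the option is consumed
    # by the mons ("Services: [mon]" in the help output above), hence the
    # "mon" target
    ceph config set mon mds_beacon_mon_down_grace 600

    # confirm what a running monitor actually uses (mon.<id> is a placeholder)
    ceph config show mon.<id> mds_beacon_mon_down_grace

Since the option reports "Can update at runtime: true", no daemon restart should be needed for the change to take effect.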
>> ________________________________________
>> From: Frank Schilder <frans@xxxxxx>
>> Sent: Saturday, January 11, 2025 12:46 PM
>> To: Dan van der Ster
>> Cc: Bailey Allison; ceph-users@xxxxxxx
>> Subject: Re: Help needed, ceph fs down due to large stray dir
>>
>> Hi all,
>>
>> my hopes are down again. The MDS might look busy, but I'm not sure
>> it's doing anything interesting. I now see a lot of these in the log
>> (stripped the heartbeat messages):
>>
>> 2025-01-11T12:35:50.712+0100 7ff888375700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2025-01-11T11:35:50.713867+0100)
>> 2025-01-11T12:35:51.712+0100 7ff888375700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2025-01-11T11:35:51.714027+0100)
>> 2025-01-11T12:35:52.712+0100 7ff888375700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2025-01-11T11:35:52.714335+0100)
>> 2025-01-11T12:35:53.084+0100 7ff88cb7e700 0 auth: could not find secret_id=51092
>> 2025-01-11T12:35:53.084+0100 7ff88cb7e700 0 cephx: verify_authorizer could not get service secret for service mds secret_id=51092
>> 2025-01-11T12:35:53.353+0100 7ff88cb7e700 0 auth: could not find secret_id=51092
>> 2025-01-11T12:35:53.353+0100 7ff88cb7e700 0 cephx: verify_authorizer could not get service secret for service mds secret_id=51092
>> 2025-01-11T12:35:53.536+0100 7ff88cb7e700 0 auth: could not find secret_id=51092
>> 2025-01-11T12:35:53.536+0100 7ff88cb7e700 0 cephx: verify_authorizer could not get service secret for service mds secret_id=51092
>> 2025-01-11T12:35:53.573+0100 7ff88cb7e700 0 auth: could not find secret_id=51092
>> 2025-01-11T12:35:53.573+0100 7ff88cb7e700 0 cephx: verify_authorizer could not get service secret for service mds secret_id=51092
>>
>> Looks like the auth key for the MDS expired and cannot be renewed.
>> Is there a grace period for that as well?
>>
>> Best regards,
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> ________________________________________
>> From: Frank Schilder <frans@xxxxxx>
>> Sent: Saturday, January 11, 2025 11:41 AM
>> To: Dan van der Ster
>> Cc: Bailey Allison; ceph-users@xxxxxxx
>> Subject: Re: Help needed, ceph fs down due to large stray dir
>>
>> Hi all,
>>
>> new update: after sleeping after the final MDS restart, the MDS is
>> doing something! It is still unresponsive, but it does show a CPU load
>> of 150-200%, and I really, really hope that this is the trimming of
>> stray items.
>>
>> I will try to find out if I can get perf to work inside the container.
>> For now, to facilitate troubleshooting, I will add a swap disk to
>> every MDS host, just to be on the safe side if stuff fails over.
>>
>> Just to get my hopes back: can someone (from the dev team) let me
>> know whether it is expected that an MDS is unresponsive during stray
>> evaluation?
>>
>> Thanks and best regards!
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
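As a rough illustration of the checks discussed in the quoted messages above, assuming chrony is the time source on the hosts; the 7200-second TTL is an example value only, and whether raising it helps the MDS rotating keys at all is left open in the thread:

    # clock skew among the monitors (the MDS log above complains about
    # "possible clock skew" when rotating keys expire too early)
    ceph time-sync-status

    # NTP/chrony state on the MDS and mon hosts (assumes chrony is in use)
    chronyc tracking

    # inspect and, if desired, raise the cephx service ticket TTL
    ceph config get mon auth_service_ticket_ttl
    ceph config set global auth_service_ticket_ttl 7200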