Hi Frank,

It's possible that some parameters you modified at some point, which may have helped the MDS start up, are now slowing down its operation or preventing it from going further. In that case, resetting these parameters to their default values could help. Just a thought.

Another thing: did you have a chance to mount and 'stat' this tree like Bailey suggested on Friday, and to monitor stray entries with a perf dump or, as Spencer advised, by looking at the omap keys of the special RADOS objects in the metadata pool? Not sure whether the 'stat' would help in the current state the MDS is in, but you may want to consider this move if you can.

Regards,
Frédéric.

________________________________
From: Frank Schilder <frans@xxxxxx>
Sent: Sunday, January 12, 2025 00:07
To: Eugen Block
Cc: ceph-users@xxxxxxx
Subject: Re: Help needed, ceph fs down due to large stray dir

Hi Eugen,

as promised, the result. Unfortunately, increasing this parameter does not seem to help. It was worth a try, though. I will keep the MDS running and check again tomorrow. It's really annoying that it doesn't come back. Going by the reports of other people who were in a similar situation, it should just work after adding swap. I wonder what is so special about our case.

Thanks for your input and have a good night!

=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Frank Schilder <frans@xxxxxx>
Sent: Saturday, January 11, 2025 10:43 PM
To: Eugen Block
Cc: ceph-users@xxxxxxx
Subject: Re: Re: Help needed, ceph fs down due to large stray dir

Hi Eugen,

thanks, and yes, let's try one thing at a time. I will report back.
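For reference, the perf-dump and omap checks mentioned above could look roughly like this — a sketch with assumed names: an MDS daemon called `mds.ceph-01` and a metadata pool called `cephfs_metadata`. The stray directories of rank 0 live in the RADOS objects 600.00000000 through 609.00000000, one omap key per stray dentry:

```shell
# Stray counters from the running MDS's perf dump:
ceph daemon mds.ceph-01 perf dump mds_cache | grep -i stray

# Count the entries in each of the ten stray directories directly
# in the metadata pool:
for i in 0 1 2 3 4 5 6 7 8 9; do
    printf '60%s.00000000: ' "$i"
    rados -p cephfs_metadata listomapkeys "60${i}.00000000" | wc -l
done
```

Watching `num_strays` shrink over time (or the omap key counts drop) would at least confirm the MDS is making progress on stray trimming.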
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Eugen Block <eblock@xxxxxx>
Sent: Saturday, January 11, 2025 10:39 PM
To: Frank Schilder
Cc: ceph-users@xxxxxxx
Subject: Re: Re: Help needed, ceph fs down due to large stray dir

Personally, I would only try one change at a time and wait for a result. Otherwise it can get difficult to tell what exactly helped and what didn't. I have never played with auth_service_ticket_ttl yet, so I can only refer to the docs here:

> When the Ceph Storage Cluster sends a ticket for authentication to a
> Ceph client, the Ceph Storage Cluster assigns that ticket a Time To
> Live (TTL).

I wouldn't consider MDS daemons to be clients here, so I doubt this will help. But then again, I have never tried to change it... I hope you'll get it back up soon so you can enjoy at least some portion of the weekend as well. :-)

Quoting Frank Schilder <frans@xxxxxx>:

> Hi Eugen,
>
> thanks for your reply! It's a long shot, but worth trying. I will
> give it a go.
>
> Since you are following: I also observed cephx timeouts. I'm
> considering increasing the TTL for auth tickets. Do you think
> auth_service_ticket_ttl (default 3600) is the right parameter? If
> so, can I just change it up and down, or can this have side effects,
> like all daemons losing their tickets when the TTL is decreased
> from a very large value?
>
> I mainly need the auth ticket for the MDS to remain valid for a long time.
>
> I will wait a bit in case you manage to reply. I would like to
> adjust both values in the next attempt. Otherwise, I will just
> increase the beacon down grace.
>
> Thanks a lot again and have a nice Sunday!
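Should one still want to experiment with the ticket TTL, the change can be made and reverted at runtime; the 24-hour value below is only an illustration, not a recommendation:

```shell
# Current value (the default is 3600 seconds):
ceph config get mon auth_service_ticket_ttl

# Raise the ticket TTL cluster-wide; this only affects tickets
# issued after the change:
ceph config set global auth_service_ticket_ttl 86400

# Revert to the default once the MDS has recovered:
ceph config rm global auth_service_ticket_ttl
```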
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Eugen Block <eblock@xxxxxx>
> Sent: Saturday, January 11, 2025 7:59 PM
> To: ceph-users@xxxxxxx
> Subject: Re: Help needed, ceph fs down due to large stray dir
>
> Hi Frank,
>
> not sure if this has already been mentioned, but this one has a
> 60-second timeout:
>
> mds_beacon_mon_down_grace
>
> ceph config help mds_beacon_mon_down_grace
> mds_beacon_mon_down_grace - tolerance in seconds for missed MDS
> beacons to monitors
>   (secs, advanced)
>   Default: 60
>   Can update at runtime: true
>   Services: [mon]
>
> Maybe bumping that up could help here?
>
> Quoting Frank Schilder <frans@xxxxxx>:
>
>> And another small piece of information:
>>
>> I needed to do another restart. This time I managed to capture the
>> approximate length of the period for which the MDS is up and
>> responsive after loading the cache (it reports stats). It's pretty
>> much exactly 60 seconds. This smells like a timeout. Is there any
>> MDS/CephFS-related timeout with a 60-second default somewhere?
>>
>> Best regards,
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> ________________________________________
>> From: Frank Schilder <frans@xxxxxx>
>> Sent: Saturday, January 11, 2025 12:46 PM
>> To: Dan van der Ster
>> Cc: Bailey Allison; ceph-users@xxxxxxx
>> Subject: Re: Help needed, ceph fs down due to large stray dir
>>
>> Hi all,
>>
>> my hopes are down again. The MDS might look busy, but I'm not sure
>> it's doing anything interesting.
>> I now see a lot of these in the log (heartbeat messages stripped):
>>
>> 2025-01-11T12:35:50.712+0100 7ff888375700 -1 monclient:
>> _check_auth_rotating possible clock skew, rotating keys expired way
>> too early (before 2025-01-11T11:35:50.713867+0100)
>> 2025-01-11T12:35:51.712+0100 7ff888375700 -1 monclient:
>> _check_auth_rotating possible clock skew, rotating keys expired way
>> too early (before 2025-01-11T11:35:51.714027+0100)
>> 2025-01-11T12:35:52.712+0100 7ff888375700 -1 monclient:
>> _check_auth_rotating possible clock skew, rotating keys expired way
>> too early (before 2025-01-11T11:35:52.714335+0100)
>> 2025-01-11T12:35:53.084+0100 7ff88cb7e700  0 auth: could not find
>> secret_id=51092
>> 2025-01-11T12:35:53.084+0100 7ff88cb7e700  0 cephx:
>> verify_authorizer could not get service secret for service mds
>> secret_id=51092
>> 2025-01-11T12:35:53.353+0100 7ff88cb7e700  0 auth: could not find
>> secret_id=51092
>> 2025-01-11T12:35:53.353+0100 7ff88cb7e700  0 cephx:
>> verify_authorizer could not get service secret for service mds
>> secret_id=51092
>> 2025-01-11T12:35:53.536+0100 7ff88cb7e700  0 auth: could not find
>> secret_id=51092
>> 2025-01-11T12:35:53.536+0100 7ff88cb7e700  0 cephx:
>> verify_authorizer could not get service secret for service mds
>> secret_id=51092
>> 2025-01-11T12:35:53.573+0100 7ff88cb7e700  0 auth: could not find
>> secret_id=51092
>> 2025-01-11T12:35:53.573+0100 7ff88cb7e700  0 cephx:
>> verify_authorizer could not get service secret for service mds
>> secret_id=51092
>>
>> Looks like the auth key for the MDS expired and cannot be renewed.
>> Is there a grace period for that as well?
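Since the monclient explicitly complains about possible clock skew, and cephx rotating keys are time-sensitive, a quick sanity check of time synchronization across the mon and MDS hosts would be a reasonable first step — a sketch, with the hostnames below being placeholders:

```shell
# The monitors' own view of clock skew between quorum members:
ceph time-sync-status

# Per-host check with chrony (hostnames are examples; adapt to
# your mon and MDS hosts):
for h in mon-01 mon-02 mon-03 mds-01; do
    echo "== $h =="
    ssh "$h" 'chronyc tracking | grep -E "System time|Leap status"'
done
```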
>>
>> Best regards,
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> ________________________________________
>> From: Frank Schilder <frans@xxxxxx>
>> Sent: Saturday, January 11, 2025 11:41 AM
>> To: Dan van der Ster
>> Cc: Bailey Allison; ceph-users@xxxxxxx
>> Subject: Re: Help needed, ceph fs down due to large stray dir
>>
>> Hi all,
>>
>> new update: after sleeping following the final MDS restart, the MDS
>> is doing something! It is still unresponsive, but it shows a CPU
>> load of 150-200%, and I really, really hope that this is the
>> trimming of stray items.
>>
>> I will try to find out whether I can get perf to work inside the
>> container. For now, to facilitate troubleshooting, I will add a
>> swap disk to every MDS host, just to be on the safe side if things
>> fail over.
>>
>> Just to get my hopes back up: can someone (from the dev team) let
>> me know whether it is expected that an MDS is unresponsive during
>> stray evaluation?
>>
>> Thanks and best regards!
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@xxxxxxx
>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
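Adding swap as described in the thread could look like this — a sketch assuming a 64 GiB swap file at /var/swapfile, run as root on each MDS host; size and path are arbitrary choices, not recommendations:

```shell
# Create and enable a swap file so the kernel can page out cold MDS
# cache pages instead of invoking the OOM killer:
fallocate -l 64G /var/swapfile
chmod 600 /var/swapfile
mkswap /var/swapfile
swapon /var/swapfile

# Verify the new swap is active:
swapon --show
```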