Re: Help needed, ceph fs down due to large stray dir


 



Hi Eugen,

thanks and yes, let's try one thing at a time. I will report back.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Eugen Block <eblock@xxxxxx>
Sent: Saturday, January 11, 2025 10:39 PM
To: Frank Schilder
Cc: ceph-users@xxxxxxx
Subject: Re:  Re: Help needed, ceph fs down due to large stray dir

Personally, I would only try one change at a time and wait for the
result. Otherwise it can get difficult to tell what exactly helped and
what did not.
I have never played with auth_service_ticket_ttl yet, so I can only
refer to the docs here:

> When the Ceph Storage Cluster sends a ticket for authentication to a
> Ceph client, the Ceph Storage Cluster assigns that ticket a Time To
> Live (TTL).

I wouldn't consider MDS daemons clients here, so I doubt this
will help. But then again, I've never tried changing it...

I hope you'll get it back up soon so you can enjoy at least some
portion of the weekend as well. :-)
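For reference, both settings discussed in this thread can be changed at
runtime with `ceph config set`. A sketch with example values, not tested
on this cluster (whether the TTL change helps the MDS at all is an open
question above):

```shell
# Increase the tolerance for missed MDS beacons (mon-side option, default 60s).
ceph config set mon mds_beacon_mon_down_grace 300

# Optionally bump the cephx service ticket TTL (default 3600s); it is
# unclear whether the MDS counts as a client for this purpose.
ceph config set mon auth_service_ticket_ttl 7200

# Verify the values took effect.
ceph config get mon mds_beacon_mon_down_grace
ceph config get mon auth_service_ticket_ttl
```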

Quoting Frank Schilder <frans@xxxxxx>:

> Hi Eugen,
>
> thanks for your reply! It's a long shot, but worth trying. I will
> give it a go.
>
> Since you are following: I also observed cephx timeouts. I'm
> considering increasing the TTL for auth tickets. Do you think
> auth_service_ticket_ttl (default 3600) is the right parameter? If
> so, can I just change it up and down, or can this have side effects
> like all daemons losing their tickets when the TTL is decreased
> from a very large value?
>
> I mainly need the auth ticket for the MDS to remain valid for a long time.
>
> I will wait a bit in case you manage to reply. I would like to
> adjust both values in the next attempt. Otherwise, I will just
> increase beacon down grace.
>
> Thanks a lot again and have a nice Sunday!
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Eugen Block <eblock@xxxxxx>
> Sent: Saturday, January 11, 2025 7:59 PM
> To: ceph-users@xxxxxxx
> Subject:  Re: Help needed, ceph fs down due to large stray dir
>
> Hi Frank,
>
> not sure if this has already been mentioned, but this one has a
> 60-second timeout:
>
> mds_beacon_mon_down_grace
>
> ceph config help mds_beacon_mon_down_grace
> mds_beacon_mon_down_grace - tolerance in seconds for missed MDS
> beacons to monitors
>    (secs, advanced)
>    Default: 60
>    Can update at runtime: true
>    Services: [mon]
>
>
> Maybe bumping that up could help here?
>
>
> Quoting Frank Schilder <frans@xxxxxx>:
>
>> And another small piece of information:
>>
>> Needed to do another restart. This time I managed to capture the
>> approximate length of the period for which the MDS is up and
>> responsive after loading the cache (it reports stats). It's pretty
>> much exactly 60 seconds. This smells like a timeout. Is there any
>> MDS/ceph-fs related timeout with a 60-second default somewhere?
>>
>> Best regards,
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> ________________________________________
>> From: Frank Schilder <frans@xxxxxx>
>> Sent: Saturday, January 11, 2025 12:46 PM
>> To: Dan van der Ster
>> Cc: Bailey Allison; ceph-users@xxxxxxx
>> Subject:  Re: Help needed, ceph fs down due to large stray dir
>>
>> Hi all,
>>
>> my hopes are down again. The MDS might look busy but I'm not sure
>> it's doing anything interesting. I now see a lot of these in the log
>> (stripped the heartbeat messages):
>>
>> 2025-01-11T12:35:50.712+0100 7ff888375700 -1 monclient:
>> _check_auth_rotating possible clock skew, rotating keys expired way
>> too early (before 2025-01-11T11:35:50.713867+0100)
>> 2025-01-11T12:35:51.712+0100 7ff888375700 -1 monclient:
>> _check_auth_rotating possible clock skew, rotating keys expired way
>> too early (before 2025-01-11T11:35:51.714027+0100)
>> 2025-01-11T12:35:52.712+0100 7ff888375700 -1 monclient:
>> _check_auth_rotating possible clock skew, rotating keys expired way
>> too early (before 2025-01-11T11:35:52.714335+0100)
>> 2025-01-11T12:35:53.084+0100 7ff88cb7e700  0 auth: could not find
>> secret_id=51092
>> 2025-01-11T12:35:53.084+0100 7ff88cb7e700  0 cephx:
>> verify_authorizer could not get service secret for service mds
>> secret_id=51092
>> 2025-01-11T12:35:53.353+0100 7ff88cb7e700  0 auth: could not find
>> secret_id=51092
>> 2025-01-11T12:35:53.353+0100 7ff88cb7e700  0 cephx:
>> verify_authorizer could not get service secret for service mds
>> secret_id=51092
>> 2025-01-11T12:35:53.536+0100 7ff88cb7e700  0 auth: could not find
>> secret_id=51092
>> 2025-01-11T12:35:53.536+0100 7ff88cb7e700  0 cephx:
>> verify_authorizer could not get service secret for service mds
>> secret_id=51092
>> 2025-01-11T12:35:53.573+0100 7ff88cb7e700  0 auth: could not find
>> secret_id=51092
>> 2025-01-11T12:35:53.573+0100 7ff88cb7e700  0 cephx:
>> verify_authorizer could not get service secret for service mds
>> secret_id=51092
>>
>> Looks like the auth key for the MDS expired and cannot be renewed.
>> Is there a grace period for that as well?
>>
>> Best regards,
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> ________________________________________
>> From: Frank Schilder <frans@xxxxxx>
>> Sent: Saturday, January 11, 2025 11:41 AM
>> To: Dan van der Ster
>> Cc: Bailey Allison; ceph-users@xxxxxxx
>> Subject:  Re: Help needed, ceph fs down due to large stray dir
>>
>> Hi all,
>>
>> new update: after getting some sleep following the final MDS restart,
>> the MDS is doing something! It is still unresponsive, but it shows a
>> CPU load of 150-200%, and I really, really hope that this is the
>> trimming of stray items.
>>
>> I will try to find out if I can get perf to work inside the container.
>> For now, to facilitate troubleshooting, I will add a swap disk to
>> every MDS host, just to be on the safe side if stuff fails over.
>>
>> Just to get my hopes back: can someone (from the dev team) let me
>> know if it is expected that an MDS is unresponsive during stray
>> evaluation?
>>
>> Thanks and best regards!
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@xxxxxxx
>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>
>






