Re: MDS stuck in rejoin

Frank Schilder <frans@xxxxxx> · Sun, 30 Jul 2023 08:52:42 +0000

Hi Xiubo,

it happened again. This time, we might be able to pull logs from the client node. Please take a look at my intermediate action below - thanks!

I am in a bit of a calamity, I'm on holidays with terrible network connection and can't do much. My first priority is securing the cluster to avoid damage caused by this issue. I did an MDS evict by client ID on the MDS reporting the warning with the client ID reported in the warning. For some reason the client got blocked on 2 MDSes after this command, one of these is an ordinary stand-by daemon. Not sure if this is expected.

Main question: is this sufficient to prevent any damaging IO on the cluster? I'm thinking here about the MDS eating through all its RAM until it crashes hard in an irrecoverable state (that was described as a consequence in an old post about this warning). If this is a safe state, I can keep it in this state until I return from holidays.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Xiubo Li <xiubli@xxxxxxxxxx>
Sent: Friday, July 28, 2023 11:37 AM
To: Frank Schilder; ceph-users@xxxxxxx
Subject: Re:  Re: MDS stuck in rejoin

On 7/26/23 22:13, Frank Schilder wrote:
> Hi Xiubo.
>
>> ... I am more interested in the kclient side logs. Just want to
>> know why that oldest request got stuck so long.
> I'm afraid I'm a bad admin in this case. I don't have logs from the host any more, I would have needed the output of dmesg and this is gone. In case it happens again I will try to pull the info out.
>
> The tracker https://tracker.ceph.com/issues/22885 sounds a lot more violent than our situation. We had no problems with the MDSes, the cache didn't grow and the relevant one was also not put into read-only mode. It was just this warning showing all the time, health was OK otherwise. I think the warning was there for at least 16h before I failed the MDS.
>
> The MDS log contains nothing, this is the only line mentioning this client:
>
> 2023-07-20T00:22:05.518+0200 7fe13df59700  0 log_channel(cluster) log [WRN] : client.145678382 does not advance its oldest_client_tid (16121616), 100000 completed requests recorded in session

Okay, if so it's hard to say and dig out what has happened in client why
it didn't advance the tid.

Thanks

- Xiubo

> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx