Re: MDS stuck in rejoin

Hi Frank,

On 7/30/23 16:52, Frank Schilder wrote:
Hi Xiubo,

it happened again. This time, we might be able to pull logs from the client node. Please take a look at my interim action below - thanks!

I'm in a bit of a bind: I'm on holiday with a terrible network connection and can't do much. My first priority is securing the cluster to avoid damage caused by this issue. I evicted the client by ID on the MDS that was reporting the warning, using the client ID from the warning. For some reason the client ended up blocked on 2 MDSes after this command; one of them is an ordinary standby daemon. Not sure if this is expected.

Main question: is this sufficient to prevent any damaging IO on the cluster? I'm thinking here of the MDS eating through all its RAM until it crashes hard in an irrecoverable state (a consequence described in an old post about this warning). If this is a safe state, I can leave things as they are until I return from holiday.

Yeah, I think so.
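
For reference, a rough sketch of the eviction and a follow-up check (the MDS rank and the client ID are placeholders, not taken from this incident):

    # Evict the client by ID on the MDS reporting the warning:
    ceph tell mds.0 client evict id=145678382

    # Eviction normally also adds the client to the OSD blocklist,
    # which is cluster-wide and may be why the client shows up as
    # blocked on more than one MDS:
    ceph osd blocklist ls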

BTW, are you using kernel clients (kclient) or user-space clients? I checked both the kclient and libcephfs code; libcephfs seems to have a bug that could cause this issue, while the kclient looks okay so far.
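
One way to tell the two apart is the session metadata; a sketch (the MDS rank is a placeholder):

    # List sessions on the MDS; in client_metadata, kernel clients
    # report a "kernel_version" field, while user-space clients
    # (libcephfs/ceph-fuse) report a "ceph_version" field:
    ceph tell mds.0 session ls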

Thanks

- Xiubo


Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Xiubo Li <xiubli@xxxxxxxxxx>
Sent: Friday, July 28, 2023 11:37 AM
To: Frank Schilder; ceph-users@xxxxxxx
Subject: Re:  Re: MDS stuck in rejoin


On 7/26/23 22:13, Frank Schilder wrote:
Hi Xiubo.

... I am more interested in the kclient-side logs. I just want to know why that oldest request got stuck for so long.
I'm afraid I was a bad admin in this case: I don't have logs from the host any more. I would have needed the dmesg output, and that is gone. If it happens again I will try to pull the info out.
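
Should it happen again, the kernel client also exposes its in-flight MDS requests via debugfs, which can be captured together with dmesg; a sketch, assuming debugfs is mounted:

    # Kernel ring-buffer messages from the ceph/libceph modules:
    dmesg | grep -i ceph

    # Pending MDS requests, one directory per fsid.client_id:
    cat /sys/kernel/debug/ceph/*/mdsc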

The tracker https://tracker.ceph.com/issues/22885 sounds a lot more violent than our situation. We had no problems with the MDSes: the cache didn't grow, and the relevant MDS was not put into read-only mode either. It was just this warning showing the whole time; health was OK otherwise. I think the warning had been there for at least 16 hours before I failed the MDS.

The MDS log contains nothing else; this is the only line mentioning this client:

2023-07-20T00:22:05.518+0200 7fe13df59700  0 log_channel(cluster) log [WRN] : client.145678382 does not advance its oldest_client_tid (16121616), 100000 completed requests recorded in session
Okay, if so it's hard to dig out what happened in the client and why it didn't advance the tid.
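
The 100000 in the warning corresponds to the MDS option mds_max_completed_requests (default 100000): once a session accumulates that many completed requests without the client advancing its oldest tid, the MDS raises this warning. A sketch of checking the current value:

    # The warning fires once a session holds this many completed
    # requests while the client fails to advance oldest_client_tid:
    ceph config get mds mds_max_completed_requests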

Thanks

- Xiubo


Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



