Re: MDS stuck in rejoin

On 11/9/23 23:41, Frank Schilder wrote:
Hi Xiubo,

great! I'm not sure if we observed this particular issue, but we did see the "oldest_client_tid updates not advancing" message in a context that might be related.

If this fix is not too large, it would be really great if it could be included in the last Pacific point release.

Yeah, this will be backported after it gets merged. But for the kclient we still need another patch.

Thanks

- Xiubo



Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Xiubo Li <xiubli@xxxxxxxxxx>
Sent: Wednesday, November 8, 2023 1:38 AM
To: Frank Schilder; ceph-users@xxxxxxx
Subject: Re:  Re: MDS stuck in rejoin

Hi Frank,

Recently I found a new possible case that could cause this; please see
https://github.com/ceph/ceph/pull/54259. This is just a ceph-side fix;
after this we still need to fix it in the kclient too, which hasn't been done yet.

Thanks

- Xiubo

On 8/8/23 17:44, Frank Schilder wrote:
Dear Xiubo,

the nearfull pool is an RBD pool and has nothing to do with the file system. All pools for the file system have plenty of capacity.

I think we have an idea what kind of workload caused the issue. We had a user run a computation that reads the same file over and over again. He started 100 such jobs in parallel and our storage servers were at 400% load. I saw 167K read IOP/s on an HDD pool that has an aggregated raw IOP/s budget of ca. 11K. Clearly, most of this was served from RAM.
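
For reference, a minimal sketch of the kind of load pattern described here (the path and the bounded loop count are placeholders, not taken from the actual jobs):

    # Hypothetical reproduction of the described access pattern:
    # many parallel jobs, each re-reading the same input file.
    INPUT=/mnt/cephfs/some/input.dat     # placeholder path
    for i in $(seq 1 100); do
        ( for n in $(seq 1 1000); do     # bounded loop for the sketch
              cat "$INPUT" > /dev/null   # repeated reads, mostly served from page cache
          done ) &
    done
    wait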

It is possible that this extreme load situation triggered a race that remained undetected/unreported. There is literally no related message in any logs near the time the warning started popping up. It shows up out of nowhere.

We asked the user to change his workflow to use local RAM disk for the input files. I don't think we can reproduce the problem anytime soon.

About the bug fixes, I'm eagerly waiting for this and another one. Any idea when they might show up in distro kernels?

Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Xiubo Li <xiubli@xxxxxxxxxx>
Sent: Tuesday, August 8, 2023 2:57 AM
To: Frank Schilder; ceph-users@xxxxxxx
Subject: Re:  Re: MDS stuck in rejoin


On 8/7/23 21:54, Frank Schilder wrote:
Dear Xiubo,

I managed to collect some information. It looks like there is nothing in the dmesg log around the time the client failed to advance its TID. I collected short snippets around the critical time below. I have full logs in case you are interested. They are large files, so I will need to do an upload for that.

I also have a dump of "mds session ls" output for clients that showed the same issue later. Unfortunately, no consistent log information for a single incident.

Here the summary, please let me know if uploading the full package makes sense:

- Status:

On July 29, 2023

ceph status/df/pool stats/health detail at 01:05:14:
     cluster:
       health: HEALTH_WARN
               1 pools nearfull

ceph status/df/pool stats/health detail at 01:05:28:
     cluster:
       health: HEALTH_WARN
               1 clients failing to advance oldest client/flush tid
               1 pools nearfull
Okay, then this could be the root cause.

If the pool is nearfull, it could block flushing the journal logs to the pool,
and then the MDS couldn't send safe replies to the requests and would leave
them blocked like this.

Could you fix the pool nearfull issue first and then check whether you see
it again?
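
For anyone following the thread, a hedged sketch of how the nearfull state could be checked and relieved before re-testing (the ratio value is only an example):

    # Check which pools/OSDs are nearfull and the configured thresholds.
    ceph health detail
    ceph df
    ceph osd dump | grep -E 'full_ratio|nearfull_ratio'

    # Either free up / add capacity, or temporarily raise the threshold.
    ceph osd set-nearfull-ratio 0.90    # example value only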


[...]

On July 31, 2023

ceph status/df/pool stats/health detail at 10:36:16:
     cluster:
       health: HEALTH_WARN
               1 clients failing to advance oldest client/flush tid
               1 pools nearfull

     cluster:
       health: HEALTH_WARN
               1 pools nearfull

- client evict command (date, time, command):

2023-07-31 10:36  ceph tell mds.ceph-11 client evict id=145678457

We have a 1h time difference between the date stamp of the command and the dmesg date stamps. However, there seems to be a weird 10min delay from issuing the evict command until it shows up in dmesg on the client.
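
As an aside, a hedged way to double-check that an eviction like the one above actually took effect (the MDS name and client id are the ones quoted above, used purely as an example):

    # The session should no longer be listed on the MDS that reported the warning.
    ceph tell mds.ceph-11 session ls | grep 145678457

    # Evicted clients are added to the OSD blocklist
    # (called "blacklist" on older releases).
    ceph osd blocklist ls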

- dmesg:

[Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey
[Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey
[Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey
[Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey
[Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey
[Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey
[Fri Jul 28 16:07:47 2023] slurm.epilog.cl (24175): drop_caches: 3
[Sat Jul 29 18:21:30 2023] libceph: mds2 192.168.32.75:6801 socket closed (con state OPEN)
[Sat Jul 29 18:21:30 2023] libceph: mds2 192.168.32.75:6801 socket closed (con state OPEN)
[Sat Jul 29 18:21:30 2023] libceph: mds2 192.168.32.75:6801 socket closed (con state OPEN)
[Sat Jul 29 18:21:42 2023] ceph: mds2 reconnect start
[Sat Jul 29 18:21:42 2023] ceph: mds2 reconnect start
[Sat Jul 29 18:21:43 2023] ceph: mds2 reconnect start
[Sat Jul 29 18:21:43 2023] ceph: mds2 reconnect success
[Sat Jul 29 18:21:43 2023] ceph: mds2 reconnect success
[Sat Jul 29 18:21:43 2023] ceph: mds2 reconnect success
[Sat Jul 29 18:26:39 2023] ceph: mds2 reconnect start
[Sat Jul 29 18:26:39 2023] ceph: mds2 reconnect start
[Sat Jul 29 18:26:39 2023] ceph: mds2 reconnect start
[Sat Jul 29 18:26:40 2023] ceph: mds2 reconnect success
[Sat Jul 29 18:26:40 2023] ceph: mds2 reconnect success
[Sat Jul 29 18:26:40 2023] ceph: mds2 reconnect success
[Sat Jul 29 18:26:49 2023] ceph: update_snap_trace error -22
This is a known bug and we have fixed it on both the kclient and the ceph side:

https://tracker.ceph.com/issues/61200

https://tracker.ceph.com/issues/61217

Thanks

- Xiubo


[Sat Jul 29 18:26:49 2023] ceph: mds2 recovery completed
[Sat Jul 29 18:26:49 2023] ceph: mds2 recovery completed
[Sat Jul 29 18:26:49 2023] ceph: mds2 recovery completed
[Sun Jul 30 16:37:55 2023] slurm.epilog.cl (43668): drop_caches: 3
[Mon Jul 31 01:00:20 2023] slurm.epilog.cl (73347): drop_caches: 3
[Mon Jul 31 09:46:41 2023] libceph: mds0 192.168.32.81:6801 socket closed (con state OPEN)
[Mon Jul 31 09:46:41 2023] libceph: mds3 192.168.32.87:6801 socket closed (con state OPEN)
[Mon Jul 31 09:46:41 2023] libceph: mds7 192.168.32.88:6801 socket closed (con state OPEN)
[Mon Jul 31 09:46:41 2023] libceph: mds5 192.168.32.78:6801 socket closed (con state OPEN)
[Mon Jul 31 09:46:41 2023] libceph: mds4 192.168.32.73:6801 socket closed (con state OPEN)
[Mon Jul 31 09:46:41 2023] libceph: mds1 192.168.32.80:6801 socket closed (con state OPEN)
[Mon Jul 31 09:46:41 2023] libceph: mds2 192.168.32.75:6801 socket closed (con state OPEN)
[Mon Jul 31 09:46:41 2023] libceph: mds3 192.168.32.87:6801 connection reset
[Mon Jul 31 09:46:41 2023] libceph: reset on mds3
[Mon Jul 31 09:46:41 2023] ceph: mds3 closed our session
[Mon Jul 31 09:46:41 2023] ceph: mds3 reconnect start
[Mon Jul 31 09:46:41 2023] libceph: mds7 192.168.32.88:6801 connection reset
[Mon Jul 31 09:46:41 2023] libceph: reset on mds7
[Mon Jul 31 09:46:41 2023] ceph: mds7 closed our session
[Mon Jul 31 09:46:41 2023] ceph: mds7 reconnect start
[Mon Jul 31 09:46:41 2023] libceph: mds2 192.168.32.75:6801 connection reset
[Mon Jul 31 09:46:41 2023] libceph: reset on mds2
[Mon Jul 31 09:46:41 2023] ceph: mds2 closed our session
[Mon Jul 31 09:46:41 2023] ceph: mds2 reconnect start
[Mon Jul 31 09:46:41 2023] libceph: mds4 192.168.32.73:6801 connection reset
[Mon Jul 31 09:46:41 2023] libceph: reset on mds4
[Mon Jul 31 09:46:41 2023] ceph: mds4 closed our session
[Mon Jul 31 09:46:41 2023] ceph: mds4 reconnect start
[Mon Jul 31 09:46:41 2023] libceph: mds1 192.168.32.80:6801 connection reset
[Mon Jul 31 09:46:41 2023] libceph: reset on mds1
[Mon Jul 31 09:46:41 2023] ceph: mds1 closed our session
[Mon Jul 31 09:46:41 2023] ceph: mds1 reconnect start
[Mon Jul 31 09:46:41 2023] libceph: mds0 192.168.32.81:6801 connection reset
[Mon Jul 31 09:46:41 2023] libceph: reset on mds0
[Mon Jul 31 09:46:41 2023] ceph: mds0 closed our session
[Mon Jul 31 09:46:41 2023] ceph: mds0 reconnect start
[Mon Jul 31 09:46:41 2023] libceph: mds5 192.168.32.78:6801 connection reset
[Mon Jul 31 09:46:41 2023] libceph: reset on mds5
[Mon Jul 31 09:46:41 2023] ceph: mds5 closed our session
[Mon Jul 31 09:46:41 2023] ceph: mds5 reconnect start
[Mon Jul 31 09:46:41 2023] ceph: mds2 reconnect denied
[Mon Jul 31 09:46:41 2023] ceph: mds1 reconnect denied
[Mon Jul 31 09:46:41 2023] ceph: mds0 reconnect denied
[Mon Jul 31 09:46:41 2023] ceph: mds5 reconnect denied
[Mon Jul 31 09:46:41 2023] ceph: mds3 reconnect denied
[Mon Jul 31 09:46:41 2023] ceph: mds7 reconnect denied
[Mon Jul 31 09:46:41 2023] ceph: mds4 reconnect denied

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Xiubo Li <xiubli@xxxxxxxxxx>
Sent: Monday, July 31, 2023 12:14 PM
To: Frank Schilder; ceph-users@xxxxxxx
Subject: Re:  Re: MDS stuck in rejoin


On 7/31/23 16:50, Frank Schilder wrote:
Hi Xiubo,

it's a kernel client. I actually made a mistake when trying to evict the client, and my command didn't do anything. I did another evict and this time the client IP showed up in the blacklist. Furthermore, the warning disappeared. I asked for the dmesg logs from the client node.
Yeah, after the client's sessions are closed the corresponding warning
should be cleared.

Thanks

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Xiubo Li <xiubli@xxxxxxxxxx>
Sent: Monday, July 31, 2023 4:12 AM
To: Frank Schilder; ceph-users@xxxxxxx
Subject: Re:  Re: MDS stuck in rejoin

Hi Frank,

On 7/30/23 16:52, Frank Schilder wrote:
Hi Xiubo,

it happened again. This time, we might be able to pull logs from the client node. Please take a look at my intermediate action below - thanks!

I am in a bit of a calamity: I'm on holidays with a terrible network connection and can't do much. My first priority is securing the cluster to avoid damage caused by this issue. I did an MDS evict on the MDS reporting the warning, using the client ID from the warning. For some reason the client got blocked on 2 MDSes after this command; one of these is an ordinary stand-by daemon. Not sure if this is expected.

Main question: is this sufficient to prevent any damaging IO on the cluster? I'm thinking here about the MDS eating through all its RAM until it crashes hard in an irrecoverable state (that was described as a consequence in an old post about this warning). If this is a safe state, I can keep it in this state until I return from holidays.
Yeah, I think so.

BTW, are you using the kclient or user-space clients? I checked both the
kclient and libcephfs; it seems buggy in libcephfs, which could cause
this issue. But the kclient looks okay so far.

Thanks

- Xiubo
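
For readers with the same question: one hedged way to distinguish kernel clients from user-space clients is the client metadata in the MDS session listing; kernel clients usually report a kernel_version field, while ceph-fuse/libcephfs clients report a ceph_version field (exact field names can vary between releases).

    # Dump all sessions and look at how each client identifies itself.
    ceph tell mds.* session ls | grep -E '"kernel_version"|"ceph_version"|"hostname"'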

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Xiubo Li <xiubli@xxxxxxxxxx>
Sent: Friday, July 28, 2023 11:37 AM
To: Frank Schilder; ceph-users@xxxxxxx
Subject: Re:  Re: MDS stuck in rejoin


On 7/26/23 22:13, Frank Schilder wrote:
Hi Xiubo.

... I am more interested in the kclient side logs. Just want to
know why that oldest request got stuck so long.
I'm afraid I'm a bad admin in this case. I don't have logs from the host any more; I would have needed the output of dmesg, and that is gone. In case it happens again I will try to pull the info out.

The tracker https://tracker.ceph.com/issues/22885 sounds a lot more violent than our situation. We had no problems with the MDSes, the cache didn't grow, and the relevant one was also not put into read-only mode. It was just this warning showing all the time; health was OK otherwise. I think the warning was there for at least 16h before I failed the MDS.

The MDS log contains nothing, this is the only line mentioning this client:

2023-07-20T00:22:05.518+0200 7fe13df59700  0 log_channel(cluster) log [WRN] : client.145678382 does not advance its oldest_client_tid (16121616), 100000 completed requests recorded in session
Okay, if so it's hard to say or dig out what happened in the client and why
it didn't advance the tid.

Thanks

- Xiubo
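
For completeness, a hedged sketch of what can still be collected from the cluster side when this warning shows up; as far as I understand, the 100000 threshold in the quoted log line corresponds to the MDS option mds_max_completed_requests (worth double-checking for your release):

    # Identify the affected client and MDS rank.
    ceph health detail

    # Dump the client's session, including its completed-request counters
    # (exact field names vary by release).
    ceph tell mds.<name> session ls

    # The 100000 threshold from the warning corresponds to this option.
    ceph config get mds mds_max_completed_requests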


Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14




_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



