Dear Xiubo,

I managed to collect some information. It looks like there is nothing in the dmesg log around the time the client failed to advance its TID. I collected short snippets around the critical time below. I have the full logs in case you are interested; they are large files, so I will need to upload them separately. I also have a dump of "mds session ls" output for clients that showed the same issue later. Unfortunately, there is no consistent log information for a single incident. Here is the summary, please let me know if uploading the full package makes sense:

- Status:

On July 29, 2023

ceph status/df/pool stats/health detail at 01:05:14:
  cluster:
    health: HEALTH_WARN
            1 pools nearfull

ceph status/df/pool stats/health detail at 01:05:28:
  cluster:
    health: HEALTH_WARN
            1 clients failing to advance oldest client/flush tid
            1 pools nearfull

[...]

On July 31, 2023

ceph status/df/pool stats/health detail at 10:36:16:
  cluster:
    health: HEALTH_WARN
            1 clients failing to advance oldest client/flush tid
            1 pools nearfull

  cluster:
    health: HEALTH_WARN
            1 pools nearfull

- client evict command (date, time, command):

  2023-07-31 10:36  ceph tell mds.ceph-11 client evict id=145678457

We have a 1h time difference between the time stamp of the command and the dmesg time stamps. On top of that, there seems to be an odd 10-minute delay from issuing the evict command until it shows up in dmesg on the client.
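For reference, this is roughly how the snippets below were pulled; the exact invocations are from memory, so take them as a sketch rather than a transcript:

  # on the client node: kernel log with wall-clock time stamps (the excerpt under "dmesg:" below)
  dmesg -T

  # on an admin node: confirm that the client address actually landed on the OSD blacklist
  # after the evict (the entry also shows its expiry time); newer releases call this "blocklist"
  ceph osd blacklist ls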
- dmesg:

[Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey
[Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey
[Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey
[Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey
[Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey
[Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey
[Fri Jul 28 16:07:47 2023] slurm.epilog.cl (24175): drop_caches: 3
[Sat Jul 29 18:21:30 2023] libceph: mds2 192.168.32.75:6801 socket closed (con state OPEN)
[Sat Jul 29 18:21:30 2023] libceph: mds2 192.168.32.75:6801 socket closed (con state OPEN)
[Sat Jul 29 18:21:30 2023] libceph: mds2 192.168.32.75:6801 socket closed (con state OPEN)
[Sat Jul 29 18:21:42 2023] ceph: mds2 reconnect start
[Sat Jul 29 18:21:42 2023] ceph: mds2 reconnect start
[Sat Jul 29 18:21:43 2023] ceph: mds2 reconnect start
[Sat Jul 29 18:21:43 2023] ceph: mds2 reconnect success
[Sat Jul 29 18:21:43 2023] ceph: mds2 reconnect success
[Sat Jul 29 18:21:43 2023] ceph: mds2 reconnect success
[Sat Jul 29 18:26:39 2023] ceph: mds2 reconnect start
[Sat Jul 29 18:26:39 2023] ceph: mds2 reconnect start
[Sat Jul 29 18:26:39 2023] ceph: mds2 reconnect start
[Sat Jul 29 18:26:40 2023] ceph: mds2 reconnect success
[Sat Jul 29 18:26:40 2023] ceph: mds2 reconnect success
[Sat Jul 29 18:26:40 2023] ceph: mds2 reconnect success
[Sat Jul 29 18:26:49 2023] ceph: update_snap_trace error -22
[Sat Jul 29 18:26:49 2023] ceph: mds2 recovery completed
[Sat Jul 29 18:26:49 2023] ceph: mds2 recovery completed
[Sat Jul 29 18:26:49 2023] ceph: mds2 recovery completed
[Sun Jul 30 16:37:55 2023] slurm.epilog.cl (43668): drop_caches: 3
[Mon Jul 31 01:00:20 2023] slurm.epilog.cl (73347): drop_caches: 3
[Mon Jul 31 09:46:41 2023] libceph: mds0 192.168.32.81:6801 socket closed (con state OPEN)
[Mon Jul 31 09:46:41 2023] libceph: mds3 192.168.32.87:6801 socket closed (con state OPEN)
[Mon Jul 31 09:46:41 2023] libceph: mds7 192.168.32.88:6801 socket closed (con state OPEN)
[Mon Jul 31 09:46:41 2023] libceph: mds5 192.168.32.78:6801 socket closed (con state OPEN)
[Mon Jul 31 09:46:41 2023] libceph: mds4 192.168.32.73:6801 socket closed (con state OPEN)
[Mon Jul 31 09:46:41 2023] libceph: mds1 192.168.32.80:6801 socket closed (con state OPEN)
[Mon Jul 31 09:46:41 2023] libceph: mds2 192.168.32.75:6801 socket closed (con state OPEN)
[Mon Jul 31 09:46:41 2023] libceph: mds3 192.168.32.87:6801 connection reset
[Mon Jul 31 09:46:41 2023] libceph: reset on mds3
[Mon Jul 31 09:46:41 2023] ceph: mds3 closed our session
[Mon Jul 31 09:46:41 2023] ceph: mds3 reconnect start
[Mon Jul 31 09:46:41 2023] libceph: mds7 192.168.32.88:6801 connection reset
[Mon Jul 31 09:46:41 2023] libceph: reset on mds7
[Mon Jul 31 09:46:41 2023] ceph: mds7 closed our session
[Mon Jul 31 09:46:41 2023] ceph: mds7 reconnect start
[Mon Jul 31 09:46:41 2023] libceph: mds2 192.168.32.75:6801 connection reset
[Mon Jul 31 09:46:41 2023] libceph: reset on mds2
[Mon Jul 31 09:46:41 2023] ceph: mds2 closed our session
[Mon Jul 31 09:46:41 2023] ceph: mds2 reconnect start
[Mon Jul 31 09:46:41 2023] libceph: mds4 192.168.32.73:6801 connection reset
[Mon Jul 31 09:46:41 2023] libceph: reset on mds4
[Mon Jul 31 09:46:41 2023] ceph: mds4 closed our session
[Mon Jul 31 09:46:41 2023] ceph: mds4 reconnect start
[Mon Jul 31 09:46:41 2023] libceph: mds1 192.168.32.80:6801 connection reset
[Mon Jul 31 09:46:41 2023] libceph: reset on mds1
[Mon Jul 31 09:46:41 2023] ceph: mds1 closed our session
[Mon Jul 31 09:46:41 2023] ceph: mds1 reconnect start
[Mon Jul 31 09:46:41 2023] libceph: mds0 192.168.32.81:6801 connection reset
[Mon Jul 31 09:46:41 2023] libceph: reset on mds0
[Mon Jul 31 09:46:41 2023] ceph: mds0 closed our session
[Mon Jul 31 09:46:41 2023] ceph: mds0 reconnect start
[Mon Jul 31 09:46:41 2023] libceph: mds5 192.168.32.78:6801 connection reset
[Mon Jul 31 09:46:41 2023] libceph: reset on mds5
[Mon Jul 31 09:46:41 2023] ceph: mds5 closed our session
[Mon Jul 31 09:46:41 2023] ceph: mds5 reconnect start
[Mon Jul 31 09:46:41 2023] ceph: mds2 reconnect denied
[Mon Jul 31 09:46:41 2023] ceph: mds1 reconnect denied
[Mon Jul 31 09:46:41 2023] ceph: mds0 reconnect denied
[Mon Jul 31 09:46:41 2023] ceph: mds5 reconnect denied
[Mon Jul 31 09:46:41 2023] ceph: mds3 reconnect denied
[Mon Jul 31 09:46:41 2023] ceph: mds7 reconnect denied
[Mon Jul 31 09:46:41 2023] ceph: mds4 reconnect denied
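The "mds session ls" dumps mentioned at the top were collected along these lines. This is only a sketch: ceph-11 is the only daemon name I can vouch for, jq is just used for readability, and the session dump layout differs a bit between releases:

  # list the active MDS daemons, then pull the session entry for the client id from the warning;
  # repeat the second command for each active daemon reported by 'ceph fs status'
  ceph fs status
  ceph tell mds.ceph-11 session ls | jq '.[] | select(.id == 145678457)'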
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Xiubo Li <xiubli@xxxxxxxxxx>
Sent: Monday, July 31, 2023 12:14 PM
To: Frank Schilder; ceph-users@xxxxxxx
Subject: Re: Re: MDS stuck in rejoin

On 7/31/23 16:50, Frank Schilder wrote:
> Hi Xiubo,
>
> It's a kernel client. I actually made a mistake when trying to evict the client and my command didn't do anything. I did another evict and this time the client IP showed up in the blacklist. Furthermore, the warning disappeared. I asked for the dmesg logs from the client node.

Yeah, after the client's sessions are closed the corresponding warning should be cleared.

Thanks

> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Xiubo Li <xiubli@xxxxxxxxxx>
> Sent: Monday, July 31, 2023 4:12 AM
> To: Frank Schilder; ceph-users@xxxxxxx
> Subject: Re: Re: MDS stuck in rejoin
>
> Hi Frank,
>
> On 7/30/23 16:52, Frank Schilder wrote:
>> Hi Xiubo,
>>
>> It happened again. This time, we might be able to pull logs from the client node. Please take a look at my intermediate action below - thanks!
>>
>> I am in a bit of a calamity: I'm on holidays with a terrible network connection and can't do much. My first priority is securing the cluster to avoid damage caused by this issue. I did an MDS evict by client ID on the MDS reporting the warning, with the client ID reported in the warning. For some reason the client got blocked on 2 MDSes after this command, one of which is an ordinary stand-by daemon. Not sure if this is expected.
>>
>> Main question: is this sufficient to prevent any damaging IO on the cluster? I'm thinking here about the MDS eating through all its RAM until it crashes hard in an irrecoverable state (that was described as a consequence in an old post about this warning). If this is a safe state, I can keep it in this state until I return from holidays.
> Yeah, I think so.
>
> BTW, are you using the kclients or user space clients? I checked both
> kclient and libcephfs, it seems buggy in libcephfs, which could cause
> this issue. But for kclient it's okay till now.
>
> Thanks
>
> - Xiubo
>
>> Best regards,
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> ________________________________________
>> From: Xiubo Li <xiubli@xxxxxxxxxx>
>> Sent: Friday, July 28, 2023 11:37 AM
>> To: Frank Schilder; ceph-users@xxxxxxx
>> Subject: Re: Re: MDS stuck in rejoin
>>
>>
>> On 7/26/23 22:13, Frank Schilder wrote:
>>> Hi Xiubo.
>>>
>>>> ... I am more interested in the kclient side logs. Just want to
>>>> know why that oldest request got stuck so long.
>>> I'm afraid I'm a bad admin in this case. I don't have logs from the host any more; I would have needed the output of dmesg and this is gone. In case it happens again I will try to pull the info out.
>>>
>>> The tracker https://tracker.ceph.com/issues/22885 sounds a lot more violent than our situation. We had no problems with the MDSes, the cache didn't grow and the relevant one was also not put into read-only mode. It was just this warning showing all the time; health was OK otherwise. I think the warning was there for at least 16h before I failed the MDS.
>>>
>>> The MDS log contains nothing, this is the only line mentioning this client:
>>>
>>> 2023-07-20T00:22:05.518+0200 7fe13df59700 0 log_channel(cluster) log [WRN] : client.145678382 does not advance its oldest_client_tid (16121616), 100000 completed requests recorded in session
>> Okay, if so it's hard to say and dig out what has happened in the client and why
>> it didn't advance the tid.
>>
>> Thanks
>>
>> - Xiubo
>>
>>
>>> Best regards,
>>> =================
>>> Frank Schilder
>>> AIT Risø Campus
>>> Bygning 109, rum S14
>>>
>>> _______________________________________________
>>> ceph-users mailing list -- ceph-users@xxxxxxx
>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx