I just tried what sending SIGSTOP and SIGCONT does. After stopping the process, 3 caps were returned. After resuming the process, these 3 caps were allocated again. There seems to be a large number of stale caps that are not released. While the process was stopped, the kworker thread continued to show 2% CPU usage even though there was no file IO going on.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Frank Schilder <frans@xxxxxx>
Sent: Thursday, October 19, 2023 10:02 AM
To: Stefan Kooman; ceph-users@xxxxxxx
Subject: Re: stuck MDS warning: Client HOST failing to respond to cache pressure

Hi Stefan,

the jobs ended and the warning disappeared as expected. However, a new job started and the warning showed up again. There is something very strange going on and, maybe, you can help out here:

We have a low client CAPS limit configured for performance reasons:

# ceph config dump | grep client
[...]
mds    advanced    mds_max_caps_per_client    65536

The job in question holds more than that:

# ceph tell mds.0 session ls | jq -c '[.[] | {id: .id, h: .client_metadata.hostname, addr: .inst, fs: .client_metadata.root, caps: .num_caps, req: .request_load_avg}]|sort_by(.caps)|.[]' | tail
[...]
{"id":172249397,"h":"sn272...","addr":"client.172249397 v1:192.168.57.143:0/195146548","fs":"/hpc/home","caps":105417,"req":1442}

This CAPS allocation is stable over time; the number doesn't change (I queried multiple times at intervals of several minutes). My guess is that the MDS message is not about cache pressure but rather about caps trimming. We do have clients that regularly exceed the limit without triggering MDS warnings, though. My guess is that these return at least some CAPS on request and are, therefore, not flagged. The client above seems to sit on a fixed set of CAPS that doesn't change, and this causes the warning to show up.
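
The comparison above, per-session caps against mds_max_caps_per_client, can also be scripted. The following is only a minimal sketch: it assumes the JSON printed by `ceph tell mds.0 session ls` has been saved or piped in, and it relies only on the fields already used in the jq query (`id`, `num_caps`, `client_metadata.hostname`). The function name `sessions_over_limit` and the second sample session are hypothetical.

```python
import json


def sessions_over_limit(session_ls_json, max_caps):
    """Return (num_caps, id, hostname) tuples for sessions holding more
    than max_caps caps, largest first.

    session_ls_json is the JSON text of a session list (a JSON array of
    session objects); max_caps is the configured per-client limit.
    """
    sessions = json.loads(session_ls_json)
    over = []
    for s in sessions:
        caps = s.get("num_caps", 0)
        if caps > max_caps:
            host = s.get("client_metadata", {}).get("hostname", "?")
            over.append((caps, s.get("id"), host))
    # Sort on the caps count only, so missing ids never get compared.
    return sorted(over, key=lambda t: t[0], reverse=True)


# Sample input: the first session uses the numbers quoted above,
# the second one is made up for illustration.
sample = json.dumps([
    {"id": 172249397, "num_caps": 105417,
     "client_metadata": {"hostname": "sn272..."}},
    {"id": 1, "num_caps": 1200,
     "client_metadata": {"hostname": "example-host"}},
])

LIMIT = 65536  # mds_max_caps_per_client from the config dump above

for caps, sid, host in sessions_over_limit(sample, LIMIT):
    print(f"{host} (client.{sid}): {caps} caps, {caps - LIMIT} over the limit")
```

Running this periodically and comparing the counts would show whether a client returns any caps on recall or sits on a fixed set, which is the distinction guessed at above.
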

The strange thing now is that very few files (on ceph fs) are actually open on the client:

[USER@sn272 ~]$ lsof -u USER | grep -e /home -e /groups -e /apps | wc -l
170

The kworker thread is at about 3% CPU and should be able to release CAPS. I'm wondering why that doesn't happen, though. I also don't believe that 170 open files can allocate 105417 client caps.

Questions:

- Why does the client have so many caps allocated? Is there another way than open files that requires allocations?
- Is there a way to find out what these caps are for?
- We will look at the code (it's python+miniconda); any pointers on what to look for?

Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Frank Schilder <frans@xxxxxx>
Sent: Tuesday, October 17, 2023 11:27 AM
To: Stefan Kooman; ceph-users@xxxxxxx
Subject: Re: stuck MDS warning: Client HOST failing to respond to cache pressure

Hi Stefan,

probably. It's 2 compute nodes and there are jobs running. Our epilogue script will drop the caches, at which point I indeed expect the warning to disappear. We have no time limit on these nodes though, so this can take a while. I was hoping there was an alternative to that, say, a user-level command that I could execute on the client without possibly affecting other users' jobs.

Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Stefan Kooman <stefan@xxxxxx>
Sent: Tuesday, October 17, 2023 11:13 AM
To: Frank Schilder; ceph-users@xxxxxxx
Subject: Re: stuck MDS warning: Client HOST failing to respond to cache pressure

On 17-10-2023 09:22, Frank Schilder wrote:
> Hi all,
>
> I'm affected by a stuck MDS warning for 2 clients: "failing to respond to cache pressure". This is a false alarm as no MDS is under any cache pressure. The warning has already been stuck for a couple of days.
> I found some old threads about cases where the MDS does not update flags/triggers for this warning in certain situations. These date back to luminous and I'm probably hitting one of them.
>
> In these threads I could find a lot except for instructions for how to clear this out in a nice way. Is there something I can do on the clients to clear this warning? I don't want to evict/reboot just because of that.

echo 2 > /proc/sys/vm/drop_caches on the clients .... does that help?

Gr. Stefan
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx