I am deploying Rook 1.10.13 with Ceph 17.2.6 on our Kubernetes clusters. We use the Ceph Shared Filesystem heavily and have never faced an issue. Lately we have deployed it on Oracle Linux 9 VMs (previous/existing deployments use CentOS/RHEL 7) and we are facing the following issue.

We have 30 worker nodes running a StatefulSet with 30 replicas (one per worker node). Each pod in that StatefulSet runs a container with a Java process that waits until jobs are submitted. When a job arrives, it processes the request and writes the data into a CephFS shared filesystem. That shared filesystem is a single PVC mounted by all the pods in the StatefulSet.

The problem is that from time to time some Java processes get stuck forever when accessing the filesystem, e.g. for more than 6 hours in the thread dump below:

```
"th-0-data-writer-site" #503 [505] prio=5 os_prio=0 cpu=451.11ms elapsed=22084.19s tid=0x00007f8c3c04db10 nid=505 runnable [0x00007f8d8fdfc000]
   java.lang.Thread.State: RUNNABLE
        at sun.nio.fs.UnixNativeDispatcher.lstat0(java.base@22-ea/Native Method)
        at sun.nio.fs.UnixNativeDispatcher.lstat(java.base@22-ea/UnixNativeDispatcher.java:351)
        at sun.nio.fs.UnixFileAttributes.get(java.base@22-ea/UnixFileAttributes.java:72)
        at sun.nio.fs.UnixFileSystemProvider.implDelete(java.base@22-ea/UnixFileSystemProvider.java:274)
        at sun.nio.fs.AbstractFileSystemProvider.deleteIfExists(java.base@22-ea/AbstractFileSystemProvider.java:109)
        at java.nio.file.Files.deleteIfExists(java.base@22-ea/Files.java:1191)
        at com.x.streams.dataprovider.FileSystemDataProvider.close(FileSystemDataProvider.java:109)
        at com.x.streams.components.XDataWriter.closeWriters(XDataWriter.java:241)
        at com.x.streams.components.XDataWriter.onTerminate(XDataWriter.java:255)
        at com.x.streams.core.StreamReader.doOnTerminate(StreamReader.java:136)
        at com.x.streams.core.StreamReader.processData(StreamReader.java:112)
        at com.x.streams.core.ExecutionEngine$ProcessingThreadTask.run(ExecutionEngine.java:604)
        at java.lang.Thread.runWith(java.base@22-ea/Thread.java:1583)
        at java.lang.Thread.run(java.base@22-ea/Thread.java:1570)
```

Once the system reaches that point, it cannot be recovered until we kill the pod of the active MDS replica. `ceph health detail` shows this:

```
[root@rook-ceph-tools-75c947bc9d-ggb7m /]# ceph health detail
HEALTH_WARN 3 clients failing to respond to capability release; 1 MDSs report slow requests
[WRN] MDS_CLIENT_LATE_RELEASE: 3 clients failing to respond to capability release
    mds.ceph-filesystem-a(mds.0): Client worker45:csi-cephfs-node failing to respond to capability release client_id: 5927564
    mds.ceph-filesystem-a(mds.0): Client worker1:csi-cephfs-node failing to respond to capability release client_id: 7804133
    mds.ceph-filesystem-a(mds.0): Client worker39:csi-cephfs-node failing to respond to capability release client_id: 8391464
[WRN] MDS_SLOW_REQUEST: 1 MDSs report slow requests
    mds.ceph-filesystem-a(mds.0): 31 slow requests are blocked > 30 secs
```

Any hint about how to troubleshoot this? My intuition is that some capability releases from the shared filesystem clients never reach the MDS, and that portion of the filesystem then stays locked for good; but I am completely making this up. I'd appreciate any indications or hints on how to troubleshoot it. We have clusters running in production with almost the same configuration (apart from the OS) and everything runs fine there, but we cannot find the reason why we are getting this behavior here.
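For illustration only, here is a minimal Java sketch (class name and path are hypothetical, this is not our real application code) of how the cleanup call that hangs in the stack trace above could be offloaded to a helper thread with a timeout. If the underlying lstat() is blocked inside the CephFS kernel client it cannot be interrupted this way, but at least the data-writer thread can detect and log the stall instead of blocking forever; it does not address the underlying capability-release problem.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.*;

public class GuardedDelete {

    // Single helper thread used only for filesystem cleanup calls.
    private static final ExecutorService CLEANUP =
            Executors.newSingleThreadExecutor();

    public static void deleteWithTimeout(Path file, long timeoutSeconds) {
        // Files.deleteIfExists() is what ends up in the blocked lstat() above.
        Future<Boolean> result = CLEANUP.submit(() -> Files.deleteIfExists(file));
        try {
            result.get(timeoutSeconds, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            // The syscall is still blocked in the kernel; log it so the stall
            // is visible and can be correlated with MDS_CLIENT_LATE_RELEASE.
            System.err.printf("delete of %s stuck for more than %d seconds%n",
                    file, timeoutSeconds);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } catch (ExecutionException e) {
            throw new RuntimeException("delete failed for " + file, e.getCause());
        }
    }

    public static void main(String[] args) {
        // Hypothetical temporary output file on the shared CephFS mount.
        deleteWithTimeout(Path.of("/shared-fs/output/part-0001.tmp"), 60);
    }
}
```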