Hi All,

Putting out a call for help to see if anyone can shed some light on this.

Configuration:
Ceph cluster presenting RBDs -> XFS -> NFS -> ESXi (rough sketch of the per-gateway chain at the end of this mail)
Running 10.2.7 on the OSDs and a 4.11 kernel on the NFS gateways, which sit in a pacemaker cluster.
Both OSDs and clients go into a pair of switches in a single L2 domain (no sign from pacemaker of any network connectivity issues).

Symptoms:
- All RBDs on a single client randomly hang for 30s to several minutes, confirmed by pacemaker and by the ESXi hosts complaining
- Cluster load is minimal most of the times this happens
- All other clients with RBDs (in the same RADOS pool) are unaffected, so it seems more of a client issue than a cluster issue
- Pacemaker then tries to stop the RBD+FS resources, but this also hangs
- Eventually pacemaker succeeds in stopping the resources and immediately restarts them, and IO returns to normal
- No errors, slow requests, or any other abnormal Ceph status are reported by the cluster or in ceph.log
- Client logs show nothing apart from pacemaker

Things I've tried:
- Different kernels (it possibly happened less with older kernels, but I can't be 100% sure)
- Disabling scrubbing and anything else that could be causing high load
- Enabling kernel RBD debugging (the problem only happens maybe a couple of times a day, so debug logging was not practical as I can't reproduce it on demand; see the sketch at the end of this mail)

Anyone have any ideas?

Thanks,
Nick
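
For context, the per-gateway stack is roughly the following sequence (pool, image and mount point names below are made up for illustration, not the real ones):

    # map the RBD image on the active NFS gateway (illustrative pool/image names)
    rbd map rbd_pool/vmware_image
    # mount the XFS filesystem that sits on top of it
    mount -t xfs /dev/rbd/rbd_pool/vmware_image /mnt/vmware_image
    # export it over NFS to the ESXi hosts (client spec is a placeholder)
    exportfs -o rw,no_root_squash,sync esxi-subnet:/mnt/vmware_image

Pacemaker manages the RBD map and filesystem mount as resources, and it's the stop of those resources that hangs along with the IO.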
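
On the kernel RBD debugging point: by that I mean the dynamic debug output for the rbd and libceph modules, toggled roughly along these lines (assuming debugfs is mounted at /sys/kernel/debug):

    # turn on pr_debug output for the krbd and libceph modules
    echo 'module rbd +p'     > /sys/kernel/debug/dynamic_debug/control
    echo 'module libceph +p' > /sys/kernel/debug/dynamic_debug/control
    # ...and off again, since leaving it on floods the logs
    echo 'module rbd -p'     > /sys/kernel/debug/dynamic_debug/control
    echo 'module libceph -p' > /sys/kernel/debug/dynamic_debug/control

With the hang only striking a couple of times a day, it produces far too much output to leave enabled permanently, which is why I haven't managed to catch one in the act.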