> Hi, I encountered a problem with blocked MDS operations and a client
> becoming unresponsive. I dumped the MDS cache, ops, blocked ops and some
> further log information here:
>
> https://files.dtu.dk/u/peQSOY1kEja35BI5/2010-09-03-mds-blocked-ops?l
>
> A user of our HPC system was running a job that creates a somewhat
> stressful MDS load. This workload tends to lead to MDS warnings like "slow
> metadata ops" and "client does not respond to caps release", which usually
> disappear without intervention after a while.

We have an HPC cluster with 4K cores and 30+ (largish) servers, with 128GB to 768GB compute nodes, and have experienced similar issues.

This bug seems closely related: https://tracker.ceph.com/issues/41467
(we haven't gotten a version with that patch yet).

Upgrading to a 5.2 kernel with this commit:
3e1d0452edceebb903d23db53201013c940bf000
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3e1d0452edceebb903d23db53201013c940bf000

was capable of deadlocking the kernel when memory pressure caused the MDS to reclaim capabilities. That smells similar.

Jesper