Hi Dominik,

I assume you are talking about a CephFS problem. To identify the root cause, you will have to examine the log files of the MDS servers.

Joachim

joachim.kraftmayer@xxxxxxxxx
www.clyso.com
Hohenzollernstr. 27, 80801 Munich
Utting a. A. | HR: Augsburg | HRB: 25866 | USt. ID-Nr.: DE2754306

On Sat, 14 Sept 2024 at 14:24, dominik.baack <dominik.baack@xxxxxxxxxxxxxxxxx> wrote:

> Hi,
>
> two days ago we started upgrading our Ceph cluster, which consists of 7
> nodes, from Quincy to Reef. This included an upgrade of the underlying OS
> and several other small changes.
>
> After hitting the osd_remove_queue bug we were able to recover for the
> most part, but we are still in a non-healthy state because of changes to
> our network (see attached image). Overall we can mount the filesystem on
> all nodes and read and write.
>
> The problem now is that at least one file created by slurmctld exists
> that appears to have been corrupted in the process. It cannot be read or
> removed on any of the storage or worker nodes. All operations on it (cat,
> rm, less, ...) hang until they are forcefully terminated. I checked all
> nodes manually and could not identify any process with an open handle to
> the file.
>
> If possible we would like to unblock the file, but removing it would also
> be a possibility.
>
> Cheers
> Dominik
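
For anyone chasing the same symptom, here is a minimal sketch of the MDS-side
checks Joachim is pointing at. It assumes the ceph CLI is available and that
the daemon commands are run on the host of the active MDS; mds.<name> and the
inode number are placeholders for your own values:

    # Raise MDS log verbosity cluster-wide (20 is very verbose; revert when done)
    ceph config set mds debug_mds 20

    # On the active MDS host: look for requests stuck inside the MDS
    ceph daemon mds.<name> dump_blocked_ops
    ceph daemon mds.<name> dump_ops_in_flight

    # List client sessions; a client shown here may still hold capabilities
    # on the stuck file even if no local process has an open handle
    ceph daemon mds.<name> session ls

    # Inspect the problematic inode (get its number with 'ls -i <file>' on a
    # client; CephFS exposes its inode numbers directly)
    ceph daemon mds.<name> dump inode <inode-number>

    # Restore the default log level afterwards
    ceph config rm mds debug_mds

If "session ls" shows a client holding caps on that inode, evicting that
session (ceph tell mds.<name> client evict id=<session-id>) will usually
release the file, at the cost of forcing the client to reconnect.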