Hi,
we started upgrading our Ceph cluster of 7 nodes from Quincy to Reef
two days ago. This included an upgrade of the underlying OS and several
other small changes.
After hitting the osd_remove_queue bug we were able to mostly recover,
but the cluster is still in a non-healthy state because of changes to
our network (see attached image).
Overall, though, we can mount the filesystem on all nodes and read and
write.
The problem now is that at least one file created by slurmctld exists
which seems to have been corrupted during this process. In addition,
there are several more files that are not yet identified but are
noticeable through stuck container executions.
These files cannot be read or removed on any of the storage or worker
nodes. All operations (cat, rm, less, ...) hang until they are
forcefully terminated. I checked all nodes manually and could not
identify any process with an open handle to the files.
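Sketched, the per-node check amounts to something like the following
(the path is a placeholder, not the real file; a single active MDS at
rank 0 is assumed for the MDS-side op dump):

```shell
#!/bin/sh
# STUCK is a placeholder -- substitute the path of an affected file.
STUCK="${STUCK:-/mnt/cephfs/slurm/stuck_file}"

# 1. Is any local process holding the file open?
lsof -- "$STUCK" 2>/dev/null || echo "no local open handles for $STUCK"

# 2. Look for blocked/in-flight requests on the MDS side (needs an admin
#    keyring; mds.0 = single active MDS is an assumption here).
if command -v ceph >/dev/null 2>&1 && [ -e /etc/ceph/ceph.conf ]; then
    ceph tell mds.0 dump_blocked_ops
    ceph tell mds.0 ops
else
    echo "ceph CLI not usable on this host, skipping MDS op dump"
fi
```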
A manual deep scrub of all PGs, as well as a normal scrub, did not
reveal any further problems.
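For reference, the scrub pass boils down to roughly this (a sketch;
it only does something useful against the live cluster with an admin
keyring, and scrub results only show up once the scrubs complete):

```shell
#!/bin/sh
# Commands for the scrub pass (sketch, not a full procedure).
SCRUB_CMDS="ceph osd deep-scrub all
ceph health detail"

if command -v ceph >/dev/null 2>&1 && [ -e /etc/ceph/ceph.conf ]; then
    ceph osd deep-scrub all   # queue a deep scrub on every PG
    ceph health detail        # inconsistencies appear once scrubs finish
else
    printf '%s\n' "$SCRUB_CMDS"   # no cluster here; just list the commands
fi
```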
If possible, we would like to identify the affected files and either
unblock or remove them.
Cheers
Dominik
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx