Hi,
we started upgrading our Ceph cluster of 7 nodes from Quincy to Reef
two days ago. This included an upgrade of the underlying OS and several
other small changes.
After hitting the osd_remove_queue bug we were able to mostly recover,
but the cluster is still in a non-healthy state because of changes to
our network (see attached image).
Overall, though, we can mount the filesystem on all nodes and read and
write.
The problem now is that at least one file created by slurmctld exists
which seems to have been corrupted during this process. In addition,
there are several more files that are not yet identified but are
noticeable through stuck container executions.
These files cannot be read or removed on any of the storage or worker
nodes. All operations (cat, rm, less, ...) hang until they are
forcefully terminated. I checked all nodes manually and could not
identify any process with an open handle to the files.
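Sketched, the per-node check amounts to something like the following
(the path is a placeholder, not the real file; a single active MDS at
rank 0 is assumed for the MDS-side op dump):

```shell
#!/bin/sh
# STUCK is a placeholder -- substitute the path of an affected file.
STUCK="${STUCK:-/mnt/cephfs/slurm/stuck_file}"

# 1. Is any local process holding the file open?
lsof -- "$STUCK" 2>/dev/null || echo "no local open handles for $STUCK"

# 2. Look for blocked/in-flight requests on the MDS side (needs an admin
#    keyring; mds.0 = single active MDS is an assumption here).
if command -v ceph >/dev/null 2>&1 && [ -e /etc/ceph/ceph.conf ]; then
    ceph tell mds.0 dump_blocked_ops
    ceph tell mds.0 ops
else
    echo "ceph CLI not usable on this host, skipping MDS op dump"
fi
```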
A manual deep scrub of all PGs, as well as a normal scrub, did not
reveal any further problems.
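For reference, the scrub pass boils down to roughly this (a sketch;
it only does something useful against the live cluster with an admin
keyring, and scrub results only show up once the scrubs complete):

```shell
#!/bin/sh
# Commands for the scrub pass (sketch, not a full procedure).
SCRUB_CMDS="ceph osd deep-scrub all
ceph health detail"

if command -v ceph >/dev/null 2>&1 && [ -e /etc/ceph/ceph.conf ]; then
    ceph osd deep-scrub all   # queue a deep scrub on every PG
    ceph health detail        # inconsistencies appear once scrubs finish
else
    printf '%s\n' "$SCRUB_CMDS"   # no cluster here; just list the commands
fi
```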
If possible, we would like to identify the affected files and either
unblock or remove them.
Cheers
Dominik
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx