Hello,

We have an unusual scrub failure on one of our PGs. Ordinarily we can trigger a repair using ceph pg repair, but this time that mechanism fails to initiate a repair operation.

Looking through the logs, we found the original cause of the scrub error: a single object reporting a 'missing attr _, missing attr snapset' error. However, when we run find to locate the file, it does not physically exist on any of the three replicas. The only explanation I can think of is that a thread hit a suicide timeout whilst carrying out the write to the cluster, after the metadata had been written to leveldb but before the data could be committed to the FS. Running a rados get against the object returns an IO error (as expected).

I have repeatedly sent various commands to get the OSDs in question to do some maintenance, but they don't seem to want to do anything. I have tried restarting them and marking the primary down and out temporarily, all to no avail. I really don't want to deliberately trigger a large shuffle of data by removing a disk entirely, as it won't get reintroduced into the cluster due to the type of disk it is (SMR), and besides, I have no guarantee that doing so would change anything.

The question is: how can we trigger this to get cleaned up and take the cluster out of HEALTH_ERR?

We are running jewel (10.2.6).

Regards

Stuart Harland
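
For reference, a rough sketch of the diagnostics described in the message above, assuming FileStore OSDs on jewel; the PG id (1.2f3), pool name (rbd) and object name (myobject) are placeholders rather than values from the original post:

    # Identify the inconsistent PG and the OSDs in its acting set
    ceph health detail
    ceph pg map 1.2f3

    # Jewel can report the per-object scrub errors directly
    rados list-inconsistent-obj 1.2f3 --format=json-pretty

    # Ask the primary to re-scrub / repair the PG
    ceph pg deep-scrub 1.2f3
    ceph pg repair 1.2f3

    # On each replica's host, look for the object in the FileStore PG directory
    find /var/lib/ceph/osd/ceph-*/current/1.2f3_head/ -name '*myobject*'

    # Try to read the object back (in this case it returns an I/O error)
    rados -p rbd get myobject /tmp/myobject
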