Unusual inconsistent PG

Stuart Harland <s.harland@xxxxxxxxxxxxxxxxxxxxxx> · Thu, 6 Apr 2017 18:09:23 +0100

Hello,

We have an unusual scrub failure on one of our PGs. Ordinarily we can trigger a repair with ceph pg repair; in this case, however, the command does not cause a repair operation to start.

Looking through the logs, we found the original cause of the scrub error: a single file reporting a 'missing attr _, missing attr snapset' error. However, when we run find to locate this file, it does not physically exist on any of the three replicas.
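
For anyone wanting to repeat the search described above, it amounts to roughly the following sketch, run against the PG directory on each replica. The function name, the example path and the object-name fragment are hypothetical placeholders, not values from our cluster. It also assumes the on-disk file name contains the object name more or less verbatim, which may not hold for names with escaped characters.

```python
import os


def find_object_files(pg_dir, name_fragment):
    """Return sorted paths under pg_dir whose file name contains name_fragment."""
    matches = []
    # Walk the whole tree, since files may sit in nested subdirectories.
    for root, _dirs, files in os.walk(pg_dir):
        for fname in files:
            if name_fragment in fname:
                matches.append(os.path.join(root, fname))
    return sorted(matches)


if __name__ == "__main__":
    # Hypothetical OSD path and object-name fragment; substitute your own.
    hits = find_object_files("/var/lib/ceph/osd/ceph-0/current/1.2f_head", "myobject")
    print(hits if hits else "no matching file on this replica")
```

An empty result on every replica corresponds to what we are seeing: the scrub reports the object, but there is no file behind it.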

The only explanation I can think of is that a thread hit a suicide timeout while carrying out the write to the cluster, after the metadata had been written to leveldb but before the data could be committed to the FS. Running rados get against the file returns an I/O error (as expected).

I have repeatedly sent various commands to try to get the OSDs in question to do some maintenance, but they don’t seem to want to do anything. I have tried restarting them and temporarily marking the primary as down and out, all to no avail. I really don’t want to deliberately trigger a large shuffle of data by removing a disk entirely, as it won’t get reinjected into the cluster due to the type of disk it is (SMR). Besides, I have no guarantee that doing so would change anything.

The question is: how can we get this cleaned up and take the cluster out of HEALTH_ERR? We are running Jewel (10.2.6).

Regards

Stuart Harland
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com