Hello,
I have this error message:
2018-01-25 00:59:27.357916 7fd646ae1700 -1 osd.3 pg_epoch: 9393 pg[9.139s0( v 8799'82397 (5494'79049,8799'82397] local-lis/les=9392/9393 n=10003 ec=1478/1478 lis/c 9392/6304 les/c/f 9393/6307/807 9391/9392/9392) [3,6,12,9]/[3,6,2147483647,4] r=0 lpr=9392 pi=[6304,9392)/3 bft=9(3),12(2) crt=8799'82397 lcod 0'0 mlcod 0'0 active+undersized+degraded+remapped+backfilling] recover_replicas: object added to missing set for backfill, but is not in recovering, error!
in a 3+1 EC pool. When I enable backfills, the OSD starts dying on recovery, which makes the whole cluster flail around. And while the cluster works with this one degraded and remapped PG, not being able to switch on backfills limits my options severely.
Now, I can kind of guess why this is happening (I had to zero out some sectors on a broken hard disk to recover what was left of an already degraded EC PG, which I think I messed up by editing the CRUSH map and removing an OSD in the wrong order), but how can I fix it?
The only other occurrence of this I can find is a mailing list post that also did not get any resolution.
Since this is a cluster that's used for CephFS, and the files on it are actually recoverable from a different source, if I could find out which files the broken objects belong to, could I just delete/rewrite those files to fix the issue?
Also, how do I find out which objects are the problem? Or can I only deal with this at the level of a whole PG?
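For what it's worth, here is how I was planning to go about it — a sketch assuming the usual CephFS data-object naming (<inode-in-hex>.<stripe-index>). The object name "10000001234.00000000" below is a made-up example, and the cluster-side commands in the comments are what I believe is the right interface, not something I have verified against this cluster:

```shell
# List problem objects for the degraded pg (run against the cluster; I believe
# these are the relevant commands, but please correct me if not):
#   ceph pg 9.139 list_missing
#   rados list-inconsistent-obj 9.139 --format=json-pretty
#
# CephFS data objects are named <inode-in-hex>.<stripe-index>, so given an
# object name, the owning file can be located by its inode number:
obj="10000001234.00000000"   # hypothetical object name, for illustration only
ino_hex="${obj%%.*}"         # the part before the dot is the inode in hex
ino_dec=$((16#$ino_hex))     # convert to decimal for find(1)
echo "$ino_dec"
# Then, on a mounted cephfs:
#   find /mnt/cephfs -inum "$ino_dec"
```

If that mapping is sound, I could then delete or rewrite exactly those files instead of touching the whole PG.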
Any help resolving this issue, or any insight into how to read that log line, would be appreciated!
thanks,
Philip
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com