Re: Can't enable backfill because of "recover_replicas: object added to missing set for backfill, but is not in recovering, error!"

Gregory Farnum <gfarnum@xxxxxxxxxx> · Wed, 31 Jan 2018 18:20:37 +0000

On Wed, Jan 31, 2018 at 1:40 AM Philip Poten <philip.poten@xxxxxxxxx> wrote:
Hello,

I have this error message:

2018-01-25 00:59:27.357916 7fd646ae1700 -1 osd.3 pg_epoch: 9393 pg[9.139s0( v 8799'82397 (5494'79049,8799'82397] local-lis/les=9392/9393 n=10003 ec=1478/1478 lis/c 9392/6304 les/c/f 9393/6307/807 9391/9392/9392) [3,6,12,9]/[3,6,2147483647,4] r=0 lpr=9392 pi=[6304,9392)/3 bft=9(3),12(2) crt=8799'82397 lcod 0'0 mlcod 0'0 active+undersized+degraded+remapped+backfilling] recover_replicas: object added to missing set for backfill, but is not in recovering, error!

in a 3+1 EC pool, and when I enable backfills, the OSD starts dying on recovery, which makes the whole cluster flail around. And while the cluster works with this one degraded and remapped PG, not being able to switch on backfills limits my options severely.

Now, I can kind of guess why this is happening (I had to zero out some sectors on a broken hard disk to recover what was left of an already degraded EC PG that I messed up by editing the CRUSH map and removing an OSD in the wrong order, I think), but how can I fix it?

The only other instance of this I can find is a list post that also did not get any resolution.

Since this is a cluster that's used for CephFS, and the files on it are actually recoverable from a different source, if I could find out which files the broken objects belong to, could I just delete/rewrite those files to fix that issue?

Also, how do I find out which objects are the problem? Or can I only deal with this in terms of a whole PG?

The line prior to what you pasted should have the object name in it.
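As a rough illustration of that suggestion, here is a minimal Python sketch that pairs each occurrence of the error with the log line immediately before it. The helper names are hypothetical, and the exact format of that preceding line is not shown in this thread, so the sketch only returns it verbatim for manual inspection. The second helper assumes the usual CephFS data-pool naming, where an object is called `<inode in hex>.<block index as 8 hex digits>`; check that against the object name you actually find.

```python
import re

# Marker text taken from the error message quoted above.
ERROR_MARKER = "object added to missing set for backfill, but is not in recovering"


def lines_before_error(log_lines, marker=ERROR_MARKER):
    """Return (previous_line, error_line) pairs for every line containing marker.

    previous_line is None if the error is the very first line seen.
    """
    pairs = []
    previous = None
    for line in log_lines:
        if marker in line:
            pairs.append((previous, line))
        previous = line
    return pairs


def cephfs_inode_from_object(object_name):
    """Split a name like '10000000abc.00000000' into (inode, block_index).

    Assumption: CephFS data objects are named '<hex inode>.<8 hex digit block>'.
    """
    match = re.fullmatch(r"([0-9a-f]+)\.([0-9a-f]{8})", object_name)
    if match is None:
        raise ValueError("not a CephFS-style object name: %r" % object_name)
    return int(match.group(1), 16), int(match.group(2), 16)


if __name__ == "__main__":
    import sys

    # Usage: python this_script.py /path/to/osd.log   (path is an example)
    with open(sys.argv[1], errors="replace") as handle:
        for before, error in lines_before_error(handle):
            print("BEFORE:", (before or "").rstrip())
            print("ERROR: ", error.rstrip())
```

With the decimal inode number in hand, something like `find <cephfs mountpoint> -inum <inode>` should locate the file on a mounted filesystem, assuming the inode still has a directory entry.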

The OSD is complaining because this object is missing on the replica even though backfill has already passed the point where the object should have been recovered to it, and the object is not in the list of those being recovered (via log-based recovery, which is a different mechanism). The OSD doesn't like that inconsistent state. I'm not sure what the best way to resolve it would be, though. :/
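For reading the log line itself, the following Python sketch pulls out the fields most relevant here. It is an interpretation under stated assumptions rather than an authoritative parser: it assumes the `[...]/[...]` pair is the up set followed by the acting set, that `bft=` lists backfill targets as `osd(shard)`, and that 2147483647 (0x7fffffff) is the placeholder for "no OSD" in a slot. The function name and the returned keys are made up for illustration.

```python
import re

# Assumption: 0x7fffffff marks a slot with no OSD assigned.
CRUSH_ITEM_NONE = 0x7FFFFFFF


def parse_pg_state(line):
    """Extract PG id, shard, up/acting sets and backfill targets from a PG log line."""
    pg = re.search(r"pg\[(\d+\.[0-9a-f]+)(?:s(\d+))?\(", line)
    sets = re.search(r"\[([\d,]+)\]/\[([\d,]+)\]", line)
    bft = re.search(r"bft=([\d(),]+)", line)
    if pg is None or sets is None:
        raise ValueError("line does not look like a PG state line")

    up = [int(x) for x in sets.group(1).split(",")]
    acting = [int(x) for x in sets.group(2).split(",")]
    targets = []
    if bft is not None:
        # Each target is printed as osd(shard), e.g. 12(2).
        targets = [(int(o), int(s)) for o, s in re.findall(r"(\d+)\((\d+)\)", bft.group(1))]

    return {
        "pgid": pg.group(1),
        "shard": int(pg.group(2)) if pg.group(2) is not None else None,
        "up": up,
        "acting": acting,
        # Positions in the acting set that currently have no OSD.
        "empty_acting_slots": [i for i, osd in enumerate(acting) if osd == CRUSH_ITEM_NONE],
        "backfill_targets": targets,
    }


LINE = (
    "osd.3 pg_epoch: 9393 pg[9.139s0( v 8799'82397 (5494'79049,8799'82397] "
    "local-lis/les=9392/9393 n=10003 ec=1478/1478 lis/c 9392/6304 "
    "les/c/f 9393/6307/807 9391/9392/9392) [3,6,12,9]/[3,6,2147483647,4] r=0 "
    "lpr=9392 pi=[6304,9392)/3 bft=9(3),12(2) crt=8799'82397 lcod 0'0 mlcod 0'0 "
    "active+undersized+degraded+remapped+backfilling]"
)

if __name__ == "__main__":
    for key, value in parse_pg_state(LINE).items():
        print(key, "=", value)
```

Read this way, the quoted line describes PG 9.139 with nothing in acting slot 2, and with osd.12 (shard 2) and osd.9 (shard 3) as the backfill targets, which matches the up set `[3,6,12,9]`.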

Any help to resolve this issue or insight about how to read that logline would be appreciated!

thanks,
Philip
_______________________________________________

ceph-users mailing list

ceph-users@xxxxxxxxxxxxxx

http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
