Re: Can't enable backfill because of "recover_replicas: object added to missing set for backfill, but is not in recovering, error!"

Gregory Farnum <gfarnum@xxxxxxxxxx> · Fri, 02 Feb 2018 17:32:16 +0000

On Wed, Jan 31, 2018 at 9:01 PM Philip Poten <philip.poten@xxxxxxxxx> wrote:
2018-01-31 19:20 GMT+01:00 Gregory Farnum <gfarnum@xxxxxxxxxx>:
On Wed, Jan 31, 2018 at 1:40 AM Philip Poten <philip.poten@xxxxxxxxx> wrote:
Hello,

I have this error message:

2018-01-25 00:59:27.357916 7fd646ae1700 -1 osd.3 pg_epoch: 9393 pg[9.139s0( v 8799'82397 (5494'79049,8799'82397] local-lis/les=9392/9393 n=10003 ec=1478/1478 lis/c 9392/6304 les/c/f 9393/6307/807 9391/9392/9392) [3,6,12,9]/[3,6,2147483647,4] r=0 lpr=9392 pi=[6304,9392)/3 bft=9(3),12(2) crt=8799'82397 lcod 0'0 mlcod 0'0 active+undersized+degraded+remapped+backfilling] recover_replicas: object added to missing set for backfill, but is not in recovering, error!

The line prior to what you pasted should have the object name in it.

Ok, so since I was unable to find this information with the usual methods, I'll follow up on this:

    -2> 2018-02-01 04:27:27.414658 7f0c7a521700 -1 osd.3 pg_epoch: 10329 pg[9.139s0( v 10329'83100 (5494'79049,10329'83100] local-lis/les=10321/10322 n=9979 ec=1478/1478 lis/c 10321/6304 les/c/f 10322/6307/807 10318/10321/10318) [3,6,12,9]/[3,6,2147483647,4] r=0 lpr=10321 pi=[6304,10321)/3 bft=9(3),12(2) crt=10329'83099 lcod 10329'83099 mlcod 10329'83099 active+undersized+degraded+remapped+backfilling] recover_replicas: object 9:9ccec3b7:::1000021235e.000008dc:head last_backfill 9:9ccec1a8:::100000f6bd9.000001a3:head
    -1> 2018-02-01 04:27:27.414774 7f0c7a521700 -1 osd.3 pg_epoch: 10329 pg[9.139s0( v 10329'83100 (5494'79049,10329'83100] local-lis/les=10321/10322 n=9979 ec=1478/1478 lis/c 10321/6304 les/c/f 10322/6307/807 10318/10321/10318) [3,6,12,9]/[3,6,2147483647,4] r=0 lpr=10321 pi=[6304,10321)/3 bft=9(3),12(2) crt=10329'83099 lcod 10329'83099 mlcod 10329'83099 active+undersized+degraded+remapped+backfilling] recover_replicas: object added to missing set for backfill, but is not in recovering, error!
     0> 2018-02-01 04:27:27.421623 7f0c7a521700 -1 *** Caught signal (Aborted) **
 in thread 7f0c7a521700 thread_name:tp_osd_tp 
 ceph version 12.2.2 (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous (stable)

So apparently, "recover_replicas: object 9:9ccec3b7:::1000021235e.000008dc:head last_backfill 9:9ccec1a8:::100000f6bd9.000001a3:head" is the interesting part. But what exactly does it mean?

First, I looked at the contents of the pool to see what the keys look like (rados -p cephfs-data ls). From that I figured the object key isn't the whole string from the log, only the something.something part. Then I tried retrieving both objects manually:

root@lxt-prod-ceph-mon02:~# rados -p cephfs-data get "1000021235e.000008dc" foo
error getting cephfs-data/1000021235e.000008dc: (5) Input/output error
root@lxt-prod-ceph-mon02:~# rados -p cephfs-data get "100000f6bd9.000001a3" foo
root@lxt-prod-ceph-mon02:~# 

Which suggested that the 08dc key is indeed the culprit, and 01a3 is probably just the last object that was backfilled (?). It also told me that the offending object was luckily part of the cephfs-data pool, not the cephfs-metadata pool. Phew.
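
For anyone who wants to pull the object names out of such a log line mechanically, here is a minimal sketch. It assumes the identifier is six colon-separated fields (pool id, hash, namespace, locator key, object name, snapshot), which fits the line above but is my reading of it, not something confirmed in this thread. Names containing escaped colons are not handled, and parse_hobject and parse_recover_line are made-up helper names.

```python
import re

# Matches "recover_replicas: object <id> last_backfill <id>" in an OSD log line.
LINE_RE = re.compile(r"recover_replicas: object (\S+) last_backfill (\S+)")


def parse_hobject(text):
    """Split 'pool:hash:namespace:key:name:snap' into a dict (assumed layout)."""
    parts = text.split(":")
    if len(parts) != 6:
        raise ValueError(f"unexpected object identifier: {text!r}")
    pool, hash_hex, namespace, key, name, snap = parts
    return {
        "pool": int(pool),
        "hash": hash_hex,
        "namespace": namespace,
        "key": key,
        "name": name,
        "snap": snap,
    }


def parse_recover_line(line):
    """Return (object, last_backfill) dicts, or None if the line does not match."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    return parse_hobject(match.group(1)), parse_hobject(match.group(2))


line = (
    "recover_replicas: object 9:9ccec3b7:::1000021235e.000008dc:head "
    "last_backfill 9:9ccec1a8:::100000f6bd9.000001a3:head"
)
obj, last_backfill = parse_recover_line(line)
print(obj["pool"], obj["name"])
print(last_backfill["pool"], last_backfill["name"])
```

The "name" field is what gets passed to rados get/rm, and the leading pool id (9 here) matches the pool part of the PG id 9.139.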

So I tried removing the offending key:

root@lxt-prod-ceph-mon02:~# rados -p cephfs-data rm "1000021235e.000008dc" 
root@lxt-prod-ceph-mon02:~# 

and restarted backfills. And wouldn't you believe it, it restarted backfilling without the OSD crashing!

I haven't found a way to find the cephfs file the object belongs to yet, so if you can guide me with this, please let me know. But I'm sure sooner or later it will make itself known anyway when someone attempts to read it *cough*.

The object is named after the inode number in hex (1000021235e), followed by the index of the object within the file (also in hex, starting from 0; here 000008dc).
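
As a rough illustration of that naming scheme, the sketch below splits an object name into the inode number and the object index and rebuilds the name of the zeroth object. The helper names are invented for this example, and the eight-digit zero padding of the index is taken from the names seen in this thread.

```python
def decode_object_name(name):
    """Split '<inode hex>.<index hex>' into (inode number, object index)."""
    ino_hex, index_hex = name.split(".")
    return int(ino_hex, 16), int(index_hex, 16)


def encode_object_name(ino, index):
    """Rebuild the object name: inode in hex, index as 8 zero-padded hex digits."""
    return f"{ino:x}.{index:08x}"


ino, index = decode_object_name("1000021235e.000008dc")
print(hex(ino), index)
print(encode_object_name(ino, 0))  # name of the zeroth object of the same file
```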
If you look at the zeroth object in that file, it will have a backtrace xattr which contains an encoded version of the file path. You can use the ceph-dencoder tool to look at the real data if you have to, but just dumping it as ASCII should get you there.
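
A sketch of how that lookup could be scripted, under some assumptions not stated above: that the backtrace lives in an xattr named "parent" on the zeroth object, and that ceph-dencoder knows the type inode_backtrace_t. The pool name cephfs-data is the one from this thread; parent.bin and the function names are placeholders. Only the command construction runs here; dump_backtrace needs a live cluster.

```python
import subprocess


def backtrace_commands(pool, ino_hex, out_path="parent.bin"):
    """Build the two command lines: fetch the xattr, then decode it."""
    first_object = f"{ino_hex}.00000000"  # zeroth object of the file
    # Assumption: the backtrace xattr is called "parent".
    fetch = ["rados", "-p", pool, "getxattr", first_object, "parent"]
    decode = [
        "ceph-dencoder", "type", "inode_backtrace_t",
        "import", out_path, "decode", "dump_json",
    ]
    return fetch, decode


def dump_backtrace(pool, ino_hex, out_path="parent.bin"):
    """Run both commands and return the decoded JSON text (needs a cluster)."""
    fetch, decode = backtrace_commands(pool, ino_hex, out_path)
    with open(out_path, "wb") as fh:
        subprocess.run(fetch, stdout=fh, check=True)
    result = subprocess.run(decode, capture_output=True, text=True, check=True)
    return result.stdout


fetch, decode = backtrace_commands("cephfs-data", "1000021235e")
print(" ".join(fetch))
print(" ".join(decode))
```

The decoded output should list the ancestor directory entries of the inode, from which the path can be pieced together.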
-Greg

Thanks for your very helpful hint Greg!

Philip

PS: it was one damaged object that prevented me from moving the last degraded pg completely off a broken harddisk and prevented the whole cluster from being maintainable... that really shouldn't happen in my opinion.

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com