PG badly corrupted after merging PGs on mixed FileStore/BlueStore setup

Paul Emmerich <paul.emmerich@xxxxxxxx> · Wed, 23 Oct 2019 23:03:31 +0200

Hi,

I'm working on a curious case that looks like a bug in PG merging,
possibly related to FileStore.

The setup is 14.2.1, half BlueStore and half FileStore (it is being
migrated). After the number of PGs on an RGW index pool was reduced,
one of the PGs (on 3 FileStore OSDs) seems to be corrupted. 29 objects
are affected (~20% of the PG). This is what the issue looks like for
one of the affected objects, which I'll call dir.A here:

# object seems to exist according to rados
rados -p default.rgw.buckets.index ls | grep .dir.A
.dir.A

# or doesn't it?
rados -p default.rgw.buckets.index get .dir.A -
error getting default.rgw.buckets.index/.dir.A: (2) No such file or directory
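
To find every object that shows this list-but-cannot-read behaviour, the
check above can be looped over the whole pool. This is only a rough sketch:
the function name find_ghost_objects is made up for illustration, and it
assumes that "rados stat" exits non-zero for the affected objects in the same
way that "rados get" fails.

```shell
#!/bin/sh
# Hypothetical helper (not part of Ceph): print the names of objects that
# "rados ls" reports but "rados stat" cannot find.
# Usage: find_ghost_objects <pool>
find_ghost_objects() {
    pool="$1"
    rados -p "$pool" ls | while IFS= read -r obj; do
        # Assumption: stat fails (non-zero exit) for the broken objects.
        # stdin is redirected so the inner call cannot consume the object list.
        if ! rados -p "$pool" stat "$obj" >/dev/null 2>&1 </dev/null; then
            printf '%s\n' "$obj"
        fi
    done
}
```

Calling find_ghost_objects default.rgw.buckets.index would then print one
object name per line for each listed object that cannot be stat'ed.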

Running deep-scrub reports that everything is okay with the affected PG.

This is what the OSD logs when trying to access the object; there is
nothing really relevant even with debug 20:

10 osd.57 pg_epoch: 1149030 pg[18.2( v 1148996'1422066
(1144429'1418988,1148996'1422066] local-lis/les=1149021/1149022 n=135
ec=49611/596 lis/c 1149021/1149021 les/c/f 1149022/1149022/0
1149015/1149021/1149021) [57,0,31] r=0 lpr=1149021 crt=1148996'1422066
lcod 1148996'1422065 mlcod 0'0 active+clean] get_object_context: no
obc for soid 18:764060e4:::.dir.A:head and !can_create

So going one level deeper with ceph-objectstore-tool:
# --op list
(29 messages like this)
error getting default.rgw.buckets.index/.dir.A: (2) No such file or directory
followed by the complete output of the JSON for the objects, including
the broken ones

# .dir.A dump
dump
Error stat on : 18.2_head,#18:73996afb:::.dir.A:head#, (2) No such
file or directory
Error getting snapset on : 18.2_head,#18:73996afb:::.dir.A:head#, (2)
No such file or directory
{
    "id": {
        "oid": ".dir.A",
        "key": "",
        "snapid": -2,
        "hash": 3746994638,
        "max": 0,
        "pool": 18,
        "namespace": "",
        "max": 0
    }
}

# --op export
stops after encountering a bad object with 'export_files error -2'

This is the same for all 3 OSDs in that PG.
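
Since --op export aborts at the first bad object, the broken objects can be
enumerated by running dump on each entry from --op list instead. The sketch
below makes several assumptions: the OSD is stopped, the data path is supplied
by the caller, and a failing dump prints lines starting with "Error " like the
ones quoted above. The function name and the COT variable are made up for
illustration; FileStore OSDs may need extra options such as --journal-path.

```shell
#!/bin/sh
# Hypothetical helper: print the JSON identifier of every object in a PG
# whose "dump" emits an "Error ..." line. Run only against a stopped OSD.
# COT may be overridden to point at a different binary.
COT="${COT:-ceph-objectstore-tool}"

list_undumpable_objects() {
    data_path="$1"   # OSD data directory, supplied by the caller
    pgid="$2"        # PG id, e.g. 18.2
    # --op list prints one JSON array per object; keep only those lines
    # and discard any error messages mixed into the output.
    "$COT" --data-path "$data_path" --op list --pgid "$pgid" 2>/dev/null |
    grep '^\[' |
    while IFS= read -r obj_json; do
        # Assumption: a broken object produces lines beginning with "Error ".
        if "$COT" --data-path "$data_path" "$obj_json" dump 2>&1 </dev/null |
            grep -q '^Error '; then
            printf '%s\n' "$obj_json"
        fi
    done
}
```

Running this with the same data path and PG id on each of the 3 OSDs and
diffing the results would show whether the same set of objects is broken
everywhere.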

Has anyone encountered something similar? I'll probably just nuke the
affected bucket indices tomorrow and re-create them.

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90