On Wed, Oct 23, 2019 at 11:27 PM Sage Weil <sage@xxxxxxxxxxxx> wrote:
>
> On Wed, 23 Oct 2019, Paul Emmerich wrote:
> > Hi,
> >
> > I'm working on a curious case that looks like a bug in PG merging,
> > maybe related to FileStore.
> >
> > Setup is 14.2.1, half BlueStore, half FileStore (being migrated). The
> > number of PGs on an RGW index pool was reduced, and now one of the PGs
> > (3 FileStore OSDs) seems to be corrupted. 29 objects are affected
> > (~20% of the PG); the issue looks like this for one of them, which
> > I'll call .dir.A here:
> >
> > # object seems to exist according to rados
> > rados -p default.rgw.buckets.index ls | grep .dir.A
> > .dir.A
> >
> > # or doesn't it?
> > rados -p default.rgw.buckets.index get .dir.A -
> > error getting default.rgw.buckets.index/.dir.A: (2) No such file or directory
> >
> > Running deep-scrub reports that everything is okay with the affected PG.
>
> My guess is that the actual file is not in the right directory hash level.
> Did you look at the underlying file system to see if it is clearly out of
> place with the other objects?

The PG is tiny with only ~150 files, so they aren't split into
subdirectories; the broken object is right there next to all the working
objects.

> Also, I'm curious if all of the replicas are similarly affected? What
> happens if you move the primary to one of the other replicas (e.g., via
> ceph osd primary-affinity) and try reading it then?

Yes, I've tried all 3 replicas, same problem :(

Paul

>
> s
>
> >
> > This is what the OSD logs when trying to access it, nothing really
> > relevant with debug 20:
> >
> > 10 osd.57 pg_epoch: 1149030 pg[18.2( v 1148996'1422066
> > (1144429'1418988,1148996'1422066] local-lis/les=1149021/1149022 n=135
> > ec=49611/596 lis/c 1149021/1149021 les/c/f 1149022/1149022/0
> > 1149015/1149021/1149021) [57,0,31] r=0 lpr=1149021 crt=1148996'1422066
> > lcod 1148996'1422065 mlcod 0'0 active+clean] get_object_context: no
> > obc for soid 18:764060e4:::.dir.A:head and !can_create
> >
> > So going one level deeper with ceph-objectstore-tool:
> >
> > # --op list
> > (29 messages like this)
> > error getting default.rgw.buckets.index/.dir.A: (2) No such file or directory
> > followed by a complete output of the JSON for the objects, including
> > the broken ones
> >
> > # .dir.A dump
> > dump
> > Error stat on : 18.2_head,#18:73996afb:::.dir.A:head#, (2) No such
> > file or directory
> > Error getting snapset on : 18.2_head,#18:73996afb:::.dir.A:head#, (2)
> > No such file or directory
> > {
> >     "id": {
> >         "oid": ".dir.A",
> >         "key": "",
> >         "snapid": -2,
> >         "hash": 3746994638,
> >         "max": 0,
> >         "pool": 18,
> >         "namespace": "",
> >         "max": 0
> >     }
> > }
> >
> > # --op export
> > stops after encountering a bad object with 'export_files error -2'
> >
> > This is the same for all 3 OSDs in that PG.
> >
> > Has anyone encountered something similar? I'll probably just nuke the
> > affected bucket indices tomorrow and re-create them.
> >
> > Paul
> >
> > --
> > Paul Emmerich
> >
> > Looking for help with your Ceph cluster? Contact us at https://croit.io
> >
> > croit GmbH
> > Freseniusstr. 31h
> > 81247 München
> > www.croit.io
> > Tel: +49 89 1896585 90
> > _______________________________________________
> > ceph-users mailing list -- ceph-users@xxxxxxx
> > To unsubscribe send an email to ceph-users-leave@xxxxxxx
> >
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
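
For reference, Sage's directory-hash check can be done straight on the
FileStore OSD's file system. This is only a sketch: the mount path and the
OSD id (osd.57, taken from the acting set [57,0,31] in the log above) are
assumptions and may differ on other setups.

# On the node hosting osd.57: FileStore keeps PG 18.2's objects under
# current/18.2_head, and with only ~150 objects there are no hashed
# subdirectories, so a plain find shows whether the file is present and
# where it sits relative to the healthy objects:
find /var/lib/ceph/osd/ceph-57/current/18.2_head -name '*dir.A*' -ls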
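
The primary-affinity experiment Sage suggests looks roughly like this
(a sketch; the OSD ids come from the acting set above, and the weight is
restored afterwards):

# Make osd.57 ineligible as primary so osd.0 or osd.31 serves the read:
ceph osd primary-affinity osd.57 0
rados -p default.rgw.buckets.index get .dir.A -

# Restore the default once done:
ceph osd primary-affinity osd.57 1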
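
And the ceph-objectstore-tool steps referenced above, spelled out for
completeness. Treat this as a sketch: the data path is the usual default
and the systemd unit name is an assumption; the OSD must be stopped while
the tool runs.

# Stop the OSD so ceph-objectstore-tool gets exclusive access to the store:
systemctl stop ceph-osd@57

# List all objects in the affected PG (this is where the 29 errors appear):
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-57 --pgid 18.2 --op list

# Export the whole PG; this aborts with 'export_files error -2' when it
# hits one of the broken objects:
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-57 --pgid 18.2 \
    --op export --file /tmp/pg18.2.export

systemctl start ceph-osd@57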