Hi, thanks for the answer.

On Thu, Mar 07, 2019 at 07:48:59PM -0800, David Zafman wrote:
> See what results you get from this command.
>
> # rados list-inconsistent-snapset 2.2bb --format=json-pretty
>
> You might see this, so nothing interesting. If you don't get json, then
> re-run a scrub again.
>
> {
>     "epoch": ######,
>     "inconsistents": []
> }

# rados list-inconsistent-snapset 2.2bb --format=json-pretty
{
    "epoch": 485065,
    "inconsistents": [
        {
            "name": "rbd_data.dfd5e2235befd0.000000000001c299",
            "nspace": "",
            "locator": "",
            "snap": 326022,
            "errors": [
                "headless"
            ]
        },
        {
            "name": "rbd_data.dfd5e2235befd0.000000000001c299",
            "nspace": "",
            "locator": "",
            "snap": "head",
            "snapset": {
                "snap_context": {
                    "seq": 327360,
                    "snaps": []
                },
                "head_exists": 1,
                "clones": []
            },
            "errors": [
                "extra_clones"
            ],
            "extra clones": [
                326022
            ]
        }
    ]
}

> I don't think you need to do the remove-clone-metadata because you got
> "unexpected clone" so I think you'd get "Clone 326022 not present"
>
> I think you need to remove the clone object from osd.12 and osd.80. For
> example:
>
> # ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12/
> --journal-path /dev/sdXX --op list rbd_data.dfd5e2235befd0.000000000001c299
>
> ["2.2bb",{"oid":"rbd_data.dfd5e2235befd0.000000000001c299","key":"","snapid":-2,"hash":########,"max":0,"pool":2,"namespace":"","max":0}]
> ["2.2bb",{"oid":"rbd_data.dfd5e2235befd0.000000000001c299","key":"","snapid":326022,"hash":#########,"max":0,"pool":2,"namespace":"","max":0}]
>
> Use the json for snapid 326022 to remove it.
>
> # ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12/
> --journal-path /dev/sdXX
> '["2.2bb",{"oid":"rbd_data.dfd5e2235befd0.000000000001c299","key":"","snapid":326022,"hash":#########,"max":0,"pool":2,"namespace":"","max":0}]'
> remove

# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-80/ --journal-path /dev/sda1 --op list rbd_data.dfd5e2235befd0.000000000001c299 --pgid 2.2bb
["2.2bb",{"oid":"rbd_data.dfd5e2235befd0.000000000001c299","key":"","snapid":326022,"hash":3420345019,"max":0,"pool":2,"namespace":"","max":0}]
["2.2bb",{"oid":"rbd_data.dfd5e2235befd

I added --pgid 2.2bb because the list was taking too long to finish.

# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-80/ --journal-path /dev/sda1 '["2.2bb",{"oid":"rbd_data.dfd5e2235befd0.000000000001c299","key":"","snapid":326022,"hash":3420345019,"max":0,"pool":2,"namespace":"","max":0}]' remove
remove #2:dd4a7bd3:::rbd_data.dfd5e2235befd0.000000000001c299:4f986#

osd.12 was slightly different because it is bluestore (no --journal-path):

# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12/ --op list rbd_data.dfd5e2235befd0.000000000001c299 --pgid 2.2bb
["2.2bb",{"oid":"rbd_data.dfd5e2235befd0.000000000001c299","key":"","snapid":326022,"hash":3420345019,"max":0,"pool":2,"namespace":"","max":0}]
["2.2bb",{"oid":"rbd_data.dfd5e2235befd0.000000000001c299","key":"","snapid":-2,"hash":3420345019,"max":0,"pool":2,"namespace":"","max":0}]

# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12/ '["2.2bb",{"oid":"rbd_data.dfd5e2235befd0.000000000001c299","key":"","snapid":326022,"hash":3420345019,"max":0,"pool":2,"namespace":"","max":0}]' remove
remove #2:dd4a7bd3:::rbd_data.dfd5e2235befd0.000000000001c299:4f986#

But nothing changed, so I tried to repair the PG again, and osd.36 now logs:

2019-03-08 09:09:11.786038 7f920c40d700 -1 log_channel(cluster) log [ERR] : 2.2bb shard 36 soid 2:dd4a7bd3:::rbd_data.dfd5e2235befd0.000000000001c299:4f986 : candidate size 0 info size 4194304 mismatch
2019-03-08 09:09:11.786041 7f920c40d700 -1 log_channel(cluster) log [ERR] : 2.2bb soid 2:dd4a7bd3:::rbd_data.dfd5e2235befd0.000000000001c299:4f986 : failed to pick suitable object info
2019-03-08 09:09:11.786182 7f920c40d700 -1 log_channel(cluster) log [ERR] : repair 2.2bb 2:dd4a7bd3:::rbd_data.dfd5e2235befd0.000000000001c299:4f986 : on disk size (0) does not match object info size (4194304) adjusted for ondisk to (4194304)
2019-03-08 09:09:11.786191 7f920c40d700 -1 log_channel(cluster) log [ERR] : repair 2.2bb 2:dd4a7bd3:::rbd_data.dfd5e2235befd0.000000000001c299:4f986 : is an unexpected clone
2019-03-08 09:09:11.786213 7f920c40d700 -1 osd.36 pg_epoch: 485254 pg[2.2bb( v 485253'15080921 (485236'15079373,485253'15080921] local-lis/les=485251/485252 n=3836 ec=38/38 lis/c 485251/485251 les/c/f 485252/485252/0 485251/485251/484996) [36,12,80] r=0 lpr=485251 crt=485253'15080921 lcod 485252'15080920 mlcod 485252'15080920 active+clean+scrubbing+deep+inconsistent+repair snaptrimq=[5022c~1,50230~1]] _scan_snaps no clone_snaps for 2:dd4a7bd3:::rbd_data.dfd5e2235befd0.000000000001c299:4f986 in 4fec0=[]:{}

And:

# rados list-inconsistent-snapset 2.2bb --format=json-pretty
{
    "epoch": 485251,
    "inconsistents": []
}

Now I have:

HEALTH_ERR 5 scrub errors; Possible data damage: 1 pg inconsistent
OSD_SCRUB_ERRORS 5 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent
    pg 2.2bb is active+clean+inconsistent, acting [36,12,80]

The scrub error count jumped from 3 to 5 after this repair attempt, and the clone is now reported on the primary, osd.36, where I did not remove it.

Any clues?

Thanks again,
--
Herbert
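
P.S. For anyone cross-checking the logs against the JSON: the hex suffix on the object name in the OSD log (4f986) is the same snap ID that list-inconsistent-snapset prints in decimal (326022), and the 4fec0 in the _scan_snaps line matches the snap_context seq 327360. Below is a minimal Python sketch for making that comparison. It is my own helper, not part of Ceph; the function names (snap_hex, summarize_snapset) are made up for illustration, and it only assumes the JSON layout shown in the first list-inconsistent-snapset output in this mail (trimmed here to the fields it reads).

```python
import json

# Trimmed copy of the first `rados list-inconsistent-snapset` output from this mail.
SAMPLE = """
{
    "epoch": 485065,
    "inconsistents": [
        {
            "name": "rbd_data.dfd5e2235befd0.000000000001c299",
            "snap": 326022,
            "errors": ["headless"]
        },
        {
            "name": "rbd_data.dfd5e2235befd0.000000000001c299",
            "snap": "head",
            "errors": ["extra_clones"],
            "extra clones": [326022]
        }
    ]
}
"""


def snap_hex(snap):
    """Render a snap ID the way it appears in the OSD log (lowercase hex).

    Non-numeric values such as "head" are passed through unchanged.
    """
    if isinstance(snap, int):
        return format(snap, "x")
    return str(snap)


def summarize_snapset(report_text):
    """Return one (name, snap, errors, extra_clones) tuple per inconsistent entry."""
    report = json.loads(report_text)
    rows = []
    for entry in report.get("inconsistents", []):
        rows.append((
            entry["name"],
            snap_hex(entry["snap"]),
            list(entry.get("errors", [])),
            # The key really is "extra clones" (with a space) in the output above.
            [snap_hex(s) for s in entry.get("extra clones", [])],
        ))
    return rows


if __name__ == "__main__":
    for name, snap, errors, extra in summarize_snapset(SAMPLE):
        print(f"{name}:{snap} errors={','.join(errors)} extra_clones={extra}")
```

To use it on live output, pipe the JSON from the rados command into a file and pass its contents to summarize_snapset instead of SAMPLE.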