Re: obj_size_info_mismatch error handling

Brad Hubbard <bhubbard@xxxxxxxxxx> · Wed, 1 May 2019 11:01:16 +1000

On Wed, May 1, 2019 at 10:54 AM Brad Hubbard <bhubbard@xxxxxxxxxx> wrote:
>
> Which size is correct?

Sorry, accidental discharge =D

If the object info size is *incorrect* try forcing a write to the OI
with something like the following.

1. rados -p [name_of_pool_17] setomapval 10008536718.00000000
temporary-key anything
2. ceph pg deep-scrub 17.2b9
3. Wait for the scrub to finish
4. rados -p [name_of_pool_2] rmomapkey 10008536718.00000000 temporary-key

If the object info size is *correct* you could try just doing a rados
get followed by a rados put of the object to see if the size is
updated correctly.

It's more likely the object info size is wrong IMHO.

>
> On Tue, Apr 30, 2019 at 1:06 AM Reed Dier <reed.dier@xxxxxxxxxxx> wrote:
> >
> > Hi list,
> >
> > Woke up this morning to two PG's reporting scrub errors, in a way that I haven't seen before.
> >
> > $ ceph versions
> > {
> >     "mon": {
> >         "ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic (stable)": 3
> >     },
> >     "mgr": {
> >         "ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic (stable)": 3
> >     },
> >     "osd": {
> >         "ceph version 13.2.4 (b10be4d44915a4d78a8e06aa31919e74927b142e) mimic (stable)": 156
> >     },
> >     "mds": {
> >         "ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic (stable)": 2
> >     },
> >     "overall": {
> >         "ceph version 13.2.4 (b10be4d44915a4d78a8e06aa31919e74927b142e) mimic (stable)": 156,
> >         "ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic (stable)": 8
> >     }
> > }
> >
> >
> > OSD_SCRUB_ERRORS 8 scrub errors
> > PG_DAMAGED Possible data damage: 2 pgs inconsistent
> >     pg 17.72 is active+clean+inconsistent, acting [3,7,153]
> >     pg 17.2b9 is active+clean+inconsistent, acting [19,7,16]
> >
> >
> > Here is what $rados list-inconsistent-obj 17.2b9 --format=json-pretty yields:
> >
> > {
> >     "epoch": 134582,
> >     "inconsistents": [
> >         {
> >             "object": {
> >                 "name": "10008536718.00000000",
> >                 "nspace": "",
> >                 "locator": "",
> >                 "snap": "head",
> >                 "version": 0
> >             },
> >             "errors": [],
> >             "union_shard_errors": [
> >                 "obj_size_info_mismatch"
> >             ],
> >             "shards": [
> >                 {
> >                     "osd": 7,
> >                     "primary": false,
> >                     "errors": [
> >                         "obj_size_info_mismatch"
> >                     ],
> >                     "size": 5883,
> >                     "object_info": {
> >                         "oid": {
> >                             "oid": "10008536718.00000000",
> >                             "key": "",
> >                             "snapid": -2,
> >                             "hash": 1752643257,
> >                             "max": 0,
> >                             "pool": 17,
> >                             "namespace": ""
> >                         },
> >                         "version": "134599'448331",
> >                         "prior_version": "134599'448330",
> >                         "last_reqid": "client.1580931080.0:671854",
> >                         "user_version": 448331,
> >                         "size": 3505,
> >                         "mtime": "2019-04-28 15:32:20.003519",
> >                         "local_mtime": "2019-04-28 15:32:25.991015",
> >                         "lost": 0,
> >                         "flags": [
> >                             "dirty",
> >                             "data_digest",
> >                             "omap_digest"
> >                         ],
> >                         "truncate_seq": 899,
> >                         "truncate_size": 0,
> >                         "data_digest": "0xf99a3bd3",
> >                         "omap_digest": "0xffffffff",
> >                         "expected_object_size": 0,
> >                         "expected_write_size": 0,
> >                         "alloc_hint_flags": 0,
> >                         "manifest": {
> >                             "type": 0
> >                         },
> >                         "watchers": {}
> >                     }
> >                 },
> >                 {
> >                     "osd": 16,
> >                     "primary": false,
> >                     "errors": [
> >                         "obj_size_info_mismatch"
> >                     ],
> >                     "size": 5883,
> >                     "object_info": {
> >                         "oid": {
> >                             "oid": "10008536718.00000000",
> >                             "key": "",
> >                             "snapid": -2,
> >                             "hash": 1752643257,
> >                             "max": 0,
> >                             "pool": 17,
> >                             "namespace": ""
> >                         },
> >                         "version": "134599'448331",
> >                         "prior_version": "134599'448330",
> >                         "last_reqid": "client.1580931080.0:671854",
> >                         "user_version": 448331,
> >                         "size": 3505,
> >                         "mtime": "2019-04-28 15:32:20.003519",
> >                         "local_mtime": "2019-04-28 15:32:25.991015",
> >                         "lost": 0,
> >                         "flags": [
> >                             "dirty",
> >                             "data_digest",
> >                             "omap_digest"
> >                         ],
> >                         "truncate_seq": 899,
> >                         "truncate_size": 0,
> >                         "data_digest": "0xf99a3bd3",
> >                         "omap_digest": "0xffffffff",
> >                         "expected_object_size": 0,
> >                         "expected_write_size": 0,
> >                         "alloc_hint_flags": 0,
> >                         "manifest": {
> >                             "type": 0
> >                         },
> >                         "watchers": {}
> >                     }
> >                 },
> >                 {
> >                     "osd": 19,
> >                     "primary": true,
> >                     "errors": [
> >                         "obj_size_info_mismatch"
> >                     ],
> >                     "size": 5883,
> >                     "object_info": {
> >                         "oid": {
> >                             "oid": "10008536718.00000000",
> >                             "key": "",
> >                             "snapid": -2,
> >                             "hash": 1752643257,
> >                             "max": 0,
> >                             "pool": 17,
> >                             "namespace": ""
> >                         },
> >                         "version": "134599'448331",
> >                         "prior_version": "134599'448330",
> >                         "last_reqid": "client.1580931080.0:671854",
> >                         "user_version": 448331,
> >                         "size": 3505,
> >                         "mtime": "2019-04-28 15:32:20.003519",
> >                         "local_mtime": "2019-04-28 15:32:25.991015",
> >                         "lost": 0,
> >                         "flags": [
> >                             "dirty",
> >                             "data_digest",
> >                             "omap_digest"
> >                         ],
> >                         "truncate_seq": 899,
> >                         "truncate_size": 0,
> >                         "data_digest": "0xf99a3bd3",
> >                         "omap_digest": "0xffffffff",
> >                         "expected_object_size": 0,
> >                         "expected_write_size": 0,
> >                         "alloc_hint_flags": 0,
> >                         "manifest": {
> >                             "type": 0
> >                         },
> >                         "watchers": {}
> >                     }
> >                 }
> >             ]
> >         }
> >     ]
> > }
> >
> >
> > To snip that down to the parts that appear to matter:
> >
> > "errors": [],
> >         "union_shard_errors": [
> >             "obj_size_info_mismatch"
> >             ],
> >             "shards": [
> >                 {
> >                     "errors": [
> >                         "obj_size_info_mismatch"
> >                     ],
> >                     "size": 5883,
> >                     "object_info": {
> >                        "size": 3505, }
> >
> >
> > It looks like the size info, does in fact mismatch (5883 != 3505).
> >
> > So I attempted a deep-scrub again, and the issue persists across both PG's.
> >
> > 2019-04-29 09:08:27.729 7fe4f5bee700  0 log_channel(cluster) log [DBG] : 17.2b9 deep-scrub starts
> > 2019-04-29 09:22:53.363 7fe4f5bee700 -1 log_channel(cluster) log [ERR] : 17.2b9 shard 19 soid 17:9d6cee
> > 16:::10008536718.00000000:head : candidate size 5883 info size 3505 mismatch
> > 2019-04-29 09:22:53.363 7fe4f5bee700 -1 log_channel(cluster) log [ERR] : 17.2b9 shard 7 soid 17:9d6cee1
> > 6:::10008536718.00000000:head : candidate size 5883 info size 3505 mismatch
> > 2019-04-29 09:22:53.363 7fe4f5bee700 -1 log_channel(cluster) log [ERR] : 17.2b9 shard 16 soid 17:9d6cee
> > 16:::10008536718.00000000:head : candidate size 5883 info size 3505 mismatch
> > 2019-04-29 09:22:53.363 7fe4f5bee700 -1 log_channel(cluster) log [ERR] : 17.2b9 soid 17:9d6cee16:::1000
> > 8536718.00000000:head : failed to pick suitable object info
> > 2019-04-29 09:22:53.363 7fe4f5bee700 -1 log_channel(cluster) log [ERR] : deep-scrub 17.2b9 17:9d6cee16:
> > ::10008536718.00000000:head : on disk size (5883) does not match object info size (3505) adjusted for o
> > ndisk to (3505)
> > 2019-04-29 09:27:46.840 7fe4f5bee700 -1 log_channel(cluster) log [ERR] : 17.2b9 deep-scrub 4 errors
> >
> >
> > Pool 17 is a cephfs data pool, if that makes any difference.
> > And the two MDS's listed in versions are active:standby, not active:active.
> >
> > My question is whether I should attempt a `ceph pg repair <pgid>` to attempt a fix of these objects, or take another approach, as the object size mismatch appears to persist across all 3 copies of the PG(s).
> > I know that ceph pg repair can be dangerous in certain circumstances, so I want to feel confident in the operation before undertaking the repair.
> >
> > I did look at all underlying disks for these PG's for issues or errors, and none bubbled to the top, so I don't believe it to be a hardware issue in this case.
> >
> > Appreciate any help.
> >
> > Thanks,
> >
> > Reed
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users@xxxxxxxxxxxxxx
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> --
> Cheers,
> Brad

-- 
Cheers,
Brad
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com