Hello Reed,
I would give PG repair a try. IIRC it shouldn't be an issue when you have size 3... it would be more difficult if you only had size 2, I guess...
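Something like this (PG IDs taken from your health output below; repair is driven by the primary OSD and runs in the background, so check the cluster log afterwards):

# tell the primary OSD of each inconsistent PG to repair it
$ ceph pg repair 17.72
$ ceph pg repair 17.2b9

# then watch the cluster log / health for the result
$ ceph -w
$ ceph health detail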
Hth Mehmet
On 29 April 2019 17:05:48 MESZ, Reed Dier <reed.dier@xxxxxxxxxxx> wrote:
Hi list,
Woke up this morning to two PGs reporting scrub errors, in a way that I haven't seen before.

$ ceph versions
{
    "mon": {
        "ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic (stable)": 3
    },
    "mgr": {
        "ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic (stable)": 3
    },
    "osd": {
        "ceph version 13.2.4 (b10be4d44915a4d78a8e06aa31919e74927b142e) mimic (stable)": 156
    },
    "mds": {
        "ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic (stable)": 2
    },
    "overall": {
        "ceph version 13.2.4 (b10be4d44915a4d78a8e06aa31919e74927b142e) mimic (stable)": 156,
        "ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic (stable)": 8
    }
}
OSD_SCRUB_ERRORS 8 scrub errors
PG_DAMAGED Possible data damage: 2 pgs inconsistent
    pg 17.72 is active+clean+inconsistent, acting [3,7,153]
    pg 17.2b9 is active+clean+inconsistent, acting [19,7,16]
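(That summary is the `ceph health detail` view; for completeness, the inconsistent PGs can also be enumerated per pool with the command below, where the pool name is just a placeholder for my cephfs data pool:)

# list PGs flagged inconsistent in a given pool
$ rados list-inconsistent-pg <cephfs-data-pool>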
Here is what $ rados list-inconsistent-obj 17.2b9 --format=json-pretty yields:

{
    "epoch": 134582,
    "inconsistents": [
        {
            "object": {
                "name": "10008536718.00000000",
                "nspace": "",
                "locator": "",
                "snap": "head",
                "version": 0
            },
            "errors": [],
            "union_shard_errors": [
                "obj_size_info_mismatch"
            ],
            "shards": [
                {
                    "osd": 7,
                    "primary": false,
                    "errors": [
                        "obj_size_info_mismatch"
                    ],
                    "size": 5883,
                    "object_info": {
                        "oid": {
                            "oid": "10008536718.00000000",
                            "key": "",
                            "snapid": -2,
                            "hash": 1752643257,
                            "max": 0,
                            "pool": 17,
                            "namespace": ""
                        },
                        "version": "134599'448331",
                        "prior_version": "134599'448330",
                        "last_reqid": "client.1580931080.0:671854",
                        "user_version": 448331,
                        "size": 3505,
                        "mtime": "2019-04-28 15:32:20.003519",
                        "local_mtime": "2019-04-28 15:32:25.991015",
                        "lost": 0,
                        "flags": [
                            "dirty",
                            "data_digest",
                            "omap_digest"
                        ],
                        "truncate_seq": 899,
                        "truncate_size": 0,
                        "data_digest": "0xf99a3bd3",
                        "omap_digest": "0xffffffff",
                        "expected_object_size": 0,
                        "expected_write_size": 0,
                        "alloc_hint_flags": 0,
                        "manifest": {
                            "type": 0
                        },
                        "watchers": {}
                    }
                },
                {
                    "osd": 16,
                    "primary": false,
                    "errors": [
                        "obj_size_info_mismatch"
                    ],
                    "size": 5883,
                    "object_info": {
                        "oid": {
                            "oid": "10008536718.00000000",
                            "key": "",
                            "snapid": -2,
                            "hash": 1752643257,
                            "max": 0,
                            "pool": 17,
                            "namespace": ""
                        },
                        "version": "134599'448331",
                        "prior_version": "134599'448330",
                        "last_reqid": "client.1580931080.0:671854",
                        "user_version": 448331,
                        "size": 3505,
                        "mtime": "2019-04-28 15:32:20.003519",
                        "local_mtime": "2019-04-28 15:32:25.991015",
                        "lost": 0,
                        "flags": [
                            "dirty",
                            "data_digest",
                            "omap_digest"
                        ],
                        "truncate_seq": 899,
                        "truncate_size": 0,
                        "data_digest": "0xf99a3bd3",
                        "omap_digest": "0xffffffff",
                        "expected_object_size": 0,
                        "expected_write_size": 0,
                        "alloc_hint_flags": 0,
                        "manifest": {
                            "type": 0
                        },
                        "watchers": {}
                    }
                },
                {
                    "osd": 19,
                    "primary": true,
                    "errors": [
                        "obj_size_info_mismatch"
                    ],
                    "size": 5883,
                    "object_info": {
                        "oid": {
                            "oid": "10008536718.00000000",
                            "key": "",
                            "snapid": -2,
                            "hash": 1752643257,
                            "max": 0,
                            "pool": 17,
                            "namespace": ""
                        },
                        "version": "134599'448331",
                        "prior_version": "134599'448330",
                        "last_reqid": "client.1580931080.0:671854",
                        "user_version": 448331,
                        "size": 3505,
                        "mtime": "2019-04-28 15:32:20.003519",
                        "local_mtime": "2019-04-28 15:32:25.991015",
                        "lost": 0,
                        "flags": [
                            "dirty",
                            "data_digest",
                            "omap_digest"
                        ],
                        "truncate_seq": 899,
                        "truncate_size": 0,
                        "data_digest": "0xf99a3bd3",
                        "omap_digest": "0xffffffff",
                        "expected_object_size": 0,
                        "expected_write_size": 0,
                        "alloc_hint_flags": 0,
                        "manifest": {
                            "type": 0
                        },
                        "watchers": {}
                    }
                }
            ]
        }
    ]
}
To snip that down to the parts that appear to matter:

    "errors": [],
    "union_shard_errors": [
        "obj_size_info_mismatch"
    ],
    "shards": [
        {
            "errors": [
                "obj_size_info_mismatch"
            ],
            "size": 5883,
            "object_info": {
                "size": 3505,
            }
It looks like the size info does in fact mismatch (5883 != 3505): every shard reports an on-disk size of 5883, while the object_info on every shard says 3505.
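For reference, a quick way to pull just those fields out for every shard is something like the following (jq is my own addition here, not anything Ceph ships):

# compare on-disk size vs object_info size for each shard of the PG
$ rados list-inconsistent-obj 17.2b9 --format=json-pretty | \
      jq '.inconsistents[].shards[] | {osd, primary, errors, disk_size: .size, info_size: .object_info.size}'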
So I attempted a deep-scrub again, and the issue persists across both PGs.

2019-04-29 09:08:27.729 7fe4f5bee700  0 log_channel(cluster) log [DBG] : 17.2b9 deep-scrub starts
2019-04-29 09:22:53.363 7fe4f5bee700 -1 log_channel(cluster) log [ERR] : 17.2b9 shard 19 soid 17:9d6cee16:::10008536718.00000000:head : candidate size 5883 info size 3505 mismatch
2019-04-29 09:22:53.363 7fe4f5bee700 -1 log_channel(cluster) log [ERR] : 17.2b9 shard 7 soid 17:9d6cee16:::10008536718.00000000:head : candidate size 5883 info size 3505 mismatch
2019-04-29 09:22:53.363 7fe4f5bee700 -1 log_channel(cluster) log [ERR] : 17.2b9 shard 16 soid 17:9d6cee16:::10008536718.00000000:head : candidate size 5883 info size 3505 mismatch
2019-04-29 09:22:53.363 7fe4f5bee700 -1 log_channel(cluster) log [ERR] : 17.2b9 soid 17:9d6cee16:::10008536718.00000000:head : failed to pick suitable object info
2019-04-29 09:22:53.363 7fe4f5bee700 -1 log_channel(cluster) log [ERR] : deep-scrub 17.2b9 17:9d6cee16:::10008536718.00000000:head : on disk size (5883) does not match object info size (3505) adjusted for ondisk to (3505)
2019-04-29 09:27:46.840 7fe4f5bee700 -1 log_channel(cluster) log [ERR] : 17.2b9 deep-scrub 4 errors
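(For completeness, a per-PG deep scrub can be re-triggered with nothing more exotic than the standard command:)

# ask the primary OSD to deep-scrub each PG again
$ ceph pg deep-scrub 17.72
$ ceph pg deep-scrub 17.2b9

# and follow along in the cluster log
$ ceph -w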
Pool 17 is a CephFS data pool, if that makes any difference. And the two MDSes listed in versions are active:standby, not active:active.
My question is whether I should run `ceph pg repair <pgid>` to try to fix these objects, or take another approach, since the object size mismatch appears to persist across all 3 copies of the PG(s). I know that ceph pg repair can be dangerous in certain circumstances, so I want to feel confident in the operation before undertaking the repair.
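In case it helps anyone weigh in, the checks I have in mind before pulling the trigger are roughly the following (the pool name is a placeholder for my cephfs data pool):

# confirm the replication size / min_size of the pool
$ ceph osd pool get <cephfs-data-pool> size
$ ceph osd pool get <cephfs-data-pool> min_size

# confirm which OSD is primary for each inconsistent PG, since repair is driven by the primary
$ ceph pg map 17.72
$ ceph pg map 17.2b9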
I did look at all the underlying disks for these PGs for issues or errors, and nothing bubbled to the top, so I don't believe it to be a hardware issue in this case.
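(In case it is useful, the kind of check I mean is roughly the following; device names are placeholders, smartctl comes from smartmontools rather than Ceph, and the metadata field names are from memory:)

# map each OSD in the acting sets back to its host and backing device
$ ceph osd metadata 19 | jq '{hostname, devices}'
$ ceph osd metadata 7 | jq '{hostname, devices}'
# ...and likewise for OSDs 16, 3, and 153

# then, on the relevant host, check SMART health and the kernel log for the backing device
$ sudo smartctl -a /dev/sdX
$ dmesg | grep -i sdX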
Appreciate any help.
Thanks,
Reed