Re: obj_size_info_mismatch error handling

Just to follow up for the sake of the mailing list,

I hadn't had a chance to attempt your steps yet, but things appear to have worked themselves out.

Both scrub errors cleared without intervention. I'm not sure whether the object being touched in CephFS triggered the update of the size info, or whether something else cleared it.

I didn't see anything relating to the clearing in the mon, mgr, or OSD logs.

So I'm not entirely sure what fixed it, but it resolved on its own.

Thanks,

Reed

On Apr 30, 2019, at 8:01 PM, Brad Hubbard <bhubbard@xxxxxxxxxx> wrote:

On Wed, May 1, 2019 at 10:54 AM Brad Hubbard <bhubbard@xxxxxxxxxx> wrote:

Which size is correct?

Sorry, accidental discharge =D

If the object info size is *incorrect* try forcing a write to the OI
with something like the following.

1. rados -p [name_of_pool_17] setomapval 10008536718.00000000 temporary-key anything
2. ceph pg deep-scrub 17.2b9
3. Wait for the scrub to finish
4. rados -p [name_of_pool_17] rmomapkey 10008536718.00000000 temporary-key
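For anyone scripting this, the four steps above can be sketched as a small helper that builds the command lines. This is only a sketch: the pool name "cephfs_data" is a placeholder (substitute the real name of pool 17), and the object/PG values are the ones from this thread.

```python
# Sketch: build the command sequence for forcing an object-info rewrite
# via a temporary omap key. Pool/object/PG values are placeholders from
# this thread -- substitute your own before running anything.

def oi_rewrite_cmds(pool, obj, pgid, key="temporary-key"):
    """Return the argv lists for the setomapval / deep-scrub / rmomapkey steps."""
    return [
        ["rados", "-p", pool, "setomapval", obj, key, "anything"],
        ["ceph", "pg", "deep-scrub", pgid],
        # (wait for the scrub to finish before running the final step)
        ["rados", "-p", pool, "rmomapkey", obj, key],
    ]

for cmd in oi_rewrite_cmds("cephfs_data", "10008536718.00000000", "17.2b9"):
    print(" ".join(cmd))
```

Printing the commands first and running them by hand keeps a human in the loop between the deep-scrub and the key removal.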

If the object info size is *correct* you could try just doing a rados
get followed by a rados put of the object to see if the size is
updated correctly.
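The get-then-put roundtrip could be scripted along these lines. Again a sketch, not a recommendation: the `run` parameter is injectable purely so the command construction can be checked without touching a cluster, and the pool name would need to be the real one.

```python
import subprocess
import tempfile

def roundtrip(pool, obj, run=subprocess.run):
    """rados get of an object followed by a rados put of the same bytes,
    which may refresh a stale object info. `run` defaults to
    subprocess.run but is injectable for dry-run testing."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        path = tmp.name
    run(["rados", "-p", pool, "get", obj, path], check=True)
    run(["rados", "-p", pool, "put", obj, path], check=True)
    return path
```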

It's more likely the object info size is wrong IMHO.


On Tue, Apr 30, 2019 at 1:06 AM Reed Dier <reed.dier@xxxxxxxxxxx> wrote:

Hi list,

Woke up this morning to two PGs reporting scrub errors, in a way that I haven't seen before.

$ ceph versions
{
   "mon": {
       "ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic (stable)": 3
   },
   "mgr": {
       "ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic (stable)": 3
   },
   "osd": {
       "ceph version 13.2.4 (b10be4d44915a4d78a8e06aa31919e74927b142e) mimic (stable)": 156
   },
   "mds": {
       "ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic (stable)": 2
   },
   "overall": {
       "ceph version 13.2.4 (b10be4d44915a4d78a8e06aa31919e74927b142e) mimic (stable)": 156,
       "ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic (stable)": 8
   }
}


OSD_SCRUB_ERRORS 8 scrub errors
PG_DAMAGED Possible data damage: 2 pgs inconsistent
   pg 17.72 is active+clean+inconsistent, acting [3,7,153]
   pg 17.2b9 is active+clean+inconsistent, acting [19,7,16]


Here is what $ rados list-inconsistent-obj 17.2b9 --format=json-pretty yields:

{
   "epoch": 134582,
   "inconsistents": [
       {
           "object": {
               "name": "10008536718.00000000",
               "nspace": "",
               "locator": "",
               "snap": "head",
               "version": 0
           },
           "errors": [],
           "union_shard_errors": [
               "obj_size_info_mismatch"
           ],
           "shards": [
               {
                   "osd": 7,
                   "primary": false,
                   "errors": [
                       "obj_size_info_mismatch"
                   ],
                   "size": 5883,
                   "object_info": {
                       "oid": {
                           "oid": "10008536718.00000000",
                           "key": "",
                           "snapid": -2,
                           "hash": 1752643257,
                           "max": 0,
                           "pool": 17,
                           "namespace": ""
                       },
                       "version": "134599'448331",
                       "prior_version": "134599'448330",
                       "last_reqid": "client.1580931080.0:671854",
                       "user_version": 448331,
                       "size": 3505,
                       "mtime": "2019-04-28 15:32:20.003519",
                       "local_mtime": "2019-04-28 15:32:25.991015",
                       "lost": 0,
                       "flags": [
                           "dirty",
                           "data_digest",
                           "omap_digest"
                       ],
                       "truncate_seq": 899,
                       "truncate_size": 0,
                       "data_digest": "0xf99a3bd3",
                       "omap_digest": "0xffffffff",
                       "expected_object_size": 0,
                       "expected_write_size": 0,
                       "alloc_hint_flags": 0,
                       "manifest": {
                           "type": 0
                       },
                       "watchers": {}
                   }
               },
               {
                   "osd": 16,
                   "primary": false,
                   "errors": [
                       "obj_size_info_mismatch"
                   ],
                   "size": 5883,
                   "object_info": {
                       "oid": {
                           "oid": "10008536718.00000000",
                           "key": "",
                           "snapid": -2,
                           "hash": 1752643257,
                           "max": 0,
                           "pool": 17,
                           "namespace": ""
                       },
                       "version": "134599'448331",
                       "prior_version": "134599'448330",
                       "last_reqid": "client.1580931080.0:671854",
                       "user_version": 448331,
                       "size": 3505,
                       "mtime": "2019-04-28 15:32:20.003519",
                       "local_mtime": "2019-04-28 15:32:25.991015",
                       "lost": 0,
                       "flags": [
                           "dirty",
                           "data_digest",
                           "omap_digest"
                       ],
                       "truncate_seq": 899,
                       "truncate_size": 0,
                       "data_digest": "0xf99a3bd3",
                       "omap_digest": "0xffffffff",
                       "expected_object_size": 0,
                       "expected_write_size": 0,
                       "alloc_hint_flags": 0,
                       "manifest": {
                           "type": 0
                       },
                       "watchers": {}
                   }
               },
               {
                   "osd": 19,
                   "primary": true,
                   "errors": [
                       "obj_size_info_mismatch"
                   ],
                   "size": 5883,
                   "object_info": {
                       "oid": {
                           "oid": "10008536718.00000000",
                           "key": "",
                           "snapid": -2,
                           "hash": 1752643257,
                           "max": 0,
                           "pool": 17,
                           "namespace": ""
                       },
                       "version": "134599'448331",
                       "prior_version": "134599'448330",
                       "last_reqid": "client.1580931080.0:671854",
                       "user_version": 448331,
                       "size": 3505,
                       "mtime": "2019-04-28 15:32:20.003519",
                       "local_mtime": "2019-04-28 15:32:25.991015",
                       "lost": 0,
                       "flags": [
                           "dirty",
                           "data_digest",
                           "omap_digest"
                       ],
                       "truncate_seq": 899,
                       "truncate_size": 0,
                       "data_digest": "0xf99a3bd3",
                       "omap_digest": "0xffffffff",
                       "expected_object_size": 0,
                       "expected_write_size": 0,
                       "alloc_hint_flags": 0,
                       "manifest": {
                           "type": 0
                       },
                       "watchers": {}
                   }
               }
           ]
       }
   ]
}


To snip that down to the parts that appear to matter:

"errors": [],
       "union_shard_errors": [
           "obj_size_info_mismatch"
           ],
           "shards": [
               {
                   "errors": [
                       "obj_size_info_mismatch"
                   ],
                   "size": 5883,
                   "object_info": {
                      "size": 3505, }


It looks like the size info does, in fact, mismatch (5883 != 3505).
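For what it's worth, this comparison can be pulled out of the list-inconsistent-obj JSON mechanically rather than by eye. A small sketch, using a trimmed copy of the report above:

```python
import json

# Trimmed sample of `rados list-inconsistent-obj 17.2b9 --format=json-pretty`
report = json.loads("""
{
  "epoch": 134582,
  "inconsistents": [
    {
      "object": {"name": "10008536718.00000000", "snap": "head"},
      "union_shard_errors": ["obj_size_info_mismatch"],
      "shards": [
        {"osd": 7,  "errors": ["obj_size_info_mismatch"], "size": 5883, "object_info": {"size": 3505}},
        {"osd": 16, "errors": ["obj_size_info_mismatch"], "size": 5883, "object_info": {"size": 3505}},
        {"osd": 19, "errors": ["obj_size_info_mismatch"], "size": 5883, "object_info": {"size": 3505}}
      ]
    }
  ]
}
""")

def size_mismatches(report):
    """Yield (object_name, osd, on_disk_size, info_size) for every shard
    whose on-disk size disagrees with its object_info size."""
    for inc in report["inconsistents"]:
        for shard in inc["shards"]:
            on_disk = shard.get("size")
            info = shard.get("object_info", {}).get("size")
            if on_disk is not None and info is not None and on_disk != info:
                yield inc["object"]["name"], shard["osd"], on_disk, info

for name, osd, on_disk, info in size_mismatches(report):
    print(f"{name} osd.{osd}: on-disk {on_disk} != info {info}")
```

Here all three shards agree with each other and disagree with the recorded object info, which matches the scrub log below.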

So I attempted a deep-scrub again, and the issue persists across both PGs.

2019-04-29 09:08:27.729 7fe4f5bee700  0 log_channel(cluster) log [DBG] : 17.2b9 deep-scrub starts
2019-04-29 09:22:53.363 7fe4f5bee700 -1 log_channel(cluster) log [ERR] : 17.2b9 shard 19 soid 17:9d6cee16:::10008536718.00000000:head : candidate size 5883 info size 3505 mismatch
2019-04-29 09:22:53.363 7fe4f5bee700 -1 log_channel(cluster) log [ERR] : 17.2b9 shard 7 soid 17:9d6cee16:::10008536718.00000000:head : candidate size 5883 info size 3505 mismatch
2019-04-29 09:22:53.363 7fe4f5bee700 -1 log_channel(cluster) log [ERR] : 17.2b9 shard 16 soid 17:9d6cee16:::10008536718.00000000:head : candidate size 5883 info size 3505 mismatch
2019-04-29 09:22:53.363 7fe4f5bee700 -1 log_channel(cluster) log [ERR] : 17.2b9 soid 17:9d6cee16:::10008536718.00000000:head : failed to pick suitable object info
2019-04-29 09:22:53.363 7fe4f5bee700 -1 log_channel(cluster) log [ERR] : deep-scrub 17.2b9 17:9d6cee16:::10008536718.00000000:head : on disk size (5883) does not match object info size (3505) adjusted for ondisk to (3505)
2019-04-29 09:27:46.840 7fe4f5bee700 -1 log_channel(cluster) log [ERR] : 17.2b9 deep-scrub 4 errors


Pool 17 is a CephFS data pool, if that makes any difference.
And the two MDSs listed in versions are active:standby, not active:active.

My question is whether I should attempt a ceph pg repair <pgid> to fix these objects, or take another approach, since the object size mismatch appears to persist across all three copies of each PG.
I know that ceph pg repair can be dangerous in certain circumstances, so I want to feel confident in the operation before undertaking the repair.

I did look at all the underlying disks for these PGs for issues or errors, and none bubbled to the top, so I don't believe it to be a hardware issue in this case.

Appreciate any help.

Thanks,

Reed
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



--
Cheers,
Brad



-- 
Cheers,
Brad
