On 24/07/18 04:12, Brad Hubbard wrote:
> Is there anything unusual or different about osd 44? Is there anything
> different about its history?
>

Nothing that we are aware of. The debug level of the logs is also very low, so there is not much we can see.

> It seems 44 rarely agrees with others. Yes, the same procedure should
> fix it with the same caveats.

We fixed it again. Thanks a lot. I guess if it happens again we will just take OSD 44 out.

>
> On Mon, Jul 23, 2018 at 11:48 PM, Ana Aviles <ana@xxxxxxxxxxxx> wrote:
>> I replaced the object with rados as suggested, and right after forced a
>> deep scrub, which got us back to HEALTH_OK.
>>
>> However, now we have another inconsistent PG, for the same rbd
>> image but a different object: the object that was also mentioned in the
>> previous inconsistent PG. But now it's worse, because we have a
>> data_digest mismatch. I wonder whether this tells us anything about the
>> previous substitution, or whether I should just go down the same path and
>> replace this object with rados.
>> >> >> pg 0.186 is active+clean+inconsistent, acting [36,26,44] >> >> rados list-inconsistent-obj 0.186 >> { >> "epoch": 30586, >> "inconsistents": [ >> { >> "object": { >> "name": "rbd_data.15cec2ae8944a.000000000004db0e", >> "nspace": "", >> "locator": "", >> "snap": "head", >> "version": 5493833 >> }, >> "errors": [ >> "object_info_inconsistency", >> "data_digest_mismatch", >> "attr_value_mismatch" >> ], >> "union_shard_errors": [ >> "data_digest_mismatch_oi" >> ], >> "selected_object_info": >> "0:09c2dd3e:::rbd_data.15cec2ae8944a.000000000015c7d6:head(30587'5493833 >> client.1246390.0:1 dirty|data_digest|omap_digest s 4194304 uv 5493833 dd >> 264b7d0d od ffffffff alloc_hint [0 0])", >> "shards": [ >> { >> "osd": 26, >> "errors": [ >> "data_digest_mismatch_oi" >> ], >> "size": 4194304, >> "omap_digest": "0xffffffff", >> "data_digest": "0x7dd0d0bd", >> "object_info": >> "0:618e3778:::rbd_data.15cec2ae8944a.000000000004db0e:head(30537'5509201 >> osd.36.0:8552301 dirty|data_digest|omap_digest s 4194304 uv 5498082 dd >> 7dd0d0bd od ffffffff alloc_hint [4194304 4194304])", >> "attrs": [ >> { >> "name": "_", >> "value": >> "EAggAQAABANIAAAAAAAAACcAAAByYmRfZGF0YS4xNWNlYzJhZTg5NDRhLjAwMDAwMDAwMDAwNGRiMGX+\/\/\/\/\/\/\/\/\/4Zx7B4AAAAAAAAAAAAAAAAABgMcAAAAAAAAAAAAAAD\/\/\/\/\/AAAAAAAAAAD\/\/\/\/\/\/\/\/\/\/wAAAABREFQAAAAAAEl3AADi5FMAAAAAAEl3AAACAhUAAAAEJAAAAAAAAABtf4IAAAAAAAAAAAAAAEAAAAAAAPpfSVvV\/VMKAgIVAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA4uRTAAAAAAAAAAAAAAAAAAA0AAAA+l9JW4x6Rw290NB9\/\/\/\/\/wAAQAAAAAAAAABAAAAAAAAAAAAA", >> "Base64": true >> }, >> { >> "name": "snapset", >> "value": >> "AgIZAAAAAAAAAAAAAAABAAAAAAAAAAAAAAAAAAAAAA==", >> "Base64": true >> } >> ] >> }, >> { >> "osd": 36, >> "errors": [ >> "data_digest_mismatch_oi" >> ], >> "size": 4194304, >> "omap_digest": "0xffffffff", >> "data_digest": "0x7dd0d0bd", >> "object_info": >> "0:618e3778:::rbd_data.15cec2ae8944a.000000000004db0e:head(30537'5509201 >> osd.36.0:8552301 
dirty|data_digest|omap_digest s 4194304 uv 5498082 dd >> 7dd0d0bd od ffffffff alloc_hint [4194304 4194304])", >> "attrs": [ >> { >> "name": "_", >> "value": >> "EAggAQAABANIAAAAAAAAACcAAAByYmRfZGF0YS4xNWNlYzJhZTg5NDRhLjAwMDAwMDAwMDAwNGRiMGX+\/\/\/\/\/\/\/\/\/4Zx7B4AAAAAAAAAAAAAAAAABgMcAAAAAAAAAAAAAAD\/\/\/\/\/AAAAAAAAAAD\/\/\/\/\/\/\/\/\/\/wAAAABREFQAAAAAAEl3AADi5FMAAAAAAEl3AAACAhUAAAAEJAAAAAAAAABtf4IAAAAAAAAAAAAAAEAAAAAAAPpfSVvV\/VMKAgIVAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA4uRTAAAAAAAAAAAAAAAAAAA0AAAA+l9JW4x6Rw290NB9\/\/\/\/\/wAAQAAAAAAAAABAAAAAAAAAAAAA", >> "Base64": true >> }, >> ] >> }, >> { >> "osd": 44, >> "errors": [], >> "size": 4194304, >> "omap_digest": "0xffffffff", >> "data_digest": "0x264b7d0d", >> "object_info": >> "0:09c2dd3e:::rbd_data.15cec2ae8944a.000000000015c7d6:head(30587'5493833 >> client.1246390.0:1 dirty|data_digest|omap_digest s 4194304 uv 5493833 dd >> 264b7d0d od ffffffff alloc_hint [0 0])", >> "attrs": [ >> { >> "name": "_", >> "value": >> "EAggAQAABANIAAAAAAAAACcAAAByYmRfZGF0YS4xNWNlYzJhZTg5NDRhLjAwMDAwMDAwMDAxNWM3ZDb+\/\/\/\/\/\/\/\/\/5BDu3wAAAAAAAAAAAAAAAAABgMcAAAAAAAAAAAAAAD\/\/\/\/\/AAAAAAAAAAD\/\/\/\/\/\/\/\/\/\/wAAAABJ1FMAAAAAAHt3AAD0eE4AAAAAALxoAAACAhUAAAAItgQTAAAAAAABAAAAAAAAAAAAAAAAAEAAAAAAAIbaUVtSy\/8jAgIVAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAASdRTAAAAAAAAAAAAAAAAAAA0AAAAhtpRW\/VN8CQNfUsm\/\/\/\/\/wAAAAAAAAAAAAAAAAAAAAAAAAAA", >> "Base64": true >> }, >> { >> "name": "snapset", >> "value": >> "AgIZAAAAAAAAAAAAAAABAAAAAAAAAAAAAAAAAAAAAA==", >> "Base64": true >> } >> ] >> } >> ] >> } >> ] >> } >> >> >> On 20/07/18 00:27, Brad Hubbard wrote: >>> On Fri, Jul 20, 2018 at 1:05 AM, Ana Aviles <ana@xxxxxxxxxxxx> wrote: >>>> >>>> >>>> On 19/07/18 03:25, Brad Hubbard wrote: >>>>> On Wed, Jul 18, 2018 at 6:25 PM, Ana Aviles <ana@xxxxxxxxxxxx> wrote: >>>>>> Ah ok. Then I think it confirms what you are saying. 
Here it is: >>>>>> >>>>>> $ rados list-inconsistent-obj 0.190 >>>>>> {"epoch":30579,"inconsistents":[{"object":{"name":"rbd_data.15cec2ae8944a.000000000015c7d6","nspace":"","locator":"","snap":"head","version":5498082},"errors":["object_info_inconsistency","attr_value_mismatch"],"union_shard_errors":[],"selected_object_info":"0:618e3778:::rbd_data.15cec2ae8944a.000000000004db0e:head(30537'5509201 >>>>>> osd.36.0:8552301 dirty|data_digest|omap_digest s 4194304 uv 5498082 dd >>>>>> 7dd0d0bd od ffffffff alloc_hint [4194304 >>>>>> 4194304])","shards":[{"osd":16,"errors":[],"size":4194304,"object_info":"0:09c2dd3e:::rbd_data.15cec2ae8944a.000000000015c7d6:head(26812'5142772 >>>>>> client.1044166.0:393154060 dirty|data_digest|omap_digest s 4194304 uv >>>>>> 5142772 dd 264b7d0d od ffffffff alloc_hint [0 >>>>>> 0])","attrs":[{"name":"_","value":"DwgMAQAABANIAAAAAAAAACcAAAByYmRfZGF0YS4xNWNlYzJhZTg5NDRhLjAwMDAwMDAwMDAxNWM3ZDb+\/\/\/\/\/\/\/\/\/5BDu3wAAAAAAAAAAAAAAAAABgMcAAAAAAAAAAAAAAD\/\/\/\/\/AAAAAAAAAAD\/\/\/\/\/\/\/\/\/\/wAAAAD0eE4AAAAAALxoAADzeE4AAAAAALxoAAACAhUAAAAIxu4PAAAAAAAMDm8XAAAAAAAAAAAAAEAAAAAAAOJmPVsEa24SAgIVAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA9HhOAAAAAAAAAAAAAAAAAAA0AAAA4mY9W1Q\/lBwNfUsm\/\/\/\/\/w==","Base64":true},{"name":"snapset","value":"AgIZAAAAAAAAAAAAAAABAAAAAAAAAAAAAAAAAAAAAA==","Base64":true}]},{"osd":37,"errors":[],"size":4194304,"object_info":"0:09c2dd3e:::rbd_data.15cec2ae8944a.000000000015c7d6:head(26812'5142772 >>>>>> client.1044166.0:393154060 dirty|data_digest|omap_digest s 4194304 uv >>>>>> 5142772 dd 264b7d0d od ffffffff alloc_hint [0 >>>>>> 
0])","attrs":[{"name":"_","value":"DwgMAQAABANIAAAAAAAAACcAAAByYmRfZGF0YS4xNWNlYzJhZTg5NDRhLjAwMDAwMDAwMDAxNWM3ZDb+\/\/\/\/\/\/\/\/\/5BDu3wAAAAAAAAAAAAAAAAABgMcAAAAAAAAAAAAAAD\/\/\/\/\/AAAAAAAAAAD\/\/\/\/\/\/\/\/\/\/wAAAAD0eE4AAAAAALxoAADzeE4AAAAAALxoAAACAhUAAAAIxu4PAAAAAAAMDm8XAAAAAAAAAAAAAEAAAAAAAOJmPVsEa24SAgIVAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA9HhOAAAAAAAAAAAAAAAAAAA0AAAA4mY9W1Q\/lBwNfUsm\/\/\/\/\/w==","Base64":true},{"name":"snapset","value":"AgIZAAAAAAAAAAAAAAABAAAAAAAAAAAAAAAAAAAAAA==","Base64":true}]},{"osd":44,"errors":[],"size":4194304,"object_info":"0:618e3778:::rbd_data.15cec2ae8944a.000000000004db0e:head(30537'5509201 >>>>>> osd.36.0:8552301 dirty|data_digest|omap_digest s 4194304 uv 5498082 dd >>>>>> 7dd0d0bd od ffffffff alloc_hint [4194304 >>>>>> 4194304])","attrs":[{"name":"_","value":"EAggAQAABANIAAAAAAAAACcAAAByYmRfZGF0YS4xNWNlYzJhZTg5NDRhLjAwMDAwMDAwMDAwNGRiMGX+\/\/\/\/\/\/\/\/\/4Zx7B4AAAAAAAAAAAAAAAAABgMcAAAAAAAAAAAAAAD\/\/\/\/\/AAAAAAAAAAD\/\/\/\/\/\/\/\/\/\/wAAAABREFQAAAAAAEl3AADi5FMAAAAAAEl3AAACAhUAAAAEJAAAAAAAAABtf4IAAAAAAAAAAAAAAEAAAAAAAPpfSVvV\/VMKAgIVAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA4uRTAAAAAAAAAAAAAAAAAAA0AAAA+l9JW4x6Rw290NB9\/\/\/\/\/wAAQAAAAAAAAABAAAAAAAAAAAAA","Base64":true},{"name":"snapset","value":"AgIZAAAAAAAAAAAAAAABAAAAAAAAAAAAAAAAAAAAAA==","Base64":true}]}]}]} >>>>>> >>>>>> >>>>>> To determine which is the right version of the object, is there no >>>>>> timestamp that can tell us? maybe the object got updated to osd.37 and >>>>>> osd.16 while osd.44 was down, and there comes the missmatch? because >>>>>> otherwise, shouldn't the authoritative osd be leading? >>>>> >>>>> The primary will be serving IO requests so the version on osd 37 is >>>>> what will be read by clients so I guess going with that is reasonable. >>>>> >>>> >>>> OK good. 
>>>>
>>>>> The version on osd 44 was actually modified after the others (epoch
>>>>> 30537, as opposed to epoch 26812) but the sizes are all the same, so
>>>>> the difference may be trivial (metadata only, perhaps) and, according to
>>>>> the last request id (osd.36.0:8552301), the change came from another osd
>>>>> (36), which is kind of unexpected. Is there, or was there, a cache tier
>>>>> involved?
>>>>
>>>> Ah OK, very interesting! No, no cache tier involved. So at one point
>>>> osd.36 was part of the PG set?
>>>
>>> Maybe. All we know is that the last request came from osd.36, which is
>>> unusual because changes in this context generally only come from
>>> clients. A cache tier might explain it, which is why I mentioned it.
>>>
>>>>
>>>>>
>>>>> If you want to go with the version that is currently being used (37
>>>>> and 16) you can just quiesce the rbd image clients and do a rados get,
>>>>> then a rados put of the object. I would suggest taking a backup of the
>>>>> object from osd 44 using the ceph-objectstore-tool although, as I
>>>>> said, that version is not the one being used, so I doubt you will miss it.
>>>>>
>>>>
>>>> Great, will do that. Thanks a lot for the help.
>>>
>>> yw.
>>>
>>>>
>>>>>>
>>>>>> Regards,
>>>>>> Ana
>>>>>>
>>>>>>
>>>>>> On 18/07/18 05:24, Brad Hubbard wrote:
>>>>>>> OK. What I *meant* to ask for was the output of "rados
>>>>>>> list-inconsistent-obj 0.190" (might still be worth posting that but it
>>>>>>> should just confirm findings below).
>>>>>>>
>>>>>>>
>>>>>>> The relevant lines from the log are below.
>>>>>>> >>>>>>> 2018-07-16 12:24:45.940910 7fb422340700 2 osd.37 pg_epoch: 30554 >>>>>>> pg[0.190( v 30554'5390084 (30537'5387075,30554'5390084] >>>>>>> local-les=30554 n=4123 ec=1 les/c/f 30554/30554/0 30552/30553/30542) >>>>>>> [37,44,16] r=0 lpr=30553 crt=30554'5390079 lcod 30554'5390083 mlcod >>>>>>> 30554'5390083 active+clean+scrubbing+deep+inconsistent+repair] 0.190 >>>>>>> shard 16: soid 0:09c2dd3e:::rbd_data.15cec2ae8944a.000000000015c7d6:head >>>>>>> data_digest 0x264b7d0d != data_digest 0x7dd0d0bd from shard 44, >>>>>>> data_digest 0x264b7d0d != data_digest 0x7dd0d0bd from auth oi >>>>>>> 0:618e3778:::rbd_data.15cec2ae8944a.000000000004db0e:head(30537'5509201 >>>>>>> osd.36.0:8552301 dirty|data_digest|omap_digest s 4194304 uv 5498082 dd >>>>>>> 7dd0d0bd od ffffffff alloc_hint [4194304 4194304]), attr value >>>>>>> mismatch '_' 2018-07-16 12:24:45.940941 7fb422340700 -1 >>>>>>> log_channel(cluster) log [ERR] : 0.190 shard 16: soid >>>>>>> 0:09c2dd3e:::rbd_data.15cec2ae8944a.000000000015c7d6:head data_digest >>>>>>> 0x264b7d0d != data_digest 0x7dd0d0bd from shard 44, data_digest >>>>>>> 0x264b7d0d != data_digest 0x7dd0d0bd from auth oi >>>>>>> 0:618e3778:::rbd_data.15cec2ae8944a.000000000004db0e:head(30537'5509201 >>>>>>> osd.36.0:8552301 dirty|data_digest|omap_digest s 4194304 uv 5498082 dd >>>>>>> 7dd0d0bd od ffffffff alloc_hint [4194304 4194304]), attr value >>>>>>> mismatch '_' 2018-07-16 12:24:45.940957 7fb422340700 -1 >>>>>>> log_channel(cluster) log [ERR] : 0.190 shard 37: soid >>>>>>> 0:09c2dd3e:::rbd_data.15cec2ae8944a.000000000015c7d6:head data_digest >>>>>>> 0x264b7d0d != data_digest 0x7dd0d0bd from shard 44, data_digest >>>>>>> 0x264b7d0d != data_digest 0x7dd0d0bd from auth oi >>>>>>> 0:618e3778:::rbd_data.15cec2ae8944a.000000000004db0e:head(30537'5509201 >>>>>>> osd.36.0:8552301 dirty|data_digest|omap_digest s 4194304 uv 5498082 dd >>>>>>> 7dd0d0bd od ffffffff alloc_hint [4194304 4194304]), attr value >>>>>>> mismatch '_' >>>>>>> >>>>>>> They 
show that osd 44 has been chosen as the authoritative shard, that
>>>>>>> it has a data digest for this object of 0x7dd0d0bd, and that the
>>>>>>> data digest in the authoritative object info is also 0x7dd0d0bd.
>>>>>>>
>>>>>>> Shard 16, however, has a data digest of 0x264b7d0d, and so does shard 37,
>>>>>>> so the data for this object on osds 16 and 37 is different from that on
>>>>>>> osd 44.
>>>>>>>
>>>>>>> Basically, you'll need to pick which is the "right" copy of the object
>>>>>>> (I can't tell you), quiesce traffic to/from that object (rbd image), and
>>>>>>> get/put that object back into the cluster to fix the mismatch. Since
>>>>>>> this appears to be an rbd image, this could potentially result in an
>>>>>>> image that needs an fsck or equivalent, IIUC.
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Jul 17, 2018 at 10:06 PM, Ana Aviles <ana@xxxxxxxxxxxx> wrote:
>>>>>>>>
>>>>>>>> Hi Brad,
>>>>>>>>
>>>>>>>> Here it is:
>>>>>>>>
>>>>>>>> { >>>>>>>> "state": "active+clean+inconsistent", >>>>>>>> "snap_trimq": "[]", >>>>>>>> "epoch": 30581, >>>>>>>> "up": [ >>>>>>>> 37, >>>>>>>> 44, >>>>>>>> 16 >>>>>>>> ], >>>>>>>> "acting": [ >>>>>>>> 37, >>>>>>>> 44, >>>>>>>> 16 >>>>>>>> ], >>>>>>>> "actingbackfill": [ >>>>>>>> "16", >>>>>>>> "37", >>>>>>>> "44" >>>>>>>> ], >>>>>>>> "info": { >>>>>>>> "pgid": "0.190", >>>>>>>> "last_update": "30581'5420535", >>>>>>>> "last_complete": "30581'5420535", >>>>>>>> "log_tail": "30581'5417484", >>>>>>>> "last_user_version": 5420535, >>>>>>>> "last_backfill": "MAX", >>>>>>>> "last_backfill_bitwise": 0, >>>>>>>> "purged_snaps": "[]", >>>>>>>> "history": { >>>>>>>> "epoch_created": 1, >>>>>>>> "last_epoch_started": 30580, >>>>>>>> "last_epoch_clean": 30581, >>>>>>>> "last_epoch_split": 0, >>>>>>>> "last_epoch_marked_full": 0, >>>>>>>> "same_up_since": 30578, >>>>>>>> "same_interval_since": 30579, >>>>>>>> "same_primary_since": 30565, >>>>>>>> "last_scrub": "30554'5390240", >>>>>>>> "last_scrub_stamp": "2018-07-16 12:27:03.547524", >>>>>>>>
"last_deep_scrub": "30554'5390240", >>>>>>>> "last_deep_scrub_stamp": "2018-07-16 12:27:03.547524", >>>>>>>> "last_clean_scrub_stamp": "2018-07-13 08:45:32.622555" >>>>>>>> }, >>>>>>>> "stats": { >>>>>>>> "version": "30581'5420535", >>>>>>>> "reported_seq": "5155553", >>>>>>>> "reported_epoch": "30581", >>>>>>>> "state": "active+clean+inconsistent", >>>>>>>> "last_fresh": "2018-07-17 12:02:13.002428", >>>>>>>> "last_change": "2018-07-16 13:37:24.020403", >>>>>>>> "last_active": "2018-07-17 12:02:13.002428", >>>>>>>> "last_peered": "2018-07-17 12:02:13.002428", >>>>>>>> "last_clean": "2018-07-17 12:02:13.002428", >>>>>>>> "last_became_active": "2018-07-16 13:37:13.173821", >>>>>>>> "last_became_peered": "2018-07-16 13:37:13.173821", >>>>>>>> "last_unstale": "2018-07-17 12:02:13.002428", >>>>>>>> "last_undegraded": "2018-07-17 12:02:13.002428", >>>>>>>> "last_fullsized": "2018-07-17 12:02:13.002428", >>>>>>>> "mapping_epoch": 30578, >>>>>>>> "log_start": "30581'5417484", >>>>>>>> "ondisk_log_start": "30581'5417484", >>>>>>>> "created": 1, >>>>>>>> "last_epoch_clean": 30581, >>>>>>>> "parent": "0.0", >>>>>>>> "parent_split_bits": 0, >>>>>>>> "last_scrub": "30554'5390240", >>>>>>>> "last_scrub_stamp": "2018-07-16 12:27:03.547524", >>>>>>>> "last_deep_scrub": "30554'5390240", >>>>>>>> "last_deep_scrub_stamp": "2018-07-16 12:27:03.547524", >>>>>>>> "last_clean_scrub_stamp": "2018-07-13 08:45:32.622555", >>>>>>>> "log_size": 3051, >>>>>>>> "ondisk_log_size": 3051, >>>>>>>> "stats_invalid": false, >>>>>>>> "dirty_stats_invalid": false, >>>>>>>> "omap_stats_invalid": false, >>>>>>>> "hitset_stats_invalid": false, >>>>>>>> "hitset_bytes_stats_invalid": false, >>>>>>>> "pin_stats_invalid": true, >>>>>>>> "stat_sum": { >>>>>>>> "num_bytes": 16946139153, >>>>>>>> "num_objects": 4148, >>>>>>>> "num_object_clones": 0, >>>>>>>> "num_object_copies": 12444, >>>>>>>> "num_objects_missing_on_primary": 0, >>>>>>>> "num_objects_missing": 0, >>>>>>>> "num_objects_degraded": 0, >>>>>>>> 
"num_objects_misplaced": 0, >>>>>>>> "num_objects_unfound": 0, >>>>>>>> "num_objects_dirty": 4148, >>>>>>>> "num_whiteouts": 0, >>>>>>>> "num_read": 6895104, >>>>>>>> "num_read_kb": 292185552, >>>>>>>> "num_write": 10032749, >>>>>>>> "num_write_kb": 185167701, >>>>>>>> "num_scrub_errors": 1, >>>>>>>> "num_shallow_scrub_errors": 1, >>>>>>>> "num_deep_scrub_errors": 0, >>>>>>>> "num_objects_recovered": 103598, >>>>>>>> "num_bytes_recovered": 424107954567, >>>>>>>> "num_keys_recovered": 110, >>>>>>>> "num_objects_omap": 1, >>>>>>>> "num_objects_hit_set_archive": 0, >>>>>>>> "num_bytes_hit_set_archive": 0, >>>>>>>> "num_flush": 0, >>>>>>>> "num_flush_kb": 0, >>>>>>>> "num_evict": 0, >>>>>>>> "num_evict_kb": 0, >>>>>>>> "num_promote": 0, >>>>>>>> "num_flush_mode_high": 0, >>>>>>>> "num_flush_mode_low": 0, >>>>>>>> "num_evict_mode_some": 0, >>>>>>>> "num_evict_mode_full": 0, >>>>>>>> "num_objects_pinned": 0 >>>>>>>> }, >>>>>>>> "up": [ >>>>>>>> 37, >>>>>>>> 44, >>>>>>>> 16 >>>>>>>> ], >>>>>>>> "acting": [ >>>>>>>> 37, >>>>>>>> 44, >>>>>>>> 16 >>>>>>>> ], >>>>>>>> "blocked_by": [], >>>>>>>> "up_primary": 37, >>>>>>>> "acting_primary": 37 >>>>>>>> }, >>>>>>>> "empty": 0, >>>>>>>> "dne": 0, >>>>>>>> "incomplete": 0, >>>>>>>> "last_epoch_started": 30580, >>>>>>>> "hit_set_history": { >>>>>>>> "current_last_update": "0'0", >>>>>>>> "history": [] >>>>>>>> } >>>>>>>> }, >>>>>>>> "peer_info": [ >>>>>>>> { >>>>>>>> "peer": "16", >>>>>>>> "pgid": "0.190", >>>>>>>> "last_update": "30581'5420535", >>>>>>>> "last_complete": "30581'5420535", >>>>>>>> "log_tail": "30537'5387475", >>>>>>>> "last_user_version": 5390577, >>>>>>>> "last_backfill": "MAX", >>>>>>>> "last_backfill_bitwise": 1, >>>>>>>> "purged_snaps": "[]", >>>>>>>> "history": { >>>>>>>> "epoch_created": 1, >>>>>>>> "last_epoch_started": 30580, >>>>>>>> "last_epoch_clean": 30581, >>>>>>>> "last_epoch_split": 0, >>>>>>>> "last_epoch_marked_full": 0, >>>>>>>> "same_up_since": 30578, >>>>>>>> "same_interval_since": 30579, 
>>>>>>>> "same_primary_since": 30565, >>>>>>>> "last_scrub": "30554'5390240", >>>>>>>> "last_scrub_stamp": "2018-07-16 12:27:03.547524", >>>>>>>> "last_deep_scrub": "30554'5390240", >>>>>>>> "last_deep_scrub_stamp": "2018-07-16 12:27:03.547524", >>>>>>>> "last_clean_scrub_stamp": "2018-07-13 08:45:32.622555" >>>>>>>> }, >>>>>>>> "stats": { >>>>>>>> "version": "30570'5390575", >>>>>>>> "reported_seq": "5139870", >>>>>>>> "reported_epoch": "30576", >>>>>>>> "state": "active+undersized+degraded+inconsistent", >>>>>>>> "last_fresh": "2018-07-16 13:36:40.284756", >>>>>>>> "last_change": "2018-07-16 13:36:40.284277", >>>>>>>> "last_active": "2018-07-16 13:36:40.284756", >>>>>>>> "last_peered": "2018-07-16 13:36:40.284756", >>>>>>>> "last_clean": "2018-07-16 13:36:23.558224", >>>>>>>> "last_became_active": "2018-07-16 13:36:40.284277", >>>>>>>> "last_became_peered": "2018-07-16 13:36:40.284277", >>>>>>>> "last_unstale": "2018-07-16 13:36:40.284756", >>>>>>>> "last_undegraded": "2018-07-16 13:36:40.203248", >>>>>>>> "last_fullsized": "2018-07-16 13:36:40.203248", >>>>>>>> "mapping_epoch": 30578, >>>>>>>> "log_start": "30537'5387475", >>>>>>>> "ondisk_log_start": "30537'5387475", >>>>>>>> "created": 1, >>>>>>>> "last_epoch_clean": 30576, >>>>>>>> "parent": "0.0", >>>>>>>> "parent_split_bits": 0, >>>>>>>> "last_scrub": "30554'5390240", >>>>>>>> "last_scrub_stamp": "2018-07-16 12:27:03.547524", >>>>>>>> "last_deep_scrub": "30554'5390240", >>>>>>>> "last_deep_scrub_stamp": "2018-07-16 12:27:03.547524", >>>>>>>> "last_clean_scrub_stamp": "2018-07-13 08:45:32.622555", >>>>>>>> "log_size": 3100, >>>>>>>> "ondisk_log_size": 3100, >>>>>>>> "stats_invalid": false, >>>>>>>> "dirty_stats_invalid": false, >>>>>>>> "omap_stats_invalid": false, >>>>>>>> "hitset_stats_invalid": false, >>>>>>>> "hitset_bytes_stats_invalid": false, >>>>>>>> "pin_stats_invalid": true, >>>>>>>> "stat_sum": { >>>>>>>> "num_bytes": 16841281553, >>>>>>>> "num_objects": 4123, >>>>>>>> "num_object_clones": 0, 
>>>>>>>> "num_object_copies": 12369, >>>>>>>> "num_objects_missing_on_primary": 0, >>>>>>>> "num_objects_missing": 0, >>>>>>>> "num_objects_degraded": 4123, >>>>>>>> "num_objects_misplaced": 0, >>>>>>>> "num_objects_unfound": 0, >>>>>>>> "num_objects_dirty": 4123, >>>>>>>> "num_whiteouts": 0, >>>>>>>> "num_read": 6870027, >>>>>>>> "num_read_kb": 291425720, >>>>>>>> "num_write": 9972836, >>>>>>>> "num_write_kb": 184701865, >>>>>>>> "num_scrub_errors": 1, >>>>>>>> "num_shallow_scrub_errors": 1, >>>>>>>> "num_deep_scrub_errors": 0, >>>>>>>> "num_objects_recovered": 103596, >>>>>>>> "num_bytes_recovered": 424099565959, >>>>>>>> "num_keys_recovered": 110, >>>>>>>> "num_objects_omap": 1, >>>>>>>> "num_objects_hit_set_archive": 0, >>>>>>>> "num_bytes_hit_set_archive": 0, >>>>>>>> "num_flush": 0, >>>>>>>> "num_flush_kb": 0, >>>>>>>> "num_evict": 0, >>>>>>>> "num_evict_kb": 0, >>>>>>>> "num_promote": 0, >>>>>>>> "num_flush_mode_high": 0, >>>>>>>> "num_flush_mode_low": 0, >>>>>>>> "num_evict_mode_some": 0, >>>>>>>> "num_evict_mode_full": 0, >>>>>>>> "num_objects_pinned": 0 >>>>>>>> }, >>>>>>>> "up": [ >>>>>>>> 37, >>>>>>>> 44, >>>>>>>> 16 >>>>>>>> ], >>>>>>>> "acting": [ >>>>>>>> 37, >>>>>>>> 44, >>>>>>>> 16 >>>>>>>> ], >>>>>>>> "blocked_by": [], >>>>>>>> "up_primary": 37, >>>>>>>> "acting_primary": 37 >>>>>>>> }, >>>>>>>> "empty": 0, >>>>>>>> "dne": 0, >>>>>>>> "incomplete": 0, >>>>>>>> "last_epoch_started": 30580, >>>>>>>> "hit_set_history": { >>>>>>>> "current_last_update": "0'0", >>>>>>>> "history": [] >>>>>>>> } >>>>>>>> }, >>>>>>>> { >>>>>>>> "peer": "44", >>>>>>>> "pgid": "0.190", >>>>>>>> "last_update": "30581'5420535", >>>>>>>> "last_complete": "30570'5390575", >>>>>>>> "log_tail": "30537'5387475", >>>>>>>> "last_user_version": 5390575, >>>>>>>> "last_backfill": "MAX", >>>>>>>> "last_backfill_bitwise": 1, >>>>>>>> "purged_snaps": "[]", >>>>>>>> "history": { >>>>>>>> "epoch_created": 1, >>>>>>>> "last_epoch_started": 30580, >>>>>>>> "last_epoch_clean": 30581, 
>>>>>>>> "last_epoch_split": 0, >>>>>>>> "last_epoch_marked_full": 0, >>>>>>>> "same_up_since": 30578, >>>>>>>> "same_interval_since": 30579, >>>>>>>> "same_primary_since": 30565, >>>>>>>> "last_scrub": "30554'5390240", >>>>>>>> "last_scrub_stamp": "2018-07-16 12:27:03.547524", >>>>>>>> "last_deep_scrub": "30554'5390240", >>>>>>>> "last_deep_scrub_stamp": "2018-07-16 12:27:03.547524", >>>>>>>> "last_clean_scrub_stamp": "2018-07-13 08:45:32.622555" >>>>>>>> }, >>>>>>>> "stats": { >>>>>>>> "version": "30568'5390574", >>>>>>>> "reported_seq": "5139846", >>>>>>>> "reported_epoch": "30570", >>>>>>>> "state": "active+undersized+degraded+inconsistent", >>>>>>>> "last_fresh": "2018-07-16 13:36:07.003551", >>>>>>>> "last_change": "2018-07-16 13:36:07.002580", >>>>>>>> "last_active": "2018-07-16 13:36:07.003551", >>>>>>>> "last_peered": "2018-07-16 13:36:07.003551", >>>>>>>> "last_clean": "2018-07-16 13:35:50.922619", >>>>>>>> "last_became_active": "2018-07-16 13:36:07.002580", >>>>>>>> "last_became_peered": "2018-07-16 13:36:07.002580", >>>>>>>> "last_unstale": "2018-07-16 13:36:07.003551", >>>>>>>> "last_undegraded": "2018-07-16 13:36:05.922413", >>>>>>>> "last_fullsized": "2018-07-16 13:36:05.922413", >>>>>>>> "mapping_epoch": 30578, >>>>>>>> "log_start": "30537'5387475", >>>>>>>> "ondisk_log_start": "30537'5387475", >>>>>>>> "created": 1, >>>>>>>> "last_epoch_clean": 30570, >>>>>>>> "parent": "0.0", >>>>>>>> "parent_split_bits": 0, >>>>>>>> "last_scrub": "30554'5390240", >>>>>>>> "last_scrub_stamp": "2018-07-16 12:27:03.547524", >>>>>>>> "last_deep_scrub": "30554'5390240", >>>>>>>> "last_deep_scrub_stamp": "2018-07-16 12:27:03.547524", >>>>>>>> "last_clean_scrub_stamp": "2018-07-13 08:45:32.622555", >>>>>>>> "log_size": 3099, >>>>>>>> "ondisk_log_size": 3099, >>>>>>>> "stats_invalid": false, >>>>>>>> "dirty_stats_invalid": false, >>>>>>>> "omap_stats_invalid": false, >>>>>>>> "hitset_stats_invalid": false, >>>>>>>> "hitset_bytes_stats_invalid": false, >>>>>>>> 
"pin_stats_invalid": true, >>>>>>>> "stat_sum": { >>>>>>>> "num_bytes": 16841281553, >>>>>>>> "num_objects": 4123, >>>>>>>> "num_object_clones": 0, >>>>>>>> "num_object_copies": 12369, >>>>>>>> "num_objects_missing_on_primary": 0, >>>>>>>> "num_objects_missing": 0, >>>>>>>> "num_objects_degraded": 4123, >>>>>>>> "num_objects_misplaced": 0, >>>>>>>> "num_objects_unfound": 0, >>>>>>>> "num_objects_dirty": 4123, >>>>>>>> "num_whiteouts": 0, >>>>>>>> "num_read": 6870027, >>>>>>>> "num_read_kb": 291425720, >>>>>>>> "num_write": 9972832, >>>>>>>> "num_write_kb": 184701853, >>>>>>>> "num_scrub_errors": 1, >>>>>>>> "num_shallow_scrub_errors": 1, >>>>>>>> "num_deep_scrub_errors": 0, >>>>>>>> "num_objects_recovered": 103594, >>>>>>>> "num_bytes_recovered": 424091177351, >>>>>>>> "num_keys_recovered": 110, >>>>>>>> "num_objects_omap": 1, >>>>>>>> "num_objects_hit_set_archive": 0, >>>>>>>> "num_bytes_hit_set_archive": 0, >>>>>>>> "num_flush": 0, >>>>>>>> "num_flush_kb": 0, >>>>>>>> "num_evict": 0, >>>>>>>> "num_evict_kb": 0, >>>>>>>> "num_promote": 0, >>>>>>>> "num_flush_mode_high": 0, >>>>>>>> "num_flush_mode_low": 0, >>>>>>>> "num_evict_mode_some": 0, >>>>>>>> "num_evict_mode_full": 0, >>>>>>>> "num_objects_pinned": 0 >>>>>>>> }, >>>>>>>> "up": [ >>>>>>>> 37, >>>>>>>> 44, >>>>>>>> 16 >>>>>>>> ], >>>>>>>> "acting": [ >>>>>>>> 37, >>>>>>>> 44, >>>>>>>> 16 >>>>>>>> ], >>>>>>>> "blocked_by": [], >>>>>>>> "up_primary": 37, >>>>>>>> "acting_primary": 37 >>>>>>>> }, >>>>>>>> "empty": 0, >>>>>>>> "dne": 0, >>>>>>>> "incomplete": 0, >>>>>>>> "last_epoch_started": 30580, >>>>>>>> "hit_set_history": { >>>>>>>> "current_last_update": "0'0", >>>>>>>> "history": [] >>>>>>>> } >>>>>>>> } >>>>>>>> ], >>>>>>>> "recovery_state": [ >>>>>>>> { >>>>>>>> "name": "Started\/Primary\/Active", >>>>>>>> "enter_time": "2018-07-16 13:37:13.050211", >>>>>>>> "might_have_unfound": [ >>>>>>>> { >>>>>>>> "osd": "16", >>>>>>>> "status": "already probed" >>>>>>>> }, >>>>>>>> { >>>>>>>> "osd": "44", >>>>>>>> 
"status": "already probed" >>>>>>>> } >>>>>>>> ], >>>>>>>> "recovery_progress": { >>>>>>>> "backfill_targets": [], >>>>>>>> "waiting_on_backfill": [], >>>>>>>> "last_backfill_started": "MIN", >>>>>>>> "backfill_info": { >>>>>>>> "begin": "MIN", >>>>>>>> "end": "MIN", >>>>>>>> "objects": [] >>>>>>>> }, >>>>>>>> "peer_backfill_info": [], >>>>>>>> "backfills_in_flight": [], >>>>>>>> "recovering": [], >>>>>>>> "pg_backend": { >>>>>>>> "pull_from_peer": [], >>>>>>>> "pushing": [] >>>>>>>> } >>>>>>>> }, >>>>>>>> "scrub": { >>>>>>>> "scrubber.epoch_start": "0", >>>>>>>> "scrubber.active": 0, >>>>>>>> "scrubber.state": "INACTIVE", >>>>>>>> "scrubber.start": "MIN", >>>>>>>> "scrubber.end": "MIN", >>>>>>>> "scrubber.subset_last_update": "0'0", >>>>>>>> "scrubber.deep": false, >>>>>>>> "scrubber.seed": 0, >>>>>>>> "scrubber.waiting_on": 0, >>>>>>>> "scrubber.waiting_on_whom": [] >>>>>>>> } >>>>>>>> }, >>>>>>>> { >>>>>>>> "name": "Started", >>>>>>>> "enter_time": "2018-07-16 13:37:11.980264" >>>>>>>> } >>>>>>>> ], >>>>>>>> "agent_state": {} >>>>>>>> } >>>>>>>> >>>>>>>> >>>>>>>> On 17/07/18 02:19, Brad Hubbard wrote: >>>>>>>>> Can we see a pg query of 0.190 ? >>>>>>>>> >>>>>>>>> On Tue, Jul 17, 2018 at 1:05 AM, Ana Aviles <ana@xxxxxxxxxxxx> wrote: >>>>>>>>>> Hello, >>>>>>>>>> >>>>>>>>>> We have a cluster that was running hammer (0.94.10). We hit a bug where >>>>>>>>>> right after seemingly fixing an inconsistent PG, the primary OSD would >>>>>>>>>> crash and restart. Next deep-scrub will again return inconsistent PG. >>>>>>>>>> >>>>>>>>>> We filled in a bug issue >>>>>>>>>> https://tracker.ceph.com/issues/24652#change-115654 that was closed >>>>>>>>>> since it was a known bug fixed in newer versions of Ceph. >>>>>>>>>> >>>>>>>>>> Now the cluster is running jewel (10.2.11). There is again one >>>>>>>>>> inconsistent PG with 1 error which not able to fix and with no >>>>>>>>>> reference to the inconsistent object. 
>>>>>>>>>>
>>>>>>>>>> scrub 0 missing, 1 inconsistent objects
>>>>>>>>>> scrub 1 errors
>>>>>>>>>>
>>>>>>>>>> We have the logs with debug level 20 while repairing the PG. The one for
>>>>>>>>>> the primary OSD is: 94e20123-fcda-49d7-98a2-919507dfbc92
>>>>>>>>>>
>>>>>>>>>> Thanks!
>>>>>>>>>> Kind regards,
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Ana Avilés
>>>>>>>>>> Greenhost - sustainable hosting & digital security
>>>>>>>>>> E: ana@xxxxxxxxxxxx
>>>>>>>>>> T: +31 20 4890444
>>>>>>>>>> W: https://greenhost.nl
>>>>>>>>>> --
>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>>>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>>>>>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
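
For anyone triaging a similar report, here is a minimal sketch (not from the thread) of how the "rados list-inconsistent-obj" JSON quoted above could be summarised to see which OSDs agree on an object's data digest. The function name group_shards_by_digest and the trimmed sample are hypothetical. The fallback to the "dd" field of object_info is an assumption that may be misleading, since that field is the digest recorded in the object info rather than one computed by the scrub. The script does not decide which copy is "right"; as discussed above, that remains a judgment call.

```python
import json
import re
from collections import defaultdict

# Matches the "dd <hex>" field of an object_info string as printed in this thread.
DD_RE = re.compile(r"\bdd ([0-9a-f]+)\b")


def group_shards_by_digest(report):
    """Return {object name: {data digest: [osd ids]}} for one report."""
    result = {}
    for item in report.get("inconsistents", []):
        groups = defaultdict(list)
        for shard in item.get("shards", []):
            digest = shard.get("data_digest")
            if digest is None:
                # Fallback (assumption): use the digest recorded in object_info.
                match = DD_RE.search(shard.get("object_info", ""))
                digest = "0x" + match.group(1) if match else "unknown"
            groups[digest].append(shard["osd"])
        result[item["object"]["name"]] = dict(groups)
    return result


# Trimmed-down sample using the digests reported for pg 0.186 in this thread.
SAMPLE = json.loads("""
{"epoch": 30586, "inconsistents": [{
  "object": {"name": "rbd_data.15cec2ae8944a.000000000004db0e"},
  "shards": [
    {"osd": 26, "data_digest": "0x7dd0d0bd"},
    {"osd": 36, "data_digest": "0x7dd0d0bd"},
    {"osd": 44, "data_digest": "0x264b7d0d"}
  ]}]}
""")

# Print one line per (object, digest) group with the OSDs that hold it.
for name, groups in group_shards_by_digest(SAMPLE).items():
    for digest, osds in groups.items():
        print(name, digest, osds)
```

In practice the report would presumably be read from the command output (for example with json.load on standard input) instead of the embedded sample. A digest held by only one OSD is a hint about where to look, not proof that the copy is bad.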