OK. What I *meant* to ask for was the output of "rados list-inconsistent-obj 0.190" (might still be worth posting that but it should just confirm findings below).

The relevant lines from the log are below.

2018-07-16 12:24:45.940910 7fb422340700 2 osd.37 pg_epoch: 30554 pg[0.190( v 30554'5390084 (30537'5387075,30554'5390084] local-les=30554 n=4123 ec=1 les/c/f 30554/30554/0 30552/30553/30542) [37,44,16] r=0 lpr=30553 crt=30554'5390079 lcod 30554'5390083 mlcod 30554'5390083 active+clean+scrubbing+deep+inconsistent+repair] 0.190 shard 16: soid 0:09c2dd3e:::rbd_data.15cec2ae8944a.000000000015c7d6:head data_digest 0x264b7d0d != data_digest 0x7dd0d0bd from shard 44, data_digest 0x264b7d0d != data_digest 0x7dd0d0bd from auth oi 0:618e3778:::rbd_data.15cec2ae8944a.000000000004db0e:head(30537'5509201 osd.36.0:8552301 dirty|data_digest|omap_digest s 4194304 uv 5498082 dd 7dd0d0bd od ffffffff alloc_hint [4194304 4194304]), attr value mismatch '_'
2018-07-16 12:24:45.940941 7fb422340700 -1 log_channel(cluster) log [ERR] : 0.190 shard 16: soid 0:09c2dd3e:::rbd_data.15cec2ae8944a.000000000015c7d6:head data_digest 0x264b7d0d != data_digest 0x7dd0d0bd from shard 44, data_digest 0x264b7d0d != data_digest 0x7dd0d0bd from auth oi 0:618e3778:::rbd_data.15cec2ae8944a.000000000004db0e:head(30537'5509201 osd.36.0:8552301 dirty|data_digest|omap_digest s 4194304 uv 5498082 dd 7dd0d0bd od ffffffff alloc_hint [4194304 4194304]), attr value mismatch '_'
2018-07-16 12:24:45.940957 7fb422340700 -1 log_channel(cluster) log [ERR] : 0.190 shard 37: soid 0:09c2dd3e:::rbd_data.15cec2ae8944a.000000000015c7d6:head data_digest 0x264b7d0d != data_digest 0x7dd0d0bd from shard 44, data_digest 0x264b7d0d != data_digest 0x7dd0d0bd from auth oi 0:618e3778:::rbd_data.15cec2ae8944a.000000000004db0e:head(30537'5509201 osd.36.0:8552301 dirty|data_digest|omap_digest s 4194304 uv 5498082 dd 7dd0d0bd od ffffffff alloc_hint [4194304 4194304]), attr value mismatch '_'

They show that osd 44 has been chosen as the
authoritative shard and it has a data digest for this object of 0x7dd0d0bd, and the data digest in the authoritative object info is also 0x7dd0d0bd. Shard 16, however, has a data digest of 0x264b7d0d, and so does shard 37, so the data for this object on osds 16 and 37 is different to that on osd 44.

Basically, you'll need to pick which is the "right" copy of the object (I can't tell you), quiesce traffic to/from that object (rbd image), and get/put that object back into the cluster to fix the mismatch. Since this appears to be an rbd image, this could potentially result in an image that needs an fsck or equivalent, IIUC.

On Tue, Jul 17, 2018 at 10:06 PM, Ana Aviles <ana@xxxxxxxxxxxx> wrote:
>
> Hi Brad,
>
> Here it is:
>
> { > "state": "active+clean+inconsistent", > "snap_trimq": "[]", > "epoch": 30581, > "up": [ > 37, > 44, > 16 > ], > "acting": [ > 37, > 44, > 16 > ], > "actingbackfill": [ > "16", > "37", > "44" > ], > "info": { > "pgid": "0.190", > "last_update": "30581'5420535", > "last_complete": "30581'5420535", > "log_tail": "30581'5417484", > "last_user_version": 5420535, > "last_backfill": "MAX", > "last_backfill_bitwise": 0, > "purged_snaps": "[]", > "history": { > "epoch_created": 1, > "last_epoch_started": 30580, > "last_epoch_clean": 30581, > "last_epoch_split": 0, > "last_epoch_marked_full": 0, > "same_up_since": 30578, > "same_interval_since": 30579, > "same_primary_since": 30565, > "last_scrub": "30554'5390240", > "last_scrub_stamp": "2018-07-16 12:27:03.547524", > "last_deep_scrub": "30554'5390240", > "last_deep_scrub_stamp": "2018-07-16 12:27:03.547524", > "last_clean_scrub_stamp": "2018-07-13 08:45:32.622555" > }, > "stats": { > "version": "30581'5420535", > "reported_seq": "5155553", > "reported_epoch": "30581", > "state": "active+clean+inconsistent", > "last_fresh": "2018-07-17 12:02:13.002428", > "last_change": "2018-07-16 13:37:24.020403", > "last_active": "2018-07-17 12:02:13.002428", > "last_peered": "2018-07-17 12:02:13.002428", > 
"last_clean": "2018-07-17 12:02:13.002428", > "last_became_active": "2018-07-16 13:37:13.173821", > "last_became_peered": "2018-07-16 13:37:13.173821", > "last_unstale": "2018-07-17 12:02:13.002428", > "last_undegraded": "2018-07-17 12:02:13.002428", > "last_fullsized": "2018-07-17 12:02:13.002428", > "mapping_epoch": 30578, > "log_start": "30581'5417484", > "ondisk_log_start": "30581'5417484", > "created": 1, > "last_epoch_clean": 30581, > "parent": "0.0", > "parent_split_bits": 0, > "last_scrub": "30554'5390240", > "last_scrub_stamp": "2018-07-16 12:27:03.547524", > "last_deep_scrub": "30554'5390240", > "last_deep_scrub_stamp": "2018-07-16 12:27:03.547524", > "last_clean_scrub_stamp": "2018-07-13 08:45:32.622555", > "log_size": 3051, > "ondisk_log_size": 3051, > "stats_invalid": false, > "dirty_stats_invalid": false, > "omap_stats_invalid": false, > "hitset_stats_invalid": false, > "hitset_bytes_stats_invalid": false, > "pin_stats_invalid": true, > "stat_sum": { > "num_bytes": 16946139153, > "num_objects": 4148, > "num_object_clones": 0, > "num_object_copies": 12444, > "num_objects_missing_on_primary": 0, > "num_objects_missing": 0, > "num_objects_degraded": 0, > "num_objects_misplaced": 0, > "num_objects_unfound": 0, > "num_objects_dirty": 4148, > "num_whiteouts": 0, > "num_read": 6895104, > "num_read_kb": 292185552, > "num_write": 10032749, > "num_write_kb": 185167701, > "num_scrub_errors": 1, > "num_shallow_scrub_errors": 1, > "num_deep_scrub_errors": 0, > "num_objects_recovered": 103598, > "num_bytes_recovered": 424107954567, > "num_keys_recovered": 110, > "num_objects_omap": 1, > "num_objects_hit_set_archive": 0, > "num_bytes_hit_set_archive": 0, > "num_flush": 0, > "num_flush_kb": 0, > "num_evict": 0, > "num_evict_kb": 0, > "num_promote": 0, > "num_flush_mode_high": 0, > "num_flush_mode_low": 0, > "num_evict_mode_some": 0, > "num_evict_mode_full": 0, > "num_objects_pinned": 0 > }, > "up": [ > 37, > 44, > 16 > ], > "acting": [ > 37, > 44, > 16 > ], > 
"blocked_by": [], > "up_primary": 37, > "acting_primary": 37 > }, > "empty": 0, > "dne": 0, > "incomplete": 0, > "last_epoch_started": 30580, > "hit_set_history": { > "current_last_update": "0'0", > "history": [] > } > }, > "peer_info": [ > { > "peer": "16", > "pgid": "0.190", > "last_update": "30581'5420535", > "last_complete": "30581'5420535", > "log_tail": "30537'5387475", > "last_user_version": 5390577, > "last_backfill": "MAX", > "last_backfill_bitwise": 1, > "purged_snaps": "[]", > "history": { > "epoch_created": 1, > "last_epoch_started": 30580, > "last_epoch_clean": 30581, > "last_epoch_split": 0, > "last_epoch_marked_full": 0, > "same_up_since": 30578, > "same_interval_since": 30579, > "same_primary_since": 30565, > "last_scrub": "30554'5390240", > "last_scrub_stamp": "2018-07-16 12:27:03.547524", > "last_deep_scrub": "30554'5390240", > "last_deep_scrub_stamp": "2018-07-16 12:27:03.547524", > "last_clean_scrub_stamp": "2018-07-13 08:45:32.622555" > }, > "stats": { > "version": "30570'5390575", > "reported_seq": "5139870", > "reported_epoch": "30576", > "state": "active+undersized+degraded+inconsistent", > "last_fresh": "2018-07-16 13:36:40.284756", > "last_change": "2018-07-16 13:36:40.284277", > "last_active": "2018-07-16 13:36:40.284756", > "last_peered": "2018-07-16 13:36:40.284756", > "last_clean": "2018-07-16 13:36:23.558224", > "last_became_active": "2018-07-16 13:36:40.284277", > "last_became_peered": "2018-07-16 13:36:40.284277", > "last_unstale": "2018-07-16 13:36:40.284756", > "last_undegraded": "2018-07-16 13:36:40.203248", > "last_fullsized": "2018-07-16 13:36:40.203248", > "mapping_epoch": 30578, > "log_start": "30537'5387475", > "ondisk_log_start": "30537'5387475", > "created": 1, > "last_epoch_clean": 30576, > "parent": "0.0", > "parent_split_bits": 0, > "last_scrub": "30554'5390240", > "last_scrub_stamp": "2018-07-16 12:27:03.547524", > "last_deep_scrub": "30554'5390240", > "last_deep_scrub_stamp": "2018-07-16 12:27:03.547524", > 
"last_clean_scrub_stamp": "2018-07-13 08:45:32.622555", > "log_size": 3100, > "ondisk_log_size": 3100, > "stats_invalid": false, > "dirty_stats_invalid": false, > "omap_stats_invalid": false, > "hitset_stats_invalid": false, > "hitset_bytes_stats_invalid": false, > "pin_stats_invalid": true, > "stat_sum": { > "num_bytes": 16841281553, > "num_objects": 4123, > "num_object_clones": 0, > "num_object_copies": 12369, > "num_objects_missing_on_primary": 0, > "num_objects_missing": 0, > "num_objects_degraded": 4123, > "num_objects_misplaced": 0, > "num_objects_unfound": 0, > "num_objects_dirty": 4123, > "num_whiteouts": 0, > "num_read": 6870027, > "num_read_kb": 291425720, > "num_write": 9972836, > "num_write_kb": 184701865, > "num_scrub_errors": 1, > "num_shallow_scrub_errors": 1, > "num_deep_scrub_errors": 0, > "num_objects_recovered": 103596, > "num_bytes_recovered": 424099565959, > "num_keys_recovered": 110, > "num_objects_omap": 1, > "num_objects_hit_set_archive": 0, > "num_bytes_hit_set_archive": 0, > "num_flush": 0, > "num_flush_kb": 0, > "num_evict": 0, > "num_evict_kb": 0, > "num_promote": 0, > "num_flush_mode_high": 0, > "num_flush_mode_low": 0, > "num_evict_mode_some": 0, > "num_evict_mode_full": 0, > "num_objects_pinned": 0 > }, > "up": [ > 37, > 44, > 16 > ], > "acting": [ > 37, > 44, > 16 > ], > "blocked_by": [], > "up_primary": 37, > "acting_primary": 37 > }, > "empty": 0, > "dne": 0, > "incomplete": 0, > "last_epoch_started": 30580, > "hit_set_history": { > "current_last_update": "0'0", > "history": [] > } > }, > { > "peer": "44", > "pgid": "0.190", > "last_update": "30581'5420535", > "last_complete": "30570'5390575", > "log_tail": "30537'5387475", > "last_user_version": 5390575, > "last_backfill": "MAX", > "last_backfill_bitwise": 1, > "purged_snaps": "[]", > "history": { > "epoch_created": 1, > "last_epoch_started": 30580, > "last_epoch_clean": 30581, > "last_epoch_split": 0, > "last_epoch_marked_full": 0, > "same_up_since": 30578, > 
"same_interval_since": 30579, > "same_primary_since": 30565, > "last_scrub": "30554'5390240", > "last_scrub_stamp": "2018-07-16 12:27:03.547524", > "last_deep_scrub": "30554'5390240", > "last_deep_scrub_stamp": "2018-07-16 12:27:03.547524", > "last_clean_scrub_stamp": "2018-07-13 08:45:32.622555" > }, > "stats": { > "version": "30568'5390574", > "reported_seq": "5139846", > "reported_epoch": "30570", > "state": "active+undersized+degraded+inconsistent", > "last_fresh": "2018-07-16 13:36:07.003551", > "last_change": "2018-07-16 13:36:07.002580", > "last_active": "2018-07-16 13:36:07.003551", > "last_peered": "2018-07-16 13:36:07.003551", > "last_clean": "2018-07-16 13:35:50.922619", > "last_became_active": "2018-07-16 13:36:07.002580", > "last_became_peered": "2018-07-16 13:36:07.002580", > "last_unstale": "2018-07-16 13:36:07.003551", > "last_undegraded": "2018-07-16 13:36:05.922413", > "last_fullsized": "2018-07-16 13:36:05.922413", > "mapping_epoch": 30578, > "log_start": "30537'5387475", > "ondisk_log_start": "30537'5387475", > "created": 1, > "last_epoch_clean": 30570, > "parent": "0.0", > "parent_split_bits": 0, > "last_scrub": "30554'5390240", > "last_scrub_stamp": "2018-07-16 12:27:03.547524", > "last_deep_scrub": "30554'5390240", > "last_deep_scrub_stamp": "2018-07-16 12:27:03.547524", > "last_clean_scrub_stamp": "2018-07-13 08:45:32.622555", > "log_size": 3099, > "ondisk_log_size": 3099, > "stats_invalid": false, > "dirty_stats_invalid": false, > "omap_stats_invalid": false, > "hitset_stats_invalid": false, > "hitset_bytes_stats_invalid": false, > "pin_stats_invalid": true, > "stat_sum": { > "num_bytes": 16841281553, > "num_objects": 4123, > "num_object_clones": 0, > "num_object_copies": 12369, > "num_objects_missing_on_primary": 0, > "num_objects_missing": 0, > "num_objects_degraded": 4123, > "num_objects_misplaced": 0, > "num_objects_unfound": 0, > "num_objects_dirty": 4123, > "num_whiteouts": 0, > "num_read": 6870027, > "num_read_kb": 291425720, > 
"num_write": 9972832, > "num_write_kb": 184701853, > "num_scrub_errors": 1, > "num_shallow_scrub_errors": 1, > "num_deep_scrub_errors": 0, > "num_objects_recovered": 103594, > "num_bytes_recovered": 424091177351, > "num_keys_recovered": 110, > "num_objects_omap": 1, > "num_objects_hit_set_archive": 0, > "num_bytes_hit_set_archive": 0, > "num_flush": 0, > "num_flush_kb": 0, > "num_evict": 0, > "num_evict_kb": 0, > "num_promote": 0, > "num_flush_mode_high": 0, > "num_flush_mode_low": 0, > "num_evict_mode_some": 0, > "num_evict_mode_full": 0, > "num_objects_pinned": 0 > }, > "up": [ > 37, > 44, > 16 > ], > "acting": [ > 37, > 44, > 16 > ], > "blocked_by": [], > "up_primary": 37, > "acting_primary": 37 > }, > "empty": 0, > "dne": 0, > "incomplete": 0, > "last_epoch_started": 30580, > "hit_set_history": { > "current_last_update": "0'0", > "history": [] > } > } > ], > "recovery_state": [ > { > "name": "Started\/Primary\/Active", > "enter_time": "2018-07-16 13:37:13.050211", > "might_have_unfound": [ > { > "osd": "16", > "status": "already probed" > }, > { > "osd": "44", > "status": "already probed" > } > ], > "recovery_progress": { > "backfill_targets": [], > "waiting_on_backfill": [], > "last_backfill_started": "MIN", > "backfill_info": { > "begin": "MIN", > "end": "MIN", > "objects": [] > }, > "peer_backfill_info": [], > "backfills_in_flight": [], > "recovering": [], > "pg_backend": { > "pull_from_peer": [], > "pushing": [] > } > }, > "scrub": { > "scrubber.epoch_start": "0", > "scrubber.active": 0, > "scrubber.state": "INACTIVE", > "scrubber.start": "MIN", > "scrubber.end": "MIN", > "scrubber.subset_last_update": "0'0", > "scrubber.deep": false, > "scrubber.seed": 0, > "scrubber.waiting_on": 0, > "scrubber.waiting_on_whom": [] > } > }, > { > "name": "Started", > "enter_time": "2018-07-16 13:37:11.980264" > } > ], > "agent_state": {} > } > > > On 17/07/18 02:19, Brad Hubbard wrote: >> Can we see a pg query of 0.190 ? 
>>
>> On Tue, Jul 17, 2018 at 1:05 AM, Ana Aviles <ana@xxxxxxxxxxxx> wrote:
>>> Hello,
>>>
>>> We have a cluster that was running hammer (0.94.10). We hit a bug where,
>>> right after seemingly fixing an inconsistent PG, the primary OSD would
>>> crash and restart. The next deep-scrub would again report the PG as
>>> inconsistent.
>>>
>>> We filed a bug report,
>>> https://tracker.ceph.com/issues/24652#change-115654, which was closed
>>> since it was a known bug fixed in newer versions of Ceph.
>>>
>>> Now the cluster is running jewel (10.2.11). There is again one
>>> inconsistent PG with 1 error, which we are not able to fix, and with no
>>> reference to the inconsistent object.
>>>
>>>
>>> scrub 0 missing, 1 inconsistent objects
>>> scrub 1 errors
>>>
>>>
>>> We have the logs with debug level 20 while repairing the PG. The one for
>>> the primary OSD is: 94e20123-fcda-49d7-98a2-919507dfbc92
>>>
>>> Thanks!
>>> Kind regards,
>>>
>>>
>>> --
>>> Ana Avilés
>>> Greenhost - sustainable hosting & digital security
>>> E: ana@xxxxxxxxxxxx
>>> T: +31 20 4890444
>>> W: https://greenhost.nl
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>
>>
>>
>

--
Cheers,
Brad
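
P.S. If it helps to see the digest comparison from the top of this mail spelled out, here is a rough Python sketch that groups shards by the data_digest they report in the scrub error lines. The function name and the regex are my own illustration, not anything shipped with Ceph, and the pattern only assumes the message wording seen in the jewel log lines quoted above; other releases may word it differently.

```python
import re
from collections import defaultdict

# Matches the "shard N: soid X data_digest A != data_digest B from shard M"
# fragment of the scrub error text quoted in this mail. This wording is an
# assumption taken from these jewel log lines, not a stable Ceph format.
SHARD_RE = re.compile(
    r"shard (?P<shard>\d+): soid (?P<soid>\S+) "
    r"data_digest (?P<own>0x[0-9a-f]+) != "
    r"data_digest (?P<other>0x[0-9a-f]+) from shard (?P<peer>\d+)"
)


def group_shards_by_digest(log_text):
    """Return {soid: {digest: sorted shard ids}} for every matching line."""
    groups = defaultdict(lambda: defaultdict(set))
    for m in SHARD_RE.finditer(log_text):
        # The reporting shard holds the first digest ...
        groups[m["soid"]][m["own"]].add(int(m["shard"]))
        # ... and the shard it was compared against holds the second.
        groups[m["soid"]][m["other"]].add(int(m["peer"]))
    return {
        soid: {digest: sorted(shards) for digest, shards in by_digest.items()}
        for soid, by_digest in groups.items()
    }


if __name__ == "__main__":
    # Shortened copies of the two distinct [ERR] messages from the log.
    soid = "0:09c2dd3e:::rbd_data.15cec2ae8944a.000000000015c7d6:head"
    sample = (
        f"0.190 shard 16: soid {soid} "
        "data_digest 0x264b7d0d != data_digest 0x7dd0d0bd from shard 44\n"
        f"0.190 shard 37: soid {soid} "
        "data_digest 0x264b7d0d != data_digest 0x7dd0d0bd from shard 44\n"
    )
    for name, by_digest in group_shards_by_digest(sample).items():
        print(name)
        for digest, shards in by_digest.items():
            print(f"  {digest}: shards {shards}")
```

This only tells you which osds agree with each other. It does not tell you which copy is the "right" one; that still has to be decided by looking at the data.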