cephfs metadata pool: deep-scrub error "omap_digest != best guess omap_digest"

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Dear Ceph / CephFS supports.

We are currently running Jewel 10.2.2.

From time to time we experience deep-scrub errors in pgs inside our cephfs metadata pool. It is important to note that we do not see any hardware errors on the osds themselves so the error must have some other source.

The error itself is the following:

# cat /var/log/ceph/ceph.log| grep 5.3d0
2016-08-30 00:30:53.492626 osd.78 192.231.127.171:6828/6072 331 : cluster [INF] 5.3d0 deep-scrub starts
2016-08-30 00:30:54.276134 osd.78 192.231.127.171:6828/6072 332 : cluster [ERR] 5.3d0 shard 78: soid 5:0bd6d154:::602.00000000:head omap_digest 0xf3fdfd0c != best guess omap_digest 0x23b2eae0 from auth shard 49
2016-08-30 00:30:54.747795 osd.78 192.231.127.171:6828/6072 333 : cluster [ERR] 5.3d0 deep-scrub 0 missing, 1 inconsistent objects
2016-08-30 00:30:54.747801 osd.78 192.231.127.171:6828/6072 334 : cluster [ERR] 5.3d0 deep-scrub 1 errors

For us there are a few unknowns on how to recover from this error:

1) The first issue is that we do really do not understand the nature of the error. What does it mean "omap_digest != best guess omap_digest"? it seems to point to some problem in the digest of omap contents between the two osds but does not tell you exactly what.
 
2) The second issue is that it is really difficult to try to explore metadata objects and omap info. While in the data pool we do know how to inspect pgs and object contents (and decide what is the problematic osd by comparison in a 3 replica setup), in the metadata pool we have to access pg contents using 'ceph-objectstore-tool'. For that, we have to stop the osd so that the deamon releases the omap lock. Moreover, I have successfully imported / exported / listed pgs contents but I was never able to query omap contents of objects inside pgs. Maybe I am doing it wrong but I do not find the tool helpful at the moment for this precise task.

3)  Finally, I am unsure what is the consequence of running 'pg repair'. In my specific case, the primary osd is on a host which was recently added to production. Moreover, the dates of the problematic object match for the secondary piers which gives me the feeling that the primary osd (78) might be the problematic one. I know that in the past, the default behavior was simply to copy the pg contents of the primary osd to the others. That can lead to data corruption if the problematic osd is indeed the primary, and I wonder if in Jewel there is some smarter way to do the pg repair.

[root@server9 ~]# ll /var/lib/ceph/osd/ceph-78/current/5.3d0_head/602.00000000__head_2A8B6BD0__5
-rw-r--r-- 1 ceph ceph 0 Aug 25 21:41 /var/lib/ceph/osd/ceph-78/current/5.3d0_head/602.00000000__head_2A8B6BD0__5


[root@server7 ~]# ll /var/lib/ceph/osd/ceph-49/current/5.3d0_head/602.00000000__head_2A8B6BD0__5
-rw-r--r-- 1 ceph ceph 0 Jul 27 02:30 /var/lib/ceph/osd/ceph-49/current/5.3d0_head/602.00000000__head_2A8B6BD0__5


[root@server8 ~]# ll /var/lib/ceph/osd/ceph-59/current/5.3d0_head/602.00000000__head_2A8B6BD0__5
-rw-r--r-- 1 ceph ceph 0 Jul 27 02:30 /var/lib/ceph/osd/ceph-59/current/5.3d0_head/602.00000000__head_2A8B6BD0__5

A pg query does not seem to give me any information about the problem. The query results follows this email.

Any help is appreciated.

Cheers
Goncalo

--- * ---



# ceph pg 5.3d0 query
{
    "state": "active+clean+inconsistent",
    "snap_trimq": "[]",
    "epoch": 23099,
    "up": [
        78,
        59,
        49
    ],
    "acting": [
        78,
        59,
        49
    ],
    "actingbackfill": [
        "49",
        "59",
        "78"
    ],
    "info": {
        "pgid": "5.3d0",
        "last_update": "23099'104726",
        "last_complete": "23099'104726",
        "log_tail": "23099'101639",
        "last_user_version": 104726,
        "last_backfill": "MAX",
        "last_backfill_bitwise": 1,
        "purged_snaps": "[]",
        "history": {
            "epoch_created": 339,
            "last_epoch_started": 22440,
            "last_epoch_clean": 22440,
            "last_epoch_split": 0,
            "last_epoch_marked_full": 0,
            "same_up_since": 19928,
            "same_interval_since": 22439,
            "same_primary_since": 22439,
            "last_scrub": "23099'104724",
            "last_scrub_stamp": "2016-08-30 00:30:54.747833",
            "last_deep_scrub": "23099'104724",
            "last_deep_scrub_stamp": "2016-08-30 00:30:54.747833",
            "last_clean_scrub_stamp": "2016-08-29 00:30:33.716646"
        },
        "stats": {
            "version": "23099'104726",
            "reported_seq": "84233",
            "reported_epoch": "23099",
            "state": "active+clean+inconsistent",
            "last_fresh": "2016-08-30 02:40:35.963747",
            "last_change": "2016-08-30 00:30:54.747882",
            "last_active": "2016-08-30 02:40:35.963747",
            "last_peered": "2016-08-30 02:40:35.963747",
            "last_clean": "2016-08-30 02:40:35.963747",
            "last_became_active": "2016-08-25 21:41:30.649369",
            "last_became_peered": "2016-08-25 21:41:30.649369",
            "last_unstale": "2016-08-30 02:40:35.963747",
            "last_undegraded": "2016-08-30 02:40:35.963747",
            "last_fullsized": "2016-08-30 02:40:35.963747",
            "mapping_epoch": 19928,
            "log_start": "23099'101639",
            "ondisk_log_start": "23099'101639",
            "created": 339,
            "last_epoch_clean": 22440,
            "parent": "0.0",
            "parent_split_bits": 0,
            "last_scrub": "23099'104724",
            "last_scrub_stamp": "2016-08-30 00:30:54.747833",
            "last_deep_scrub": "23099'104724",
            "last_deep_scrub_stamp": "2016-08-30 00:30:54.747833",
            "last_clean_scrub_stamp": "2016-08-29 00:30:33.716646",
            "log_size": 3087,
            "ondisk_log_size": 3087,
            "stats_invalid": false,
            "dirty_stats_invalid": false,
            "omap_stats_invalid": false,
            "hitset_stats_invalid": false,
            "hitset_bytes_stats_invalid": false,
            "pin_stats_invalid": true,
            "stat_sum": {
                "num_bytes": 0,
                "num_objects": 257,
                "num_object_clones": 0,
                "num_object_copies": 771,
                "num_objects_missing_on_primary": 0,
                "num_objects_missing": 0,
                "num_objects_degraded": 0,
                "num_objects_misplaced": 0,
                "num_objects_unfound": 0,
                "num_objects_dirty": 257,
                "num_whiteouts": 0,
                "num_read": 21865,
                "num_read_kb": 378449,
                "num_write": 106287,
                "num_write_kb": 402800,
                "num_scrub_errors": 1,
                "num_shallow_scrub_errors": 0,
                "num_deep_scrub_errors": 1,
                "num_objects_recovered": 2006,
                "num_bytes_recovered": 0,
                "num_keys_recovered": 124614,
                "num_objects_omap": 257,
                "num_objects_hit_set_archive": 0,
                "num_bytes_hit_set_archive": 0,
                "num_flush": 0,
                "num_flush_kb": 0,
                "num_evict": 0,
                "num_evict_kb": 0,
                "num_promote": 0,
                "num_flush_mode_high": 0,
                "num_flush_mode_low": 0,
                "num_evict_mode_some": 0,
                "num_evict_mode_full": 0,
                "num_objects_pinned": 0
            },
            "up": [
                78,
                59,
                49
            ],
            "acting": [
                78,
                59,
                49
            ],
            "blocked_by": [],
            "up_primary": 78,
            "acting_primary": 78
        },
        "empty": 0,
        "dne": 0,
        "incomplete": 0,
        "last_epoch_started": 22440,
        "hit_set_history": {
            "current_last_update": "0'0",
            "history": []
        }
    },
    "peer_info": [
        {
            "peer": "49",
            "pgid": "5.3d0",
            "last_update": "23099'104726",
            "last_complete": "23099'104726",
            "log_tail": "1963'93313",
            "last_user_version": 96444,
            "last_backfill": "MAX",
            "last_backfill_bitwise": 1,
            "purged_snaps": "[]",
            "history": {
                "epoch_created": 339,
                "last_epoch_started": 22440,
                "last_epoch_clean": 22440,
                "last_epoch_split": 0,
                "last_epoch_marked_full": 0,
                "same_up_since": 19928,
                "same_interval_since": 22439,
                "same_primary_since": 22439,
                "last_scrub": "23099'104724",
                "last_scrub_stamp": "2016-08-30 00:30:54.747833",
                "last_deep_scrub": "23099'104724",
                "last_deep_scrub_stamp": "2016-08-30 00:30:54.747833",
                "last_clean_scrub_stamp": "2016-08-29 00:30:33.716646"
            },
            "stats": {
                "version": "20737'96443",
                "reported_seq": "77807",
                "reported_epoch": "22439",
                "state": "active+remapped+wait_backfill",
                "last_fresh": "2016-08-25 06:41:22.446581",
                "last_change": "2016-08-24 23:56:20.865302",
                "last_active": "2016-08-25 06:41:22.446581",
                "last_peered": "2016-08-25 06:41:22.446581",
                "last_clean": "2016-08-24 13:42:06.161197",
                "last_became_active": "2016-08-24 23:56:19.815284",
                "last_became_peered": "2016-08-24 23:56:19.815284",
                "last_unstale": "2016-08-25 06:41:22.446581",
                "last_undegraded": "2016-08-25 06:41:22.446581",
                "last_fullsized": "2016-08-25 06:41:22.446581",
                "mapping_epoch": 19928,
                "log_start": "1963'93313",
                "ondisk_log_start": "1963'93313",
                "created": 339,
                "last_epoch_clean": 17445,
                "parent": "0.0",
                "parent_split_bits": 0,
                "last_scrub": "19699'96439",
                "last_scrub_stamp": "2016-08-24 22:59:27.749260",
                "last_deep_scrub": "16645'96391",
                "last_deep_scrub_stamp": "2016-08-22 20:21:59.567449",
                "last_clean_scrub_stamp": "2016-08-24 22:59:27.749260",
                "log_size": 3130,
                "ondisk_log_size": 3130,
                "stats_invalid": false,
                "dirty_stats_invalid": false,
                "omap_stats_invalid": false,
                "hitset_stats_invalid": false,
                "hitset_bytes_stats_invalid": false,
                "pin_stats_invalid": true,
                "stat_sum": {
                    "num_bytes": 0,
                    "num_objects": 252,
                    "num_object_clones": 0,
                    "num_object_copies": 1008,
                    "num_objects_missing_on_primary": 0,
                    "num_objects_missing": 0,
                    "num_objects_degraded": 2,
                    "num_objects_misplaced": 504,
                    "num_objects_unfound": 0,
                    "num_objects_dirty": 252,
                    "num_whiteouts": 0,
                    "num_read": 21538,
                    "num_read_kb": 323200,
                    "num_write": 97965,
                    "num_write_kb": 354745,
                    "num_scrub_errors": 0,
                    "num_shallow_scrub_errors": 0,
                    "num_deep_scrub_errors": 0,
                    "num_objects_recovered": 2006,
                    "num_bytes_recovered": 0,
                    "num_keys_recovered": 124614,
                    "num_objects_omap": 252,
                    "num_objects_hit_set_archive": 0,
                    "num_bytes_hit_set_archive": 0,
                    "num_flush": 0,
                    "num_flush_kb": 0,
                    "num_evict": 0,
                    "num_evict_kb": 0,
                    "num_promote": 0,
                    "num_flush_mode_high": 0,
                    "num_flush_mode_low": 0,
                    "num_evict_mode_some": 0,
                    "num_evict_mode_full": 0,
                    "num_objects_pinned": 0
                },
                "up": [
                    78,
                    59,
                    49
                ],
                "acting": [
                    78,
                    59,
                    49
                ],
                "blocked_by": [],
                "up_primary": 78,
                "acting_primary": 78
            },
            "empty": 0,
            "dne": 0,
            "incomplete": 0,
            "last_epoch_started": 22440,
            "hit_set_history": {
                "current_last_update": "0'0",
                "history": []
            }
        },
        {
            "peer": "59",
            "pgid": "5.3d0",
            "last_update": "23099'104726",
            "last_complete": "23099'104726",
            "log_tail": "1963'93313",
            "last_user_version": 96444,
            "last_backfill": "MAX",
            "last_backfill_bitwise": 1,
            "purged_snaps": "[]",
            "history": {
                "epoch_created": 339,
                "last_epoch_started": 22440,
                "last_epoch_clean": 22440,
                "last_epoch_split": 0,
                "last_epoch_marked_full": 0,
                "same_up_since": 19928,
                "same_interval_since": 22439,
                "same_primary_since": 22439,
                "last_scrub": "23099'104724",
                "last_scrub_stamp": "2016-08-30 00:30:54.747833",
                "last_deep_scrub": "23099'104724",
                "last_deep_scrub_stamp": "2016-08-30 00:30:54.747833",
                "last_clean_scrub_stamp": "2016-08-29 00:30:33.716646"
            },
            "stats": {
                "version": "20737'96444",
                "reported_seq": "77806",
                "reported_epoch": "22437",
                "state": "active+remapped",
                "last_fresh": "2016-08-25 21:41:28.869909",
                "last_change": "2016-08-25 21:41:28.869350",
                "last_active": "2016-08-25 21:41:28.869909",
                "last_peered": "2016-08-25 21:41:28.869909",
                "last_clean": "2016-08-24 13:42:06.161197",
                "last_became_active": "2016-08-24 23:56:19.815284",
                "last_became_peered": "2016-08-24 23:56:19.815284",
                "last_unstale": "2016-08-25 21:41:28.869909",
                "last_undegraded": "2016-08-25 21:41:28.869909",
                "last_fullsized": "2016-08-25 21:41:28.869909",
                "mapping_epoch": 19928,
                "log_start": "1963'93313",
                "ondisk_log_start": "1963'93313",
                "created": 339,
                "last_epoch_clean": 22437,
                "parent": "0.0",
                "parent_split_bits": 0,
                "last_scrub": "19699'96439",
                "last_scrub_stamp": "2016-08-24 22:59:27.749260",
                "last_deep_scrub": "16645'96391",
                "last_deep_scrub_stamp": "2016-08-22 20:21:59.567449",
                "last_clean_scrub_stamp": "2016-08-24 22:59:27.749260",
                "log_size": 3131,
                "ondisk_log_size": 3131,
                "stats_invalid": false,
                "dirty_stats_invalid": false,
                "omap_stats_invalid": false,
                "hitset_stats_invalid": false,
                "hitset_bytes_stats_invalid": false,
                "pin_stats_invalid": true,
                "stat_sum": {
                    "num_bytes": 0,
                    "num_objects": 252,
                    "num_object_clones": 0,
                    "num_object_copies": 1008,
                    "num_objects_missing_on_primary": 0,
                    "num_objects_missing": 0,
                    "num_objects_degraded": 2,
                    "num_objects_misplaced": 252,
                    "num_objects_unfound": 0,
                    "num_objects_dirty": 252,
                    "num_whiteouts": 0,
                    "num_read": 21538,
                    "num_read_kb": 323200,
                    "num_write": 97965,
                    "num_write_kb": 354745,
                    "num_scrub_errors": 0,
                    "num_shallow_scrub_errors": 0,
                    "num_deep_scrub_errors": 0,
                    "num_objects_recovered": 2510,
                    "num_bytes_recovered": 0,
                    "num_keys_recovered": 136360,
                    "num_objects_omap": 252,
                    "num_objects_hit_set_archive": 0,
                    "num_bytes_hit_set_archive": 0,
                    "num_flush": 0,
                    "num_flush_kb": 0,
                    "num_evict": 0,
                    "num_evict_kb": 0,
                    "num_promote": 0,
                    "num_flush_mode_high": 0,
                    "num_flush_mode_low": 0,
                    "num_evict_mode_some": 0,
                    "num_evict_mode_full": 0,
                    "num_objects_pinned": 0
                },
                "up": [
                    78,
                    59,
                    49
                ],
                "acting": [
                    78,
                    59,
                    49
                ],
                "blocked_by": [],
                "up_primary": 78,
                "acting_primary": 78
            },
            "empty": 0,
            "dne": 0,
            "incomplete": 0,
            "last_epoch_started": 22440,
            "hit_set_history": {
                "current_last_update": "0'0",
                "history": []
            }
        }
    ],
    "recovery_state": [
        {
            "name": "Started\/Primary\/Active",
            "enter_time": "2016-08-25 21:41:30.400460",
            "might_have_unfound": [],
            "recovery_progress": {
                "backfill_targets": [],
                "waiting_on_backfill": [],
                "last_backfill_started": "MIN",
                "backfill_info": {
                    "begin": "MIN",
                    "end": "MIN",
                    "objects": []
                },
                "peer_backfill_info": [],
                "backfills_in_flight": [],
                "recovering": [],
                "pg_backend": {
                    "pull_from_peer": [],
                    "pushing": []
                }
            },
            "scrub": {
                "scrubber.epoch_start": "22439",
                "scrubber.active": 0,
                "scrubber.state": "INACTIVE",
                "scrubber.start": "MIN",
                "scrubber.end": "MIN",
                "scrubber.subset_last_update": "0'0",
                "scrubber.deep": false,
                "scrubber.seed": 0,
                "scrubber.waiting_on": 0,
                "scrubber.waiting_on_whom": []
            }
        },
        {
            "name": "Started",
            "enter_time": "2016-08-25 21:41:29.291162"
        }
    ],
    "agent_state": {}
}






-- 
Goncalo Borges
Research Computing
ARC Centre of Excellence for Particle Physics at the Terascale
School of Physics A28 | University of Sydney, NSW  2006
T: +61 2 93511937
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

[Index of Archives]     [Information on CEPH]     [Linux Filesystem Development]     [Ceph Development]     [Ceph Large]     [Ceph Dev]     [Linux USB Development]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [xfs]


  Powered by Linux