Stuck in creating+activating

Good morning,

Some days ago we created a new pool with 512 PGs, originally on 5 OSDs.
We use the device class "ssd" and a CRUSH rule that maps all data for
the pool "ssd" to the OSDs in the ssd device class.

While the pool was being created, one of the SSDs failed, leaving us with 4 OSDs:

[10:00:22] server2.place6:/var/log/ceph# ceph osd tree
ID CLASS     WEIGHT    TYPE NAME        STATUS REWEIGHT PRI-AFF
-1           135.12505 root default
-7            51.36911     host server2
15   hdd-big   9.09511         osd.15       up  1.00000 1.00000
20   hdd-big   9.09511         osd.20       up  1.00000 1.00000
21   hdd-big   9.09511         osd.21       up  1.00000 1.00000
 7 hdd-small   4.54776         osd.7        up  1.00000 1.00000
 8 hdd-small   4.54776         osd.8        up  1.00000 1.00000
10 hdd-small   4.54776         osd.10       up  1.00000 1.00000
26 hdd-small   4.54776         osd.26       up  1.00000 1.00000
14  notinuse   5.45741         osd.14       up  1.00000 1.00000
12       ssd   0.21767         osd.12       up  1.00000 1.00000
24       ssd   0.21767         osd.24       up  1.00000 1.00000
-5            42.50967     host server3
 9   hdd-big   9.09511         osd.9        up  1.00000 1.00000
16   hdd-big   9.09511         osd.16       up  1.00000 1.00000
19   hdd-big   9.09511         osd.19       up  1.00000 1.00000
 3 hdd-small   4.54776         osd.3        up  1.00000 1.00000
 5 hdd-small   4.54776         osd.5        up  1.00000 1.00000
 6 hdd-small   4.54776         osd.6        up  1.00000 1.00000
11  notinuse   0.45424         osd.11       up  1.00000 1.00000
13  notinuse   0.90907         osd.13       up  1.00000 1.00000
25       ssd   0.21776         osd.25       up  1.00000 1.00000
-2            41.24626     host server4
 2   hdd-big   9.09511         osd.2        up  1.00000 1.00000
17   hdd-big   9.09511         osd.17       up  1.00000 1.00000
18   hdd-big   9.09511         osd.18       up  1.00000 1.00000
 0 hdd-small   4.54776         osd.0        up  1.00000 1.00000
 1 hdd-small   4.54776         osd.1        up  1.00000 1.00000
22 hdd-small   4.54776         osd.22       up  1.00000 1.00000
 4  notinuse   0.09999         osd.4        up  1.00000 1.00000
23       ssd   0.21767         osd.23       up  1.00000 1.00000
[10:04:27] server2.place6:/var/log/ceph#
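
As a rough check on these numbers, the sketch below works out how many PG replicas each ssd OSD has to hold on average. It assumes a replication size of 3 (inferred from the three-OSD acting sets reported for this pool, not stated explicitly) and an even spread across OSDs; the function name is purely illustrative.

```python
def avg_pg_replicas_per_osd(pg_num: int, replica_size: int, osd_count: int) -> float:
    """Average number of PG replicas per OSD, assuming an even distribution."""
    if osd_count <= 0:
        raise ValueError("osd_count must be positive")
    return pg_num * replica_size / osd_count


# 512 PGs in the ssd pool; replica size 3 is an assumption (see note above).
planned = avg_pg_replicas_per_osd(512, 3, 5)  # five ssd OSDs, as originally planned
current = avg_pg_replicas_per_osd(512, 3, 4)  # four ssd OSDs, after the failure

print(f"planned: {planned:.1f} PG replicas per OSD")
print(f"current: {current:.1f} PG replicas per OSD")
```

It may be worth comparing these averages with whatever per-OSD PG limit is configured on the cluster (for example `mon_max_pg_per_osd`, if it is set). Whether such a limit plays any role here is not established by the output above.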

We first had about 160 PGs stuck in creating+activating. After
restarting all OSDs in the ssd class one by one, this shifted to
100 activating and 60 creating+activating:


[10:00:18] server2.place6:/var/log/ceph# ceph -s
  cluster:
    id:     1ccd84f6-e362-4c50-9ffe-59436745e445
    health: HEALTH_ERR
            1803200/13770981 objects misplaced (13.094%)
            Reduced data availability: 175 pgs inactive
            Degraded data redundancy: 857547/13770981 objects degraded (6.227%), 197 pgs degraded, 123 pgs undersized
            39 slow requests are blocked > 32 sec
            40 stuck requests are blocked > 4096 sec

  services:
    mon: 3 daemons, quorum black1,black2,black3
    mgr: black3(active), standbys: black2, black1
    osd: 27 osds: 27 up, 27 in; 156 remapped pgs

  data:
    pools:   2 pools, 1024 pgs
    objects: 4482k objects, 17725 GB
    usage:   55542 GB used, 83188 GB / 135 TB avail
    pgs:     17.090% pgs not active
             857547/13770981 objects degraded (6.227%)
             1803200/13770981 objects misplaced (13.094%)
             640 active+clean
             105 active+undersized+degraded+remapped+backfill_wait
             100 activating
             60  creating+activating
             50  active+recovery_wait+degraded
             21  active+remapped+backfill_wait
             16  active+recovery_wait+undersized+degraded+remapped
             15  activating+degraded
             9   active+recovery_wait+degraded+remapped
             3   active+recovery_wait+remapped
             3   active+recovery_wait
             2   active+undersized+degraded+remapped+backfilling

  io:
    client:   519 kB/s rd, 38025 kB/s wr, 4 op/s rd, 20 op/s wr
    recovery: 1694 kB/s, 0 objects/s

I looked through the archives, but did not find anything directly
related to our situation. We are using Ceph 12.2.4.

An excerpt from our ceph health detail looks like this:

HEALTH_ERR 1803116/13770981 objects misplaced (13.094%); Reduced data availability: 175 pgs inactive; Degraded data redundancy: 856881/13770981 objects degraded (6.222%), 197 pgs degraded, 123 pgs undersized; 53 slow requests are blocked > 32 sec; 40 stuck requests are blocked > 4096 sec
OBJECT_MISPLACED 1803116/13770981 objects misplaced (13.094%)
PG_AVAILABILITY Reduced data availability: 175 pgs inactive
    pg 7.118 is stuck inactive for 183000.110669, current state creating+activating, last acting [12,23,25]
    pg 7.11a is stuck inactive for 38143.679989, current state activating, last acting [25,24,23]
    pg 7.121 is stuck inactive for 38143.670149, current state activating, last acting [25,23,12]
    pg 7.123 is stuck inactive for 37184.100764, current state activating+degraded, last acting [25,12,23]
    pg 7.125 is stuck inactive for 38143.677390, current state activating, last acting [25,24,23]
    pg 7.126 is stuck inactive for 38164.127082, current state activating, last acting [24,23,25]
    pg 7.127 is stuck inactive for 183000.110669, current state creating+activating, last acting [12,23,25]
    pg 7.12b is stuck inactive for 183000.110669, current state creating+activating, last acting [12,23,25]

where pool 7 is the ssd pool.
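
To see at a glance which states and which OSDs dominate the stuck PGs, the "stuck inactive" lines can be tallied with a short script. This is a minimal sketch: the regular expression is written against the line format shown in the excerpt, the helper name is hypothetical, and treating the first OSD of each acting set as the primary is an assumption.

```python
import re
from collections import Counter

# Matches lines of the form shown in the excerpt, e.g.
#   pg 7.118 is stuck inactive for 183000.110669, current state creating+activating, last acting [12,23,25]
STUCK_RE = re.compile(
    r"pg (?P<pgid>\S+) is stuck inactive for (?P<secs>[\d.]+), "
    r"current state (?P<state>[^,\s]+), last acting \[(?P<acting>[\d,]*)\]"
)


def tally_stuck_pgs(health_detail: str):
    """Count stuck PGs per state and per first OSD of the acting set."""
    states, primaries = Counter(), Counter()
    for line in health_detail.splitlines():
        m = STUCK_RE.search(line)
        if m is None:
            continue  # skip summary lines and anything else that does not match
        states[m.group("state")] += 1
        acting = [int(x) for x in m.group("acting").split(",") if x]
        if acting:
            primaries[acting[0]] += 1  # assumption: first entry is the primary
    return states, primaries


# The eight PG lines from the excerpt.
sample = """\
    pg 7.118 is stuck inactive for 183000.110669, current state creating+activating, last acting [12,23,25]
    pg 7.11a is stuck inactive for 38143.679989, current state activating, last acting [25,24,23]
    pg 7.121 is stuck inactive for 38143.670149, current state activating, last acting [25,23,12]
    pg 7.123 is stuck inactive for 37184.100764, current state activating+degraded, last acting [25,12,23]
    pg 7.125 is stuck inactive for 38143.677390, current state activating, last acting [25,24,23]
    pg 7.126 is stuck inactive for 38164.127082, current state activating, last acting [24,23,25]
    pg 7.127 is stuck inactive for 183000.110669, current state creating+activating, last acting [12,23,25]
    pg 7.12b is stuck inactive for 183000.110669, current state creating+activating, last acting [12,23,25]
"""

states, primaries = tally_stuck_pgs(sample)
print("by state:", dict(states))
print("by first acting OSD:", dict(primaries))
```

Run against the full `ceph health detail` output rather than this excerpt, the same tally would cover all 175 inactive PGs.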

The pg query of 7.118 looks as follows:

{
    "state": "creating+activating",
    "snap_trimq": "[1~5]",
    "snap_trimq_len": 5,
    "epoch": 5016,
    "up": [
        12,
        23,
        25
    ],
    "acting": [
        12,
        23,
        25
    ],
    "actingbackfill": [
        "12",
        "23",
        "25"
    ],
    "info": {
        "pgid": "7.118",
        "last_update": "0'0",
        "last_complete": "0'0",
        "log_tail": "0'0",
        "last_user_version": 0,
        "last_backfill": "MAX",
        "last_backfill_bitwise": 0,
        "purged_snaps": [],
        "history": {
            "epoch_created": 4620,
            "epoch_pool_created": 4620,
            "last_epoch_started": 0,
            "last_interval_started": 0,
            "last_epoch_clean": 0,
            "last_interval_clean": 0,
            "last_epoch_split": 0,
            "last_epoch_marked_full": 0,
            "same_up_since": 4967,
            "same_interval_since": 4967,
            "same_primary_since": 4967,
            "last_scrub": "0'0",
            "last_scrub_stamp": "2018-03-15 07:18:46.197892",
            "last_deep_scrub": "0'0",
            "last_deep_scrub_stamp": "2018-03-15 07:18:46.197892",
            "last_clean_scrub_stamp": "2018-03-15 07:18:46.197892"
        },
        "stats": {
            "version": "0'0",
            "reported_seq": "406",
            "reported_epoch": "5016",
            "state": "creating+activating",
            "last_fresh": "2018-03-17 10:12:58.380048",
            "last_change": "2018-03-17 10:10:24.335405",
            "last_active": "2018-03-15 07:18:46.197892",
            "last_peered": "2018-03-15 07:18:46.197892",
            "last_clean": "2018-03-15 07:18:46.197892",
            "last_became_active": "0.000000",
            "last_became_peered": "0.000000",
            "last_unstale": "2018-03-17 10:12:58.380048",
            "last_undegraded": "2018-03-17 10:12:58.380048",
            "last_fullsized": "2018-03-17 10:12:58.380048",
            "mapping_epoch": 4967,
            "log_start": "0'0",
            "ondisk_log_start": "0'0",
            "created": 4620,
            "last_epoch_clean": 0,
            "parent": "0.0",
            "parent_split_bits": 0,
            "last_scrub": "0'0",
            "last_scrub_stamp": "2018-03-15 07:18:46.197892",
            "last_deep_scrub": "0'0",
            "last_deep_scrub_stamp": "2018-03-15 07:18:46.197892",
            "last_clean_scrub_stamp": "2018-03-15 07:18:46.197892",
            "log_size": 0,
            "ondisk_log_size": 0,
            "stats_invalid": false,
            "dirty_stats_invalid": false,
            "omap_stats_invalid": false,
            "hitset_stats_invalid": false,
            "hitset_bytes_stats_invalid": false,
            "pin_stats_invalid": false,
            "snaptrimq_len": 5,
            "stat_sum": {
                "num_bytes": 0,
                "num_objects": 0,
                "num_object_clones": 0,
                "num_object_copies": 0,
                "num_objects_missing_on_primary": 0,
                "num_objects_missing": 0,
                "num_objects_degraded": 0,
                "num_objects_misplaced": 0,
                "num_objects_unfound": 0,
                "num_objects_dirty": 0,
                "num_whiteouts": 0,
                "num_read": 0,
                "num_read_kb": 0,
                "num_write": 0,
                "num_write_kb": 0,
                "num_scrub_errors": 0,
                "num_shallow_scrub_errors": 0,
                "num_deep_scrub_errors": 0,
                "num_objects_recovered": 0,
                "num_bytes_recovered": 0,
                "num_keys_recovered": 0,
                "num_objects_omap": 0,
                "num_objects_hit_set_archive": 0,
                "num_bytes_hit_set_archive": 0,
                "num_flush": 0,
                "num_flush_kb": 0,
                "num_evict": 0,
                "num_evict_kb": 0,
                "num_promote": 0,
                "num_flush_mode_high": 0,
                "num_flush_mode_low": 0,
                "num_evict_mode_some": 0,
                "num_evict_mode_full": 0,
                "num_objects_pinned": 0,
                "num_legacy_snapsets": 0
            },
            "up": [
                12,
                23,
                25
            ],
            "acting": [
                12,
                23,
                25
            ],
            "blocked_by": [],
            "up_primary": 12,
            "acting_primary": 12
        },
        "empty": 1,
        "dne": 0,
        "incomplete": 0,
        "last_epoch_started": 4968,
        "hit_set_history": {
            "current_last_update": "0'0",
            "history": []
        }
    },
    "peer_info": [
        {
            "peer": "23",
            "pgid": "7.118",
            "last_update": "0'0",
            "last_complete": "0'0",
            "log_tail": "0'0",
            "last_user_version": 0,
            "last_backfill": "MAX",
            "last_backfill_bitwise": 0,
            "purged_snaps": [],
            "history": {
                "epoch_created": 0,
                "epoch_pool_created": 0,
                "last_epoch_started": 0,
                "last_interval_started": 0,
                "last_epoch_clean": 0,
                "last_interval_clean": 0,
                "last_epoch_split": 0,
                "last_epoch_marked_full": 0,
                "same_up_since": 0,
                "same_interval_since": 0,
                "same_primary_since": 0,
                "last_scrub": "0'0",
                "last_scrub_stamp": "0.000000",
                "last_deep_scrub": "0'0",
                "last_deep_scrub_stamp": "0.000000",
                "last_clean_scrub_stamp": "0.000000"
            },
            "stats": {
                "version": "0'0",
                "reported_seq": "0",
                "reported_epoch": "0",
                "state": "unknown",
                "last_fresh": "0.000000",
                "last_change": "0.000000",
                "last_active": "0.000000",
                "last_peered": "0.000000",
                "last_clean": "0.000000",
                "last_became_active": "0.000000",
                "last_became_peered": "0.000000",
                "last_unstale": "0.000000",
                "last_undegraded": "0.000000",
                "last_fullsized": "0.000000",
                "mapping_epoch": 0,
                "log_start": "0'0",
                "ondisk_log_start": "0'0",
                "created": 0,
                "last_epoch_clean": 0,
                "parent": "0.0",
                "parent_split_bits": 0,
                "last_scrub": "0'0",
                "last_scrub_stamp": "0.000000",
                "last_deep_scrub": "0'0",
                "last_deep_scrub_stamp": "0.000000",
                "last_clean_scrub_stamp": "0.000000",
                "log_size": 0,
                "ondisk_log_size": 0,
                "stats_invalid": false,
                "dirty_stats_invalid": false,
                "omap_stats_invalid": false,
                "hitset_stats_invalid": false,
                "hitset_bytes_stats_invalid": false,
                "pin_stats_invalid": false,
                "snaptrimq_len": 0,
                "stat_sum": {
                    "num_bytes": 0,
                    "num_objects": 0,
                    "num_object_clones": 0,
                    "num_object_copies": 0,
                    "num_objects_missing_on_primary": 0,
                    "num_objects_missing": 0,
                    "num_objects_degraded": 0,
                    "num_objects_misplaced": 0,
                    "num_objects_unfound": 0,
                    "num_objects_dirty": 0,
                    "num_whiteouts": 0,
                    "num_read": 0,
                    "num_read_kb": 0,
                    "num_write": 0,
                    "num_write_kb": 0,
                    "num_scrub_errors": 0,
                    "num_shallow_scrub_errors": 0,
                    "num_deep_scrub_errors": 0,
                    "num_objects_recovered": 0,
                    "num_bytes_recovered": 0,
                    "num_keys_recovered": 0,
                    "num_objects_omap": 0,
                    "num_objects_hit_set_archive": 0,
                    "num_bytes_hit_set_archive": 0,
                    "num_flush": 0,
                    "num_flush_kb": 0,
                    "num_evict": 0,
                    "num_evict_kb": 0,
                    "num_promote": 0,
                    "num_flush_mode_high": 0,
                    "num_flush_mode_low": 0,
                    "num_evict_mode_some": 0,
                    "num_evict_mode_full": 0,
                    "num_objects_pinned": 0,
                    "num_legacy_snapsets": 0
                },
                "up": [],
                "acting": [],
                "blocked_by": [],
                "up_primary": -1,
                "acting_primary": -1
            },
            "empty": 1,
            "dne": 1,
            "incomplete": 0,
            "last_epoch_started": 0,
            "hit_set_history": {
                "current_last_update": "0'0",
                "history": []
            }
        },
        {
            "peer": "24",
            "pgid": "7.118",
            "last_update": "0'0",
            "last_complete": "0'0",
            "log_tail": "0'0",
            "last_user_version": 0,
            "last_backfill": "MAX",
            "last_backfill_bitwise": 0,
            "purged_snaps": [],
            "history": {
                "epoch_created": 4620,
                "epoch_pool_created": 4620,
                "last_epoch_started": 0,
                "last_interval_started": 0,
                "last_epoch_clean": 0,
                "last_interval_clean": 0,
                "last_epoch_split": 0,
                "last_epoch_marked_full": 0,
                "same_up_since": 4967,
                "same_interval_since": 4967,
                "same_primary_since": 4967,
                "last_scrub": "0'0",
                "last_scrub_stamp": "2018-03-15 07:18:46.197892",
                "last_deep_scrub": "0'0",
                "last_deep_scrub_stamp": "2018-03-15 07:18:46.197892",
                "last_clean_scrub_stamp": "2018-03-15 07:18:46.197892"
            },
            "stats": {
                "version": "0'0",
                "reported_seq": "164",
                "reported_epoch": "4769",
                "state": "creating+remapped+peering",
                "last_fresh": "2018-03-16 23:49:04.258780",
                "last_change": "2018-03-16 23:49:03.296077",
                "last_active": "2018-03-15 07:18:46.197892",
                "last_peered": "2018-03-15 07:18:46.197892",
                "last_clean": "2018-03-15 07:18:46.197892",
                "last_became_active": "0.000000",
                "last_became_peered": "0.000000",
                "last_unstale": "2018-03-16 23:49:04.258780",
                "last_undegraded": "2018-03-16 23:49:04.258780",
                "last_fullsized": "2018-03-16 23:49:04.258780",
                "mapping_epoch": 4967,
                "log_start": "0'0",
                "ondisk_log_start": "0'0",
                "created": 4620,
                "last_epoch_clean": 0,
                "parent": "0.0",
                "parent_split_bits": 0,
                "last_scrub": "0'0",
                "last_scrub_stamp": "2018-03-15 07:18:46.197892",
                "last_deep_scrub": "0'0",
                "last_deep_scrub_stamp": "2018-03-15 07:18:46.197892",
                "last_clean_scrub_stamp": "2018-03-15 07:18:46.197892",
                "log_size": 0,
                "ondisk_log_size": 0,
                "stats_invalid": false,
                "dirty_stats_invalid": false,
                "omap_stats_invalid": false,
                "hitset_stats_invalid": false,
                "hitset_bytes_stats_invalid": false,
                "pin_stats_invalid": false,
                "snaptrimq_len": 0,
                "stat_sum": {
                    "num_bytes": 0,
                    "num_objects": 0,
                    "num_object_clones": 0,
                    "num_object_copies": 0,
                    "num_objects_missing_on_primary": 0,
                    "num_objects_missing": 0,
                    "num_objects_degraded": 0,
                    "num_objects_misplaced": 0,
                    "num_objects_unfound": 0,
                    "num_objects_dirty": 0,
                    "num_whiteouts": 0,
                    "num_read": 0,
                    "num_read_kb": 0,
                    "num_write": 0,
                    "num_write_kb": 0,
                    "num_scrub_errors": 0,
                    "num_shallow_scrub_errors": 0,
                    "num_deep_scrub_errors": 0,
                    "num_objects_recovered": 0,
                    "num_bytes_recovered": 0,
                    "num_keys_recovered": 0,
                    "num_objects_omap": 0,
                    "num_objects_hit_set_archive": 0,
                    "num_bytes_hit_set_archive": 0,
                    "num_flush": 0,
                    "num_flush_kb": 0,
                    "num_evict": 0,
                    "num_evict_kb": 0,
                    "num_promote": 0,
                    "num_flush_mode_high": 0,
                    "num_flush_mode_low": 0,
                    "num_evict_mode_some": 0,
                    "num_evict_mode_full": 0,
                    "num_objects_pinned": 0,
                    "num_legacy_snapsets": 0
                },
                "up": [
                    12,
                    23,
                    25
                ],
                "acting": [
                    12,
                    23,
                    25
                ],
                "blocked_by": [],
                "up_primary": 12,
                "acting_primary": 12
            },
            "empty": 1,
            "dne": 0,
            "incomplete": 0,
            "last_epoch_started": 4769,
            "hit_set_history": {
                "current_last_update": "0'0",
                "history": []
            }
        },
        {
            "peer": "25",
            "pgid": "7.118",
            "last_update": "0'0",
            "last_complete": "0'0",
            "log_tail": "0'0",
            "last_user_version": 0,
            "last_backfill": "MAX",
            "last_backfill_bitwise": 0,
            "purged_snaps": [],
            "history": {
                "epoch_created": 0,
                "epoch_pool_created": 0,
                "last_epoch_started": 0,
                "last_interval_started": 0,
                "last_epoch_clean": 0,
                "last_interval_clean": 0,
                "last_epoch_split": 0,
                "last_epoch_marked_full": 0,
                "same_up_since": 0,
                "same_interval_since": 0,
                "same_primary_since": 0,
                "last_scrub": "0'0",
                "last_scrub_stamp": "0.000000",
                "last_deep_scrub": "0'0",
                "last_deep_scrub_stamp": "0.000000",
                "last_clean_scrub_stamp": "0.000000"
            },
            "stats": {
                "version": "0'0",
                "reported_seq": "0",
                "reported_epoch": "0",
                "state": "unknown",
                "last_fresh": "0.000000",
                "last_change": "0.000000",
                "last_active": "0.000000",
                "last_peered": "0.000000",
                "last_clean": "0.000000",
                "last_became_active": "0.000000",
                "last_became_peered": "0.000000",
                "last_unstale": "0.000000",
                "last_undegraded": "0.000000",
                "last_fullsized": "0.000000",
                "mapping_epoch": 0,
                "log_start": "0'0",
                "ondisk_log_start": "0'0",
                "created": 0,
                "last_epoch_clean": 0,
                "parent": "0.0",
                "parent_split_bits": 0,
                "last_scrub": "0'0",
                "last_scrub_stamp": "0.000000",
                "last_deep_scrub": "0'0",
                "last_deep_scrub_stamp": "0.000000",
                "last_clean_scrub_stamp": "0.000000",
                "log_size": 0,
                "ondisk_log_size": 0,
                "stats_invalid": false,
                "dirty_stats_invalid": false,
                "omap_stats_invalid": false,
                "hitset_stats_invalid": false,
                "hitset_bytes_stats_invalid": false,
                "pin_stats_invalid": false,
                "snaptrimq_len": 0,
                "stat_sum": {
                    "num_bytes": 0,
                    "num_objects": 0,
                    "num_object_clones": 0,
                    "num_object_copies": 0,
                    "num_objects_missing_on_primary": 0,
                    "num_objects_missing": 0,
                    "num_objects_degraded": 0,
                    "num_objects_misplaced": 0,
                    "num_objects_unfound": 0,
                    "num_objects_dirty": 0,
                    "num_whiteouts": 0,
                    "num_read": 0,
                    "num_read_kb": 0,
                    "num_write": 0,
                    "num_write_kb": 0,
                    "num_scrub_errors": 0,
                    "num_shallow_scrub_errors": 0,
                    "num_deep_scrub_errors": 0,
                    "num_objects_recovered": 0,
                    "num_bytes_recovered": 0,
                    "num_keys_recovered": 0,
                    "num_objects_omap": 0,
                    "num_objects_hit_set_archive": 0,
                    "num_bytes_hit_set_archive": 0,
                    "num_flush": 0,
                    "num_flush_kb": 0,
                    "num_evict": 0,
                    "num_evict_kb": 0,
                    "num_promote": 0,
                    "num_flush_mode_high": 0,
                    "num_flush_mode_low": 0,
                    "num_evict_mode_some": 0,
                    "num_evict_mode_full": 0,
                    "num_objects_pinned": 0,
                    "num_legacy_snapsets": 0
                },
                "up": [],
                "acting": [],
                "blocked_by": [],
                "up_primary": -1,
                "acting_primary": -1
            },
            "empty": 1,
            "dne": 1,
            "incomplete": 0,
            "last_epoch_started": 0,
            "hit_set_history": {
                "current_last_update": "0'0",
                "history": []
            }
        }
    ],
    "recovery_state": [
        {
            "name": "Started/Primary/Active",
            "enter_time": "2018-03-17 10:10:24.335124",
            "might_have_unfound": [
                {
                    "osd": "24",
                    "status": "not queried"
                }
            ],
            "recovery_progress": {
                "backfill_targets": [],
                "waiting_on_backfill": [],
                "last_backfill_started": "MIN",
                "backfill_info": {
                    "begin": "MIN",
                    "end": "MIN",
                    "objects": []
                },
                "peer_backfill_info": [],
                "backfills_in_flight": [],
                "recovering": [],
                "pg_backend": {
                    "pull_from_peer": [],
                    "pushing": []
                }
            },
            "scrub": {
                "scrubber.epoch_start": "0",
                "scrubber.active": false,
                "scrubber.state": "INACTIVE",
                "scrubber.start": "MIN",
                "scrubber.end": "MIN",
                "scrubber.subset_last_update": "0'0",
                "scrubber.deep": false,
                "scrubber.seed": 0,
                "scrubber.waiting_on": 0,
                "scrubber.waiting_on_whom": []
            }
        },
        {
            "name": "Started",
            "enter_time": "2018-03-17 10:10:23.373097"
        }
    ],
    "agent_state": {}
}
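
Because the query output is long, a small script can pull out the few fields that matter for a stuck PG. The sketch below is written under assumptions: the helper name is hypothetical, the sample is a trimmed-down stand-in containing only the fields it reads, and `dne` is taken to mean that the peer reports the PG as not existing.

```python
import json


def summarize_pg_query(query: dict) -> dict:
    """Extract state, acting set, and peer oddities from `ceph pg <pgid> query` JSON."""
    acting = [int(osd) for osd in query.get("acting", [])]
    # peer_info entries carry the peer OSD id as a string, e.g. "23".
    peers = {int(p["peer"]): p for p in query.get("peer_info", [])}
    return {
        "state": query.get("state"),
        "acting": acting,
        # Acting-set members whose peer_info has dne == 1.
        "acting_peers_dne": sorted(
            osd for osd in acting if peers.get(osd, {}).get("dne") == 1
        ),
        # Peers known to the primary that are not part of the acting set.
        "peers_outside_acting": sorted(osd for osd in peers if osd not in acting),
    }


# Trimmed-down stand-in for the full output above (only the fields used here).
sample = json.loads("""
{
  "state": "creating+activating",
  "acting": [12, 23, 25],
  "peer_info": [
    {"peer": "23", "dne": 1},
    {"peer": "24", "dne": 0},
    {"peer": "25", "dne": 1}
  ]
}
""")

summary = summarize_pg_query(sample)
print(summary)
```

For pg 7.118 this highlights two details of the output: peers 23 and 25 report `dne: 1`, and osd.24 appears in `peer_info` although it is not in the acting set [12,23,25].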


If anyone has a hint as to why these PGs are stuck in creation, it would be
very much appreciated.

Best,

Nico

--
Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch