My first guess would be that PG overdose protection kicked in [1][2].
You can try fixing it by increasing the allowed number of PGs per OSD with
ceph tell mon.* injectargs '--mon_max_pg_per_osd 500'
ceph tell osd.* injectargs '--mon_max_pg_per_osd 500'
and then triggering a CRUSH algorithm update, for example by restarting an OSD.
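As a rough sanity check on that guess, the numbers from the quoted message can be plugged into a short Python sketch. It assumes a replication size of 3 (the acting sets in the quoted output have three OSDs each), an even spread of the 512 PGs over the SSD OSDs, and no other pool mapped to those OSDs. The limit of 200 is an assumption about the default mon_max_pg_per_osd on this release, so please check the value on your own cluster.

```python
def avg_pgs_per_osd(pg_num: int, replica_count: int, osd_count: int) -> float:
    """Average number of PG replicas each OSD has to host."""
    return pg_num * replica_count / osd_count


# Assumption: default mon_max_pg_per_osd; verify the real value on your cluster.
ASSUMED_LIMIT = 200

for osds in (5, 4):
    load = avg_pgs_per_osd(pg_num=512, replica_count=3, osd_count=osds)
    verdict = "over" if load > ASSUMED_LIMIT else "within"
    print(f"{osds} OSDs: {load:.1f} PGs per OSD ({verdict} the assumed limit of {ASSUMED_LIMIT})")
```

With 4 OSDs this works out to 384 PG replicas per OSD, which would still fit under the 500 suggested above.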
2018-03-17 12:15 GMT+03:00 Nico Schottelius <nico.schottelius@xxxxxxxxxxx>:
Good morning,
Some days ago we created a new pool with 512 PGs, originally on 5 OSDs.
We use the device class "ssd" and a CRUSH rule that maps all data for
the pool "ssd" to the OSDs in the ssd device class.
During creation, one of the SSDs failed, leaving us with 4 OSDs:
[10:00:22] server2.place6:/var/log/ceph# ceph osd tree
ID CLASS     WEIGHT    TYPE NAME        STATUS REWEIGHT PRI-AFF
-1           135.12505 root default
-7            51.36911     host server2
15   hdd-big   9.09511         osd.15       up  1.00000 1.00000
20   hdd-big   9.09511         osd.20       up  1.00000 1.00000
21   hdd-big   9.09511         osd.21       up  1.00000 1.00000
 7 hdd-small   4.54776         osd.7        up  1.00000 1.00000
 8 hdd-small   4.54776         osd.8        up  1.00000 1.00000
10 hdd-small   4.54776         osd.10       up  1.00000 1.00000
26 hdd-small   4.54776         osd.26       up  1.00000 1.00000
14  notinuse   5.45741         osd.14       up  1.00000 1.00000
12       ssd   0.21767         osd.12       up  1.00000 1.00000
24       ssd   0.21767         osd.24       up  1.00000 1.00000
-5            42.50967     host server3
 9   hdd-big   9.09511         osd.9        up  1.00000 1.00000
16   hdd-big   9.09511         osd.16       up  1.00000 1.00000
19   hdd-big   9.09511         osd.19       up  1.00000 1.00000
 3 hdd-small   4.54776         osd.3        up  1.00000 1.00000
 5 hdd-small   4.54776         osd.5        up  1.00000 1.00000
 6 hdd-small   4.54776         osd.6        up  1.00000 1.00000
11  notinuse   0.45424         osd.11       up  1.00000 1.00000
13  notinuse   0.90907         osd.13       up  1.00000 1.00000
25       ssd   0.21776         osd.25       up  1.00000 1.00000
-2            41.24626     host server4
 2   hdd-big   9.09511         osd.2        up  1.00000 1.00000
17   hdd-big   9.09511         osd.17       up  1.00000 1.00000
18   hdd-big   9.09511         osd.18       up  1.00000 1.00000
 0 hdd-small   4.54776         osd.0        up  1.00000 1.00000
 1 hdd-small   4.54776         osd.1        up  1.00000 1.00000
22 hdd-small   4.54776         osd.22       up  1.00000 1.00000
 4  notinuse   0.09999         osd.4        up  1.00000 1.00000
23       ssd   0.21767         osd.23       up  1.00000 1.00000
[10:04:27] server2.place6:/var/log/ceph#
We first had about 160 pgs stuck in creating+activating. After
restarting all osds in the ssd class one by one, it shifted to
100 activating and 60 creating+activating:
[10:00:18] server2.place6:/var/log/ceph# ceph -s
  cluster:
    id:     1ccd84f6-e362-4c50-9ffe-59436745e445
    health: HEALTH_ERR
            1803200/13770981 objects misplaced (13.094%)
            Reduced data availability: 175 pgs inactive
            Degraded data redundancy: 857547/13770981 objects degraded (6.227%), 197 pgs degraded, 123 pgs undersized
            39 slow requests are blocked > 32 sec
            40 stuck requests are blocked > 4096 sec
  services:
    mon: 3 daemons, quorum black1,black2,black3
    mgr: black3(active), standbys: black2, black1
    osd: 27 osds: 27 up, 27 in; 156 remapped pgs
  data:
    pools:   2 pools, 1024 pgs
    objects: 4482k objects, 17725 GB
    usage:   55542 GB used, 83188 GB / 135 TB avail
    pgs:     17.090% pgs not active
             857547/13770981 objects degraded (6.227%)
             1803200/13770981 objects misplaced (13.094%)
             640 active+clean
             105 active+undersized+degraded+remapped+backfill_wait
             100 activating
             60  creating+activating
             50  active+recovery_wait+degraded
             21  active+remapped+backfill_wait
             16  active+recovery_wait+undersized+degraded+remapped
             15  activating+degraded
             9   active+recovery_wait+degraded+remapped
             3   active+recovery_wait+remapped
             3   active+recovery_wait
             2   active+undersized+degraded+remapped+backfilling
  io:
    client:   519 kB/s rd, 38025 kB/s wr, 4 op/s rd, 20 op/s wr
    recovery: 1694 kB/s, 0 objects/s
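
The PG state counts in this output can be cross-checked with a few lines of Python. The sketch below hard-codes the counts shown above and treats a PG as inactive when none of its "+"-separated state flags is exactly "active". That rule is an assumption made for this check, not a statement about how Ceph derives the summary.

```python
# PG state counts copied from the "ceph -s" output above.
pg_states = {
    "active+clean": 640,
    "active+undersized+degraded+remapped+backfill_wait": 105,
    "activating": 100,
    "creating+activating": 60,
    "active+recovery_wait+degraded": 50,
    "active+remapped+backfill_wait": 21,
    "active+recovery_wait+undersized+degraded+remapped": 16,
    "activating+degraded": 15,
    "active+recovery_wait+degraded+remapped": 9,
    "active+recovery_wait+remapped": 3,
    "active+recovery_wait": 3,
    "active+undersized+degraded+remapped+backfilling": 2,
}

total = sum(pg_states.values())
# Assumed rule: a PG counts as inactive if no flag is exactly "active".
inactive = sum(n for state, n in pg_states.items() if "active" not in state.split("+"))

print(f"{total} PGs, {inactive} inactive ({100 * inactive / total:.3f}%)")
```

Under that rule the counts add up to 1024 PGs with 175 inactive, matching the "175 pgs inactive" and "17.090% pgs not active" lines.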
I looked into the archives, but did not find anything directly
related to our situation. We are using Ceph 12.2.4.
An excerpt from our ceph health detail looks like this:
HEALTH_ERR 1803116/13770981 objects misplaced (13.094%); Reduced data availability: 175 pgs inactive; Degraded data redundancy: 856881/13770981 objects degraded (6.222%), 197 pgs degraded, 123 pgs undersized; 53 slow requests are blocked > 32 sec; 40 stuck requests are blocked > 4096 sec
OBJECT_MISPLACED 1803116/13770981 objects misplaced (13.094%)
PG_AVAILABILITY Reduced data availability: 175 pgs inactive
pg 7.118 is stuck inactive for 183000.110669, current state creating+activating, last acting [12,23,25]
pg 7.11a is stuck inactive for 38143.679989, current state activating, last acting [25,24,23]
pg 7.121 is stuck inactive for 38143.670149, current state activating, last acting [25,23,12]
pg 7.123 is stuck inactive for 37184.100764, current state activating+degraded, last acting [25,12,23]
pg 7.125 is stuck inactive for 38143.677390, current state activating, last acting [25,24,23]
pg 7.126 is stuck inactive for 38164.127082, current state activating, last acting [24,23,25]
pg 7.127 is stuck inactive for 183000.110669, current state creating+activating, last acting [12,23,25]
pg 7.12b is stuck inactive for 183000.110669, current state creating+activating, last acting [12,23,25]
where pool 7 is the ssd pool.
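
To see which OSDs and states dominate the stuck PGs, lines like the ones above can be tallied with a short Python sketch. The regular expression is written against the line format shown in this excerpt only, and the function and variable names are made up for illustration; other Ceph releases may word these lines differently.

```python
import re
from collections import Counter

# Pattern for the "pg ... is stuck inactive" lines as they appear in this excerpt.
LINE_RE = re.compile(
    r"pg (?P<pgid>\S+) is stuck inactive for (?P<secs>[\d.]+), "
    r"current state (?P<state>[^,\s]+), last acting \[(?P<acting>[\d,]+)\]"
)


def summarize(lines):
    """Count stuck PGs per state and per OSD in the last acting set."""
    states, osds = Counter(), Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if m is None:
            continue  # skip headers and unrelated lines
        states[m["state"]] += 1
        osds.update(int(o) for o in m["acting"].split(","))
    return states, osds


# Three lines taken from the excerpt above.
SAMPLE = [
    "pg 7.118 is stuck inactive for 183000.110669, current state creating+activating, last acting [12,23,25]",
    "pg 7.11a is stuck inactive for 38143.679989, current state activating, last acting [25,24,23]",
    "pg 7.123 is stuck inactive for 37184.100764, current state activating+degraded, last acting [25,12,23]",
]

states, osds = summarize(SAMPLE)
print(dict(states))
print(dict(osds))
```

Feeding it the full health detail output instead of SAMPLE would show whether the stuck PGs all land on the same few OSDs.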
The pg query of 7.118 looks as follows:
{
"state": "creating+activating",
"snap_trimq": "[1~5]",
"snap_trimq_len": 5,
"epoch": 5016,
"up": [
12,
23,
25
],
"acting": [
12,
23,
25
],
"actingbackfill": [
"12",
"23",
"25"
],
"info": {
"pgid": "7.118",
"last_update": "0'0",
"last_complete": "0'0",
"log_tail": "0'0",
"last_user_version": 0,
"last_backfill": "MAX",
"last_backfill_bitwise": 0,
"purged_snaps": [],
"history": {
"epoch_created": 4620,
"epoch_pool_created": 4620,
"last_epoch_started": 0,
"last_interval_started": 0,
"last_epoch_clean": 0,
"last_interval_clean": 0,
"last_epoch_split": 0,
"last_epoch_marked_full": 0,
"same_up_since": 4967,
"same_interval_since": 4967,
"same_primary_since": 4967,
"last_scrub": "0'0",
"last_scrub_stamp": "2018-03-15 07:18:46.197892",
"last_deep_scrub": "0'0",
"last_deep_scrub_stamp": "2018-03-15 07:18:46.197892",
"last_clean_scrub_stamp": "2018-03-15 07:18:46.197892"
},
"stats": {
"version": "0'0",
"reported_seq": "406",
"reported_epoch": "5016",
"state": "creating+activating",
"last_fresh": "2018-03-17 10:12:58.380048",
"last_change": "2018-03-17 10:10:24.335405",
"last_active": "2018-03-15 07:18:46.197892",
"last_peered": "2018-03-15 07:18:46.197892",
"last_clean": "2018-03-15 07:18:46.197892",
"last_became_active": "0.000000",
"last_became_peered": "0.000000",
"last_unstale": "2018-03-17 10:12:58.380048",
"last_undegraded": "2018-03-17 10:12:58.380048",
"last_fullsized": "2018-03-17 10:12:58.380048",
"mapping_epoch": 4967,
"log_start": "0'0",
"ondisk_log_start": "0'0",
"created": 4620,
"last_epoch_clean": 0,
"parent": "0.0",
"parent_split_bits": 0,
"last_scrub": "0'0",
"last_scrub_stamp": "2018-03-15 07:18:46.197892",
"last_deep_scrub": "0'0",
"last_deep_scrub_stamp": "2018-03-15 07:18:46.197892",
"last_clean_scrub_stamp": "2018-03-15 07:18:46.197892",
"log_size": 0,
"ondisk_log_size": 0,
"stats_invalid": false,
"dirty_stats_invalid": false,
"omap_stats_invalid": false,
"hitset_stats_invalid": false,
"hitset_bytes_stats_invalid": false,
"pin_stats_invalid": false,
"snaptrimq_len": 5,
"stat_sum": {
"num_bytes": 0,
"num_objects": 0,
"num_object_clones": 0,
"num_object_copies": 0,
"num_objects_missing_on_primary": 0,
"num_objects_missing": 0,
"num_objects_degraded": 0,
"num_objects_misplaced": 0,
"num_objects_unfound": 0,
"num_objects_dirty": 0,
"num_whiteouts": 0,
"num_read": 0,
"num_read_kb": 0,
"num_write": 0,
"num_write_kb": 0,
"num_scrub_errors": 0,
"num_shallow_scrub_errors": 0,
"num_deep_scrub_errors": 0,
"num_objects_recovered": 0,
"num_bytes_recovered": 0,
"num_keys_recovered": 0,
"num_objects_omap": 0,
"num_objects_hit_set_archive": 0,
"num_bytes_hit_set_archive": 0,
"num_flush": 0,
"num_flush_kb": 0,
"num_evict": 0,
"num_evict_kb": 0,
"num_promote": 0,
"num_flush_mode_high": 0,
"num_flush_mode_low": 0,
"num_evict_mode_some": 0,
"num_evict_mode_full": 0,
"num_objects_pinned": 0,
"num_legacy_snapsets": 0
},
"up": [
12,
23,
25
],
"acting": [
12,
23,
25
],
"blocked_by": [],
"up_primary": 12,
"acting_primary": 12
},
"empty": 1,
"dne": 0,
"incomplete": 0,
"last_epoch_started": 4968,
"hit_set_history": {
"current_last_update": "0'0",
"history": []
}
},
"peer_info": [
{
"peer": "23",
"pgid": "7.118",
"last_update": "0'0",
"last_complete": "0'0",
"log_tail": "0'0",
"last_user_version": 0,
"last_backfill": "MAX",
"last_backfill_bitwise": 0,
"purged_snaps": [],
"history": {
"epoch_created": 0,
"epoch_pool_created": 0,
"last_epoch_started": 0,
"last_interval_started": 0,
"last_epoch_clean": 0,
"last_interval_clean": 0,
"last_epoch_split": 0,
"last_epoch_marked_full": 0,
"same_up_since": 0,
"same_interval_since": 0,
"same_primary_since": 0,
"last_scrub": "0'0",
"last_scrub_stamp": "0.000000",
"last_deep_scrub": "0'0",
"last_deep_scrub_stamp": "0.000000",
"last_clean_scrub_stamp": "0.000000"
},
"stats": {
"version": "0'0",
"reported_seq": "0",
"reported_epoch": "0",
"state": "unknown",
"last_fresh": "0.000000",
"last_change": "0.000000",
"last_active": "0.000000",
"last_peered": "0.000000",
"last_clean": "0.000000",
"last_became_active": "0.000000",
"last_became_peered": "0.000000",
"last_unstale": "0.000000",
"last_undegraded": "0.000000",
"last_fullsized": "0.000000",
"mapping_epoch": 0,
"log_start": "0'0",
"ondisk_log_start": "0'0",
"created": 0,
"last_epoch_clean": 0,
"parent": "0.0",
"parent_split_bits": 0,
"last_scrub": "0'0",
"last_scrub_stamp": "0.000000",
"last_deep_scrub": "0'0",
"last_deep_scrub_stamp": "0.000000",
"last_clean_scrub_stamp": "0.000000",
"log_size": 0,
"ondisk_log_size": 0,
"stats_invalid": false,
"dirty_stats_invalid": false,
"omap_stats_invalid": false,
"hitset_stats_invalid": false,
"hitset_bytes_stats_invalid": false,
"pin_stats_invalid": false,
"snaptrimq_len": 0,
"stat_sum": {
"num_bytes": 0,
"num_objects": 0,
"num_object_clones": 0,
"num_object_copies": 0,
"num_objects_missing_on_primary": 0,
"num_objects_missing": 0,
"num_objects_degraded": 0,
"num_objects_misplaced": 0,
"num_objects_unfound": 0,
"num_objects_dirty": 0,
"num_whiteouts": 0,
"num_read": 0,
"num_read_kb": 0,
"num_write": 0,
"num_write_kb": 0,
"num_scrub_errors": 0,
"num_shallow_scrub_errors": 0,
"num_deep_scrub_errors": 0,
"num_objects_recovered": 0,
"num_bytes_recovered": 0,
"num_keys_recovered": 0,
"num_objects_omap": 0,
"num_objects_hit_set_archive": 0,
"num_bytes_hit_set_archive": 0,
"num_flush": 0,
"num_flush_kb": 0,
"num_evict": 0,
"num_evict_kb": 0,
"num_promote": 0,
"num_flush_mode_high": 0,
"num_flush_mode_low": 0,
"num_evict_mode_some": 0,
"num_evict_mode_full": 0,
"num_objects_pinned": 0,
"num_legacy_snapsets": 0
},
"up": [],
"acting": [],
"blocked_by": [],
"up_primary": -1,
"acting_primary": -1
},
"empty": 1,
"dne": 1,
"incomplete": 0,
"last_epoch_started": 0,
"hit_set_history": {
"current_last_update": "0'0",
"history": []
}
},
{
"peer": "24",
"pgid": "7.118",
"last_update": "0'0",
"last_complete": "0'0",
"log_tail": "0'0",
"last_user_version": 0,
"last_backfill": "MAX",
"last_backfill_bitwise": 0,
"purged_snaps": [],
"history": {
"epoch_created": 4620,
"epoch_pool_created": 4620,
"last_epoch_started": 0,
"last_interval_started": 0,
"last_epoch_clean": 0,
"last_interval_clean": 0,
"last_epoch_split": 0,
"last_epoch_marked_full": 0,
"same_up_since": 4967,
"same_interval_since": 4967,
"same_primary_since": 4967,
"last_scrub": "0'0",
"last_scrub_stamp": "2018-03-15 07:18:46.197892",
"last_deep_scrub": "0'0",
"last_deep_scrub_stamp": "2018-03-15 07:18:46.197892",
"last_clean_scrub_stamp": "2018-03-15 07:18:46.197892"
},
"stats": {
"version": "0'0",
"reported_seq": "164",
"reported_epoch": "4769",
"state": "creating+remapped+peering",
"last_fresh": "2018-03-16 23:49:04.258780",
"last_change": "2018-03-16 23:49:03.296077",
"last_active": "2018-03-15 07:18:46.197892",
"last_peered": "2018-03-15 07:18:46.197892",
"last_clean": "2018-03-15 07:18:46.197892",
"last_became_active": "0.000000",
"last_became_peered": "0.000000",
"last_unstale": "2018-03-16 23:49:04.258780",
"last_undegraded": "2018-03-16 23:49:04.258780",
"last_fullsized": "2018-03-16 23:49:04.258780",
"mapping_epoch": 4967,
"log_start": "0'0",
"ondisk_log_start": "0'0",
"created": 4620,
"last_epoch_clean": 0,
"parent": "0.0",
"parent_split_bits": 0,
"last_scrub": "0'0",
"last_scrub_stamp": "2018-03-15 07:18:46.197892",
"last_deep_scrub": "0'0",
"last_deep_scrub_stamp": "2018-03-15 07:18:46.197892",
"last_clean_scrub_stamp": "2018-03-15 07:18:46.197892",
"log_size": 0,
"ondisk_log_size": 0,
"stats_invalid": false,
"dirty_stats_invalid": false,
"omap_stats_invalid": false,
"hitset_stats_invalid": false,
"hitset_bytes_stats_invalid": false,
"pin_stats_invalid": false,
"snaptrimq_len": 0,
"stat_sum": {
"num_bytes": 0,
"num_objects": 0,
"num_object_clones": 0,
"num_object_copies": 0,
"num_objects_missing_on_primary": 0,
"num_objects_missing": 0,
"num_objects_degraded": 0,
"num_objects_misplaced": 0,
"num_objects_unfound": 0,
"num_objects_dirty": 0,
"num_whiteouts": 0,
"num_read": 0,
"num_read_kb": 0,
"num_write": 0,
"num_write_kb": 0,
"num_scrub_errors": 0,
"num_shallow_scrub_errors": 0,
"num_deep_scrub_errors": 0,
"num_objects_recovered": 0,
"num_bytes_recovered": 0,
"num_keys_recovered": 0,
"num_objects_omap": 0,
"num_objects_hit_set_archive": 0,
"num_bytes_hit_set_archive": 0,
"num_flush": 0,
"num_flush_kb": 0,
"num_evict": 0,
"num_evict_kb": 0,
"num_promote": 0,
"num_flush_mode_high": 0,
"num_flush_mode_low": 0,
"num_evict_mode_some": 0,
"num_evict_mode_full": 0,
"num_objects_pinned": 0,
"num_legacy_snapsets": 0
},
"up": [
12,
23,
25
],
"acting": [
12,
23,
25
],
"blocked_by": [],
"up_primary": 12,
"acting_primary": 12
},
"empty": 1,
"dne": 0,
"incomplete": 0,
"last_epoch_started": 4769,
"hit_set_history": {
"current_last_update": "0'0",
"history": []
}
},
{
"peer": "25",
"pgid": "7.118",
"last_update": "0'0",
"last_complete": "0'0",
"log_tail": "0'0",
"last_user_version": 0,
"last_backfill": "MAX",
"last_backfill_bitwise": 0,
"purged_snaps": [],
"history": {
"epoch_created": 0,
"epoch_pool_created": 0,
"last_epoch_started": 0,
"last_interval_started": 0,
"last_epoch_clean": 0,
"last_interval_clean": 0,
"last_epoch_split": 0,
"last_epoch_marked_full": 0,
"same_up_since": 0,
"same_interval_since": 0,
"same_primary_since": 0,
"last_scrub": "0'0",
"last_scrub_stamp": "0.000000",
"last_deep_scrub": "0'0",
"last_deep_scrub_stamp": "0.000000",
"last_clean_scrub_stamp": "0.000000"
},
"stats": {
"version": "0'0",
"reported_seq": "0",
"reported_epoch": "0",
"state": "unknown",
"last_fresh": "0.000000",
"last_change": "0.000000",
"last_active": "0.000000",
"last_peered": "0.000000",
"last_clean": "0.000000",
"last_became_active": "0.000000",
"last_became_peered": "0.000000",
"last_unstale": "0.000000",
"last_undegraded": "0.000000",
"last_fullsized": "0.000000",
"mapping_epoch": 0,
"log_start": "0'0",
"ondisk_log_start": "0'0",
"created": 0,
"last_epoch_clean": 0,
"parent": "0.0",
"parent_split_bits": 0,
"last_scrub": "0'0",
"last_scrub_stamp": "0.000000",
"last_deep_scrub": "0'0",
"last_deep_scrub_stamp": "0.000000",
"last_clean_scrub_stamp": "0.000000",
"log_size": 0,
"ondisk_log_size": 0,
"stats_invalid": false,
"dirty_stats_invalid": false,
"omap_stats_invalid": false,
"hitset_stats_invalid": false,
"hitset_bytes_stats_invalid": false,
"pin_stats_invalid": false,
"snaptrimq_len": 0,
"stat_sum": {
"num_bytes": 0,
"num_objects": 0,
"num_object_clones": 0,
"num_object_copies": 0,
"num_objects_missing_on_primary": 0,
"num_objects_missing": 0,
"num_objects_degraded": 0,
"num_objects_misplaced": 0,
"num_objects_unfound": 0,
"num_objects_dirty": 0,
"num_whiteouts": 0,
"num_read": 0,
"num_read_kb": 0,
"num_write": 0,
"num_write_kb": 0,
"num_scrub_errors": 0,
"num_shallow_scrub_errors": 0,
"num_deep_scrub_errors": 0,
"num_objects_recovered": 0,
"num_bytes_recovered": 0,
"num_keys_recovered": 0,
"num_objects_omap": 0,
"num_objects_hit_set_archive": 0,
"num_bytes_hit_set_archive": 0,
"num_flush": 0,
"num_flush_kb": 0,
"num_evict": 0,
"num_evict_kb": 0,
"num_promote": 0,
"num_flush_mode_high": 0,
"num_flush_mode_low": 0,
"num_evict_mode_some": 0,
"num_evict_mode_full": 0,
"num_objects_pinned": 0,
"num_legacy_snapsets": 0
},
"up": [],
"acting": [],
"blocked_by": [],
"up_primary": -1,
"acting_primary": -1
},
"empty": 1,
"dne": 1,
"incomplete": 0,
"last_epoch_started": 0,
"hit_set_history": {
"current_last_update": "0'0",
"history": []
}
}
],
"recovery_state": [
{
"name": "Started/Primary/Active",
"enter_time": "2018-03-17 10:10:24.335124",
"might_have_unfound": [
{
"osd": "24",
"status": "not queried"
}
],
"recovery_progress": {
"backfill_targets": [],
"waiting_on_backfill": [],
"last_backfill_started": "MIN",
"backfill_info": {
"begin": "MIN",
"end": "MIN",
"objects": []
},
"peer_backfill_info": [],
"backfills_in_flight": [],
"recovering": [],
"pg_backend": {
"pull_from_peer": [],
"pushing": []
}
},
"scrub": {
"scrubber.epoch_start": "0",
"scrubber.active": false,
"scrubber.state": "INACTIVE",
"scrubber.start": "MIN",
"scrubber.end": "MIN",
"scrubber.subset_last_update": "0'0",
"scrubber.deep": false,
"scrubber.seed": 0,
"scrubber.waiting_on": 0,
"scrubber.waiting_on_whom": []
}
},
{
"name": "Started",
"enter_time": "2018-03-17 10:10:23.373097"
}
],
"agent_state": {}
}
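
Because the full query output is long, a condensed view can make it easier to compare several PGs. The Python sketch below pulls out only the fields that seem most relevant here (state, up and acting sets, and each peer's reported state and "dne" flag). The field paths follow the JSON above; the function name and the trimmed SAMPLE document are made up for illustration.

```python
import json


def summarize_pg_query(query_json: str) -> dict:
    """Reduce the JSON printed by a pg query to a handful of fields."""
    q = json.loads(query_json)
    return {
        "state": q["state"],
        "up": q["up"],
        "acting": q["acting"],
        "peers": [
            {"osd": p["peer"], "state": p["stats"]["state"], "dne": p["dne"]}
            for p in q.get("peer_info", [])
        ],
        "recovery_states": [s["name"] for s in q.get("recovery_state", [])],
    }


# Trimmed stand-in for the full output above, keeping only the fields used.
SAMPLE = """
{"state": "creating+activating", "up": [12, 23, 25], "acting": [12, 23, 25],
 "peer_info": [
   {"peer": "23", "stats": {"state": "unknown"}, "dne": 1},
   {"peer": "24", "stats": {"state": "creating+remapped+peering"}, "dne": 0}
 ],
 "recovery_state": [{"name": "Started/Primary/Active"}, {"name": "Started"}]}
"""

summary = summarize_pg_query(SAMPLE)
print(json.dumps(summary, indent=2))
```

The same function should accept the complete JSON shown above, since it only reads keys that are present there.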
If anyone has a hint on why it is stuck in creation, it would be very
much appreciated.
Best,
Nico
--
Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com