"340 osds: 101 up, 112 in" This is going to be your culprit. Your CRUSH map is in a really weird state. How many OSDs do you have in this cluster? When OSDs go down, secondary OSDs take over for it, but when OSDs get marked out, the cluster re-balances to distribute the data according to how many replicas your settings say it should have (remapped PGs). Your cluster thinks it has 340 OSDs in total, it believes that 112 of them are added to the cluster, but only 101 of them are currently up and running. That means that it is trying to put all of your data onto those 101 OSDs. Your settings to have 16k PGs is fine for 340 OSDs, but with 101 OSDs you're getting the error of too many PGs per OSD.
So next steps:
1) How many OSDs do you expect to be in your Ceph cluster?
2) Did you bring your OSDs back up during your rolling restart testing BEFORE
a) They were marked down in the cluster?
b) You moved on to the next node? Additionally, did you wait for all backfilling to finish before you proceeded to the next node?
3) Do you have enough memory in your nodes, or are your OSDs being killed by the OOM killer? The large number of peering PGs in your status output suggests that the OSDs are continually restarting or being marked down for not responding; see the sketch after this list for a quick way to check.
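If it helps, this is roughly what I'd run to check for OOM kills and flapping OSDs (just a sketch, adjust OSD IDs and log paths for your environment):

~~~
# On each OSD node: did the kernel OOM killer take out any ceph-osd processes?
dmesg -T | grep -i -E 'out of memory|killed process'
journalctl -k | grep -i oom

# OSD logs show repeated "wrongly marked me down" messages when daemons are flapping
grep -i 'wrongly marked me down' /var/log/ceph/ceph-osd.*.log

# From an admin node: which OSDs are down/out and how full the remaining ones are
ceph osd tree
ceph osd df
~~~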
On Thu, May 18, 2017 at 2:41 PM nokia ceph <nokiacephusers@xxxxxxxxx> wrote:
Hello,

Env:- Bluestore, EC 4+1, v11.2.0, RHEL7.3, 16383 PG

We did our resiliency testing and found the OSDs keep flapping and the cluster went to an error state.

What we did:-
1. We have a 5 node cluster.
2. Poweroff/stop ceph.target on the last node and waited until everything seemed to reach back to normal.
3. Then powered up the last node, and then we see this recovery stuck on remapped PGs.
~~~
osdmap e4829: 340 osds: 101 up, 112 in; 15011 remapped pgs
~~~
4. Initially all OSDs reached 340, at the same time this remapped value reached 16384 with OSD epoch value e818.
5. Then after 1 or 2 hours we suspect that this remapped PG value keeps on incrementing/decrementing, resulting in the OSDs starting to fail one by one. While we tested with the below patch also, still no change.

#ceph -s
~~~
2017-05-18 18:07:45.876586 7fd6bb87e700 -1 WARNING: the following dangerous and experimental features are enabled: bluestore,rocksdb
2017-05-18 18:07:45.900045 7fd6bb87e700 -1 WARNING: the following dangerous and experimental features are enabled: bluestore,rocksdb
    cluster cb55baa8-d5a5-442e-9aae-3fd83553824e
     health HEALTH_ERR
            27056 pgs are stuck inactive for more than 300 seconds
            744 pgs degraded
            10944 pgs down
            3919 pgs peering
            11416 pgs stale
            744 pgs stuck degraded
            15640 pgs stuck inactive
            11416 pgs stuck stale
            16384 pgs stuck unclean
            744 pgs stuck undersized
            744 pgs undersized
            recovery 1279809/135206985 objects degraded (0.947%)
            too many PGs per OSD (731 > max 300)
            11/112 in osds are down
     monmap e3: 5 mons at {PL6-CN1=10.50.62.151:6789/0,PL6-CN2=10.50.62.152:6789/0,PL6-CN3=10.50.62.153:6789/0,PL6-CN4=10.50.62.154:6789/0,PL6-CN5=1
            election epoch 22, quorum 0,1,2,3,4 PL6-CN1,PL6-CN2,PL6-CN3,PL6-CN4,PL6-CN5
        mgr no daemons active
     osdmap e4827: 340 osds: 101 up, 112 in; 15011 remapped pgs
            flags sortbitwise,require_jewel_osds,require_kraken_osds
      pgmap v83202: 16384 pgs, 1 pools, 52815 GB data, 26407 kobjects
            12438 GB used, 331 TB / 343 TB avail
            1279809/135206985 objects degraded (0.947%)
                4512 stale+down+remapped
                3060 down+remapped
                2204 stale+down
                2000 stale+remapped+peering
                1259 stale+peering
                1167 down
                 739 stale+active+undersized+degraded
                 702 stale+remapped
                 557 peering
                 102 remapped+peering
~~~

# ceph pg stat
~~~
2017-05-18 18:09:18.345865 7fe2f72ec700 -1 WARNING: the following dangerous and experimental features are enabled: bluestore,rocksdb
2017-05-18 18:09:18.368566 7fe2f72ec700 -1 WARNING: the following dangerous and experimental features are enabled: bluestore,rocksdb
v83204: 16384 pgs: 1 inactive, 1259 stale+peering, 75 remapped, 2000 stale+remapped+peering, 102 remapped+peering, 2204 stale+down, 739 stale+active+undersized+degraded, 1 down+remapped+peering, 702 stale+remapped, 557 peering, 4512 stale+down+remapped, 3060 down+remapped, 5 active+undersized+degraded, 1167 down; 52815 GB data, 12438 GB used, 331 TB / 343 TB avail; 1279809/135206985 objects degraded (0.947%)
~~~

Randomly captured some PG values.
~~~
3.3ffc 1646 0 1715 0 0 3451912192 1646 1646 stale+active+undersized+degraded 2017-05-18 11:06:32.453158 846'1646 872:1634 [36,NONE,278,219,225] 36 [36,NONE,278,219,225] 36 0'0 2017-05-18 07:14:30.303859 0'0 2017-05-18 07:14:30.303859
3.3ffb 1711 0 0 0 0 3588227072 1711 1711 down 2017-05-18 15:20:52.858840 846'1711 1602:1708 [150,161,NONE,NONE,83] 150 [150,161,NONE,NONE,83] 150 0'0 2017-05-18 07:14:30.303838 0'0 2017-05-18 07:14:30.303838
3.3ffa 1617 0 0 0 0 3391094784 1617 1617 down+remapped 2017-05-18 17:12:54.943317 846'1617 2525:1637 [48,292,77,277,49] 48 [48,NONE,NONE,277,49] 48 0'0 2017-05-18 07:14:30.303807 0'0 2017-05-18 07:14:30.303807
3.3ff9 1682 0 0 0 0 3527409664 1682 1682 down+remapped 2017-05-18 16:16:42.223632 846'1682 2195:1678 [266,79,NONE,309,258] 266 [NONE,NONE,NONE,NONE,258] 258 0'0 2017-05-18 07:14:30.303793 0'0 2017-05-18 07:14:30.303793
~~~

ceph.conf
~~~
[mon]
mon_osd_down_out_interval = 3600
mon_osd_reporter_subtree_level = host
mon_osd_down_out_subtree_limit = host
mon_osd_min_down_reporters = 4
mon_allow_pool_delete = true

[osd]
bluestore = true
bluestore_cache_size = 107374182
bluefs_buffered_io = true
osd_op_threads = 24
osd_op_num_shards = 5
osd_op_num_threads_per_shard = 2
osd_enable_op_tracker = false
osd_scrub_begin_hour = 1
osd_scrub_end_hour = 7
osd_deep_scrub_interval = 3.154e+9
osd_max_backfills = 3
osd_recovery_max_active = 3
osd_recovery_op_priority = 1
~~~

# ceph osd stat
~~~
2017-05-18 18:10:11.864303 7fedc5a98700 -1 WARNING: the following dangerous and experimental features are enabled: bluestore,rocksdb
2017-05-18 18:10:11.887182 7fedc5a98700 -1 WARNING: the following dangerous and experimental features are enabled: bluestore,rocksdb
osdmap e4829: 340 osds: 101 up, 112 in; 15011 remapped pgs ==<<< <<<<<<<<<<<<<< SEE this
flags sortbitwise,require_jewel_osds,require_kraken_osds
~~~

Is there any config directive which helps to skip the remapped PG count during the recovery process? Has Luminous v12.0.3 fixed the OSD flap issue?

Awaiting your suggestions.

Thanks,
Jayaram
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com