Hi Jake,

I suspect you have hit an issue that a few others and I have hit in Luminous. By increasing the number of PGs before all the data had rebalanced, you have probably exceeded the hard PG-per-OSD limit. See this thread:
https://www.spinics.net/lists/ceph-users/msg41231.html

A quick way to check is sketched inline below, and a possible workaround is at the end of this mail.

Nick

> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
> Jake Grimmett
> Sent: 29 January 2018 12:46
> To: ceph-users@xxxxxxxxxxxxxx
> Subject: pgs down after adding 260 OSDs & increasing PGs
>
> Dear All,
>
> Our Ceph Luminous (12.2.2) cluster has just broken, due to either adding
> 260 OSD drives in one go, increasing the PG count from 1024 to 4096 in one
> go, or a combination of both...
>
> Prior to the upgrade, the cluster consisted of 10 dual v4 Xeon nodes
> running SL7.4; each node had 19 bluestore OSDs (8TB Seagate Ironwolf) and
> 64GB RAM.
>
> The cluster has just two pools:
> 1) an 8+2 EC pool with 1024 PGs (pg_num/pgp_num) on 190 x hdd.
> 2) a 3x replicated MDS (metadata) pool on NVMe SSDs; 4 of the nodes have
> one NVMe SSD each.
>
> The cluster provides 500TB of CephFS used for scratch space; four
> snapshots are taken daily and kept for one week only.
>
> Everything was working perfectly, until 26 OSDs were added to each node,
> bringing the total hdd OSD count to 450 (all 8TB Ironwolf).
>
> After adding all 260 OSDs with ceph-deploy, ceph health showed:
>
> HEALTH_WARN noout flag(s) set;
> 732950716/1219068139 objects misplaced (60.124%); Degraded data
> redundancy: 1024 pgs unclean; too few PGs per OSD (23 < min 30)
>
> So far so good; I'd expected to see the cluster rebalancing, and the
> complaint about too few PGs per OSD seemed reasonable.
>
> Without waiting for the cluster to rebalance, I increased pg_num/pgp_num
> to 4096. At this point, ceph health showed this:
>
> HEALTH_ERR 135858073/1219068139 objects misplaced (11.144%); Reduced
> data availability: 3119 pgs inactive; Degraded data redundancy:
> 210609/1219068139 objects degraded (0.017%),
> 4088 pgs unclean, 1002 pgs degraded,
> 1002 pgs undersized; 5 stuck requests are blocked > 4096 sec
>
> We then left the cluster to rebalance.
>
> The next morning, two Ceph nodes were down, and I could see lots of
> oom-killer messages in the logs.
> Each node has only 64GB of RAM for 45 OSDs, which is probably the cause.
>
> As a short-term fix, we limited RAM usage by adding this to ceph.conf:
>
> bluestore_cache_size = 104857600
> bluestore_cache_kv_max = 67108864
>
> This appears to have stopped the OOM problems, so we waited while the
> cluster rebalanced, until it stopped reporting "objects misplaced".
> This took a couple of days...
>
> The problem now is that although all of the OSDs are up, lots of PGs are
> down, degraded and unclean, and it is not clear how to fix this.
>
> I have tried issuing osd scrub and pg repair commands, but these do not
> appear to do anything.
>
> CephFS will mount, but then locks up when it hits a PG that is down.
>
> I have tried sequentially restarting all OSDs on each node, slowly walking
> through the cluster several times, but this does not fix things.
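A quick check for the limit I mentioned above: the PGS column of "ceph osd df" shows how many PGs each OSD is currently carrying, and in Luminous an OSD will refuse to activate PGs once it goes past mon_max_pg_per_osd multiplied by osd_max_pg_per_osd_hard_ratio. The two commands below are only a sketch; they assume the Luminous default of mon_max_pg_per_osd = 200, and osd.0 is just an example (run the "ceph daemon" command on whichever node hosts that OSD):

# How many PGs each OSD is carrying right now -- see the PGS column
ceph osd df tree

# What the two limits are currently set to on one of the OSDs
ceph daemon osd.0 config show | grep pg_per_osd
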
>
> Current Status:
>
> # ceph health
> HEALTH_ERR Reduced data availability:
> 3021 pgs inactive, 23 pgs down, 23 pgs stale; Degraded data redundancy:
> 3021 pgs unclean, 1879 pgs degraded,
> 1879 pgs undersized; 1 stuck requests are blocked > 4096 sec
>
> ceph health detail (see http://p.ip.fi/Pwdb ) contains many lines such as:
>
> pg 4.ffe is stuck unclean for 470551.768849, current state
> activating+remapped, last acting [156,175,33,169,135,85,165,55,148,178]
>
> pg 4.fff is stuck undersized for 49509.580577, current state
> activating+undersized+degraded+remapped, last acting
> [44,12,185,125,69,29,119,102,81,2147483647]
>
> (Presumably the OSD number "2147483647" is due to erasure coding, as per
> <http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-May/001660.html>
> ?)
>
> Tailing the log of a stuck OSD with debug osd = 20 shows this:
>
> 2018-01-29 11:56:35.204391 7f0dab4fd700 20 osd.46 15482 share_map_peer
> 0x5647cb336800 already has epoch 15482
> 2018-01-29 11:56:35.213226 7f0da7537700 10 osd.46 15482
> tick_without_osd_lock
> 2018-01-29 11:56:35.213252 7f0da7537700 20 osd.46 15482
> scrub_random_backoff lost coin flip, randomly backing off
> 2018-01-29 11:56:35.213257 7f0da7537700 10 osd.46 15482
> promote_throttle_recalibrate 0 attempts, promoted 0 objects and 0
> bytes; target 25 obj/sec or 5120 k bytes/sec
> 2018-01-29 11:56:35.213263 7f0da7537700 20 osd.46 15482
> promote_throttle_recalibrate new_prob 1000
> 2018-01-29 11:56:35.213266 7f0da7537700 10 osd.46 15482
> promote_throttle_recalibrate actual 0, actual/prob ratio 1, adjusted
> new_prob 1000, prob 1000 -> 1000
> 2018-01-29 11:56:35.232884 7f0dab4fd700 20 osd.46 15482 share_map_peer
> 0x5647cabf3800 already has epoch 15482
>
> Currently this cluster is just storing scratch data, so it could be wiped;
> however, we would be more confident about using Ceph widely if we can fix
> errors like this...
>
> Thanks for reading, any advice appreciated,
>
> Jake
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
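
If the per-OSD PG counts are indeed over that hard limit, the workaround discussed in the thread above is to raise the limits and restart the OSDs so that the PGs stuck in "activating" can peer. This is a rough sketch only; the values are illustrative rather than a recommendation, and I would lower them again once the cluster is healthy. Something like this in ceph.conf on the OSD nodes:

[global]
# illustrative values -- the hard cap is
# mon_max_pg_per_osd * osd_max_pg_per_osd_hard_ratio
mon_max_pg_per_osd = 400
osd_max_pg_per_osd_hard_ratio = 4

I believe the OSDs need a restart for these to take effect; I would do one node at a time and then check whether the stuck PGs start going active:

# restart all OSDs on the current node, then re-check the inactive PGs
systemctl restart ceph-osd.target
ceph pg dump_stuck inactive
ceph pg 4.fff query    # should show why this particular PG is not going active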