Re: pgs down after adding 260 OSDs & increasing PGs

Hi Jake,

I suspect you have hit an issue that a few others and I have hit in
Luminous. By increasing the number of PGs before all the data had
re-balanced, you have probably exceeded the hard PG-per-OSD limit.

See this thread
https://www.spinics.net/lists/ceph-users/msg41231.html
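
If that is the case, the workaround is generally to raise the limits
temporarily so the stuck "activating" PGs are allowed to peer. A rough
sketch, assuming Luminous 12.2.x option names and with purely
illustrative values:

    ceph tell mon.* injectargs '--mon_max_pg_per_osd 400'
    ceph tell osd.* injectargs '--osd_max_pg_per_osd_hard_ratio 4'

(or set the same options in ceph.conf and restart the daemons if
injectargs reports the change is not observed at runtime). Once the
cluster is healthy again, the limits can be put back to their defaults.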

Nick

> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
> Jake Grimmett
> Sent: 29 January 2018 12:46
> To: ceph-users@xxxxxxxxxxxxxx
> Subject:  pgs down after adding 260 OSDs & increasing PGs
> 
> Dear All,
> 
> Our Ceph Luminous (12.2.2) cluster has just broken, due either to adding
> 260 OSD drives in one go, to increasing the PG count from 1024 to 4096 in
> one go, or to a combination of both...
> 
> Prior to the upgrade, the cluster consisted of 10 dual v4 Xeon nodes
> running SL7.4; each node had 19 bluestore OSDs (8TB Seagate Ironwolf) and
> 64GB of RAM.
> 
> The cluster has just two pools:
> 1) an 8+2 EC pool with pg_num/pgp_num of 1024, spread across 190 hdd OSDs.
> 2) a 3x replicated CephFS metadata pool on NVMe SSDs (one SSD in each of
> 4 nodes).
> 
> The cluster provides 500TB of CephFS used for scratch space, with four
> snapshots taken daily and kept for one week only.
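> 
> (For reference, an 8+2 EC data pool of this kind is presumably created
> along these lines, with the profile and pool names below being only
> placeholders; on Luminous, an EC pool used directly as a CephFS data pool
> also needs overwrites enabled:
> 
> ceph osd erasure-code-profile set ec-8-2 k=8 m=2
> ceph osd pool create <data-pool> 1024 1024 erasure ec-8-2
> ceph osd pool set <data-pool> allow_ec_overwrites true)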
> 
> Everything was working perfectly until 26 OSDs were added to each node,
> bringing the total hdd OSD count to 450 (all 8TB Ironwolf).
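> 
> (The new OSDs were created with ceph-deploy, presumably one device at a
> time with something like "ceph-deploy osd create <node>:<device>"; the
> exact syntax depends on the ceph-deploy version in use.)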
> 
> After adding all 260 OSDs with ceph-deploy, ceph health showed:
> 
> HEALTH_WARN noout flag(s) set;
> 732950716/1219068139 objects misplaced (60.124%); Degraded data
> redundancy: 1024 pgs unclean; too few PGs per OSD (23 < min 30)
> 
> So far so good; I'd expected to see the cluster rebalancing, and the
> complaint about too few PGs per OSD seemed reasonable.
> 
> Without waiting for the cluster to rebalance, I increased pg_num/pgp_num
> to 4096. At this point, ceph health showed this:
> 
> HEALTH_ERR 135858073/1219068139 objects misplaced (11.144%); Reduced
> data availability: 3119 pgs inactive; Degraded data redundancy:
> 210609/1219068139 objects degraded (0.017%),
> 4088 pgs unclean, 1002 pgs degraded,
> 1002 pgs undersized; 5 stuck requests are blocked > 4096 sec
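> 
> (The increase was presumably done with something along the lines of
> "ceph osd pool set <ec-pool> pg_num 4096" followed by
> "ceph osd pool set <ec-pool> pgp_num 4096", where <ec-pool> stands for
> the EC data pool name.)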
> 
> We then left the cluster to rebalance.
> 
> The next morning, two Ceph nodes were down, and I could see lots of
> oom-killer messages in the logs.
> Each node has only 64GB of RAM for 45 OSDs, which is probably the cause
> of this.
> 
> As a short-term fix, we limited RAM usage by adding this to ceph.conf:
> 
> bluestore_cache_size = 104857600
> bluestore_cache_kv_max = 67108864
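> 
> (104857600 bytes is 100 MiB and 67108864 bytes is 64 MiB, i.e. well below
> the Luminous bluestore cache defaults.)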
> 
> This appears to have stopped the OOM problems, so we waited while the
> cluster rebalanced, until ceph health stopped reporting "objects
> misplaced". This took a couple of days...
> 
> The problem now is that although all of the OSDs are up, lots of PGs are
> down, degraded or unclean, and it is not clear how to fix this.
> 
> I have tried issuing osd scrub and pg repair commands, but these do not
> appear to do anything.
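> 
> (i.e. commands along the lines of "ceph osd scrub <osd-id>" and
> "ceph pg repair <pg-id>".)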
> 
> CephFS will mount, but locks up when it hits a PG that is down.
> 
> I have tried sequentially restarting all OSDs on each node, slowly
> walking through the cluster several times, but this does not fix things.
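> 
> (The restarts were done per OSD, presumably with something like
> "systemctl restart ceph-osd@<id>" on each node.)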
> 
> Current Status:
> # ceph health
> HEALTH_ERR Reduced data availability: 3021 pgs inactive, 23 pgs down,
> 23 pgs stale; Degraded data redundancy: 3021 pgs unclean, 1879 pgs
> degraded, 1879 pgs undersized; 1 stuck requests are blocked > 4096 sec
> 
> ceph health detail (see http://p.ip.fi/Pwdb ) contains many lines such as:
> 
> pg 4.ffe is stuck unclean for 470551.768849, current state
> activating+remapped, last acting [156,175,33,169,135,85,165,55,148,178]
> 
> pg 4.fff is stuck undersized for 49509.580577, current state
> activating+undersized+degraded+remapped, last acting
> [44,12,185,125,69,29,119,102,81,2147483647]
> 
> (Presumably the OSD number "2147483647" is due to Erasure Coding, as per
> <http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-May/001660.html>
> ?)
> 
> Tailing the stuck osd log with debug osd = 20 shows this:
> 
> 2018-01-29 11:56:35.204391 7f0dab4fd700 20 osd.46 15482 share_map_peer
> 0x5647cb336800 already has epoch 15482
> 2018-01-29 11:56:35.213226 7f0da7537700 10 osd.46 15482
> tick_without_osd_lock
> 2018-01-29 11:56:35.213252 7f0da7537700 20 osd.46 15482
> scrub_random_backoff lost coin flip, randomly backing off
> 2018-01-29 11:56:35.213257 7f0da7537700 10 osd.46 15482
> promote_throttle_recalibrate 0 attempts, promoted 0 objects and 0
> bytes; target 25 obj/sec or 5120 k bytes/sec
> 2018-01-29 11:56:35.213263 7f0da7537700 20 osd.46 15482
> promote_throttle_recalibrate  new_prob 1000
> 2018-01-29 11:56:35.213266 7f0da7537700 10 osd.46 15482
> promote_throttle_recalibrate  actual 0, actual/prob ratio 1, adjusted
> new_prob 1000, prob 1000 -> 1000
> 2018-01-29 11:56:35.232884 7f0dab4fd700 20 osd.46 15482 share_map_peer
> 0x5647cabf3800 already has epoch 15482
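> 
> (The debug level was presumably raised either via "debug osd = 20" in
> ceph.conf or at runtime with something like
> "ceph tell osd.46 injectargs '--debug_osd 20/20'".)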
> 
> Currently this cluster is just storing scratch data, so it could be
> wiped; however, we would be more confident about using Ceph more widely
> if we can fix errors like this...
> 
> thanks for reading, any advice appreciated,
> 
> Jake
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


