pgs down after adding 260 OSDs & increasing PGs

Dear All,

Our Ceph Luminous (12.2.2) cluster has just broken, due either to adding 260 OSD drives in one go, to increasing the PG count from 1024 to 4096 in one go, or a combination of both...

Prior to the expansion, the cluster consisted of 10 dual v4 Xeon nodes running SL7.4; each node had 19 BlueStore OSDs (8TB Seagate IronWolf) and 64GB RAM.

The cluster has just two pools:
1) an 8+2 EC pool (1024 pg/pgp) on 190 hdd OSDs.
2) a 3x replicated MDS pool on NVMe SSDs (4 of the nodes have one NVMe SSD each).

The cluster provides 500TB of CephFS used for scratch space; four snapshots are taken daily and kept for one week only.

Everything was working perfectly, until 26 OSDs were added to each node, bringing the total hdd OSD count to 450 (all 8TB IronWolf).

After adding all 260 OSDs with ceph-deploy, ceph health showed:

HEALTH_WARN noout flag(s) set;
732950716/1219068139 objects misplaced (60.124%);
Degraded data redundancy: 1024 pgs unclean;
too few PGs per OSD (23 < min 30)

So far so good: I'd expected to see the cluster rebalancing, and the complaint about too few PGs per OSD seemed reasonable.
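For what it's worth, the "23" figure matches a quick back-of-envelope estimate (a sketch only; it assumes each 8+2 EC PG places one shard on 10 distinct OSDs and ignores the small replicated pool on the NVMe OSDs):

```python
# Rough sanity check of "too few PGs per OSD (23 < min 30)".
# Assumption: each 8+2 EC PG occupies k+m = 10 OSDs; the small
# replicated metadata pool on the NVMe OSDs is ignored.
pg_num = 1024
ec_shards = 8 + 2        # k + m
hdd_osds = 450           # 190 original + 260 new

pgs_per_osd = round(pg_num * ec_shards / hdd_osds)
print(pgs_per_osd)       # ~23, matching the health warning
```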

Without waiting for the cluster to rebalance, I increased pg_num/pgp_num to 4096. At this point, ceph health showed this:

HEALTH_ERR 135858073/1219068139 objects misplaced (11.144%);
Reduced data availability: 3119 pgs inactive;
Degraded data redundancy: 210609/1219068139 objects degraded (0.017%), 4088 pgs unclean, 1002 pgs degraded,
1002 pgs undersized; 5 stuck requests are blocked > 4096 sec
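The percentages in that output are self-consistent with the raw object counts it reports, e.g.:

```python
# Cross-check the percentages ceph health reports against its own
# raw counts (numbers taken verbatim from the HEALTH_ERR output above).
total_objects = 1219068139

pct_misplaced = 100 * 135858073 / total_objects
pct_degraded = 100 * 210609 / total_objects

print(f"misplaced: {pct_misplaced:.3f}%")  # 11.144%
print(f"degraded:  {pct_degraded:.3f}%")   # 0.017%
```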

We then left the cluster to rebalance.

Next morning, two Ceph nodes were down, and I could see lots of oom-killer messages in the logs.
Each node has only 64GB of RAM for 45 OSDs, which is probably the cause of this.

As a short-term fix, we limited RAM usage by adding this to ceph.conf:
bluestore_cache_size = 104857600
bluestore_cache_kv_max = 67108864
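(For reference, assuming those values are plain bytes as written in our ceph.conf, they work out to a 100 MiB BlueStore cache with at most 64 MiB of it for the KV cache:)

```python
# Convert the ceph.conf byte values above to MiB for readability.
MiB = 1024 * 1024
bluestore_cache_size = 104857600
bluestore_cache_kv_max = 67108864

print(bluestore_cache_size // MiB, "MiB")    # 100 MiB total cache
print(bluestore_cache_kv_max // MiB, "MiB")  # 64 MiB max for KV cache
```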

This appears to have stopped the OOM problems, so we waited while the cluster rebalanced, until it stopped reporting "objects misplaced".
This took a couple of days...

The problem now is that although all of the OSDs are up, lots of pgs are down, degraded, or unclean, and it is not clear how to fix this.

I have tried issuing osd scrub and pg repair commands, but these do not appear to do anything.

CephFS will mount, but locks up when it hits a pg that is down.

I have tried sequentially restarting all OSDs on each node, slowly walking through the cluster several times, but this does not fix things.

Current Status:
# ceph health
HEALTH_ERR Reduced data availability:
3021 pgs inactive, 23 pgs down, 23 pgs stale;
Degraded data redundancy: 3021 pgs unclean, 1879 pgs degraded,
1879 pgs undersized; 1 stuck requests are blocked > 4096 sec

ceph health detail (see http://p.ip.fi/Pwdb ) contains many lines such as:

pg 4.ffe is stuck unclean for 470551.768849, current state activating+remapped, last acting [156,175,33,169,135,85,165,55,148,178]

pg 4.fff is stuck undersized for 49509.580577, current state activating+undersized+degraded+remapped, last acting [44,12,185,125,69,29,119,102,81,2147483647]

(Presumably the OSD number "2147483647" is due to Erasure Coding,
as per <http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-May/001660.html> ?)
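(My understanding from that thread: 2147483647 is 2^31 - 1, i.e. the placeholder Ceph prints in an EC acting set when a shard currently has no OSD assigned, rather than a real OSD id:)

```python
# 2147483647 == 0x7fffffff == 2**31 - 1: the "no OSD assigned"
# placeholder that appears in EC acting sets, not a real OSD id.
placeholder = 2147483647

print(placeholder == 2**31 - 1)  # True
print(hex(placeholder))          # 0x7fffffff
```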

Tailing the stuck OSD's log with debug osd = 20 shows this:

2018-01-29 11:56:35.204391 7f0dab4fd700 20 osd.46 15482 share_map_peer 0x5647cb336800 already has epoch 15482
2018-01-29 11:56:35.213226 7f0da7537700 10 osd.46 15482 tick_without_osd_lock
2018-01-29 11:56:35.213252 7f0da7537700 20 osd.46 15482 scrub_random_backoff lost coin flip, randomly backing off
2018-01-29 11:56:35.213257 7f0da7537700 10 osd.46 15482 promote_throttle_recalibrate 0 attempts, promoted 0 objects and 0 bytes; target 25 obj/sec or 5120 k bytes/sec
2018-01-29 11:56:35.213263 7f0da7537700 20 osd.46 15482 promote_throttle_recalibrate new_prob 1000
2018-01-29 11:56:35.213266 7f0da7537700 10 osd.46 15482 promote_throttle_recalibrate actual 0, actual/prob ratio 1, adjusted new_prob 1000, prob 1000 -> 1000
2018-01-29 11:56:35.232884 7f0dab4fd700 20 osd.46 15482 share_map_peer 0x5647cabf3800 already has epoch 15482

Currently this cluster is just storing scratch data, so it could be wiped; however, we would be more confident about using Ceph widely if we can fix errors like this...

thanks for reading, any advice appreciated,

Jake
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


