Dear ceph users,
We have a large-ish Ceph cluster with about 3500 OSDs. We run 3 mons on
dedicated hosts; the mons typically use a few percent of a core and
generate about 50 Mbit/s of network traffic. They are connected at
20 Gbit/s (bonded dual 10 Gbit) and run on 2x14-core servers.
We recently had to shut Ceph down completely for maintenance (which we
rarely do) and had significant difficulty starting it back up. The
symptoms included OSDs hanging on startup, being marked down, flapping,
and so on. After some investigation we found that the monitors'
20 Gbit/s network interfaces were completely saturated while the OSDs
were starting, and the monitor processes were using about 3 cores
(300% CPU). We ended up having to start the OSDs very slowly so that
the monitors could keep up - it took about 4 hours to start 3500 OSDs
(roughly one OSD every 4 seconds). We tried setting noout and nodown,
but that didn't really help either.
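For reference, what we ended up doing is roughly equivalent to the
following (a minimal sketch only - it assumes systemd-managed OSDs with
ceph-osd@<id> units on the local host, and the delay is just what
happened to work for us):

#!/usr/bin/env python3
# Minimal sketch of the staggered start we ended up doing.
# Assumes systemd-managed OSDs (ceph-osd@<id> units) on the local host;
# the OSD ids to start are passed on the command line, and the delay is
# tuned to whatever the monitors can keep up with.
import subprocess
import sys
import time

DELAY_SECONDS = 4  # roughly the rate that worked for us

for osd_id in sys.argv[1:]:
    subprocess.run(["systemctl", "start", f"ceph-osd@{osd_id}"], check=True)
    time.sleep(DELAY_SECONDS)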
Here are a few questions that it would be good to understand in order
to move to a better configuration:
1. How does the monitor traffic scale with the number of OSDs?
Presumably the traffic comes from distributing cluster maps as the
cluster state changes while the OSDs start. The cluster map is perhaps
O(N) in size for N OSDs, and each OSD needs an update on every cluster
change, so a single change would generate O(N^2) traffic. As the OSDs
start, the cluster changes quite a lot (N times?), so would that make
the startup traffic O(N^3)? If so, that sounds pretty scary for
scalability. (There is a rough back-of-envelope sketch of this
reasoning after question 3 below.)
2. Would adding more monitors help here? Presumably each OSD gets its
maps from a single monitor, so additional monitors would share the
traffic. Would the inter-monitor communication/elections/etc. become
problematic with more monitors (5, 7, or even more)? Would more
monitors be recommended, and if so, how many is practical?
3. Are there any config parameters useful for tuning this traffic
(perhaps sending mon updates less frequently, or something along those
lines)?
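To make the scaling worry in question 1 concrete, here is the
back-of-envelope I had in mind. The per-OSD map size and the assumption
of roughly one map epoch per starting OSD are pure guesses for
illustration, and I realize the mons can send incremental maps rather
than full ones, so this is more of a worst case than a measurement:

# Back-of-envelope for the O(N^3) worry in question 1.
# The per-OSD map size and "one map epoch per starting OSD" are guesses
# purely to illustrate the scaling, not measured values.

def startup_map_traffic(n_osds, map_bytes_per_osd=100):
    """Very rough estimate of total bytes the mons push during startup."""
    map_size = n_osds * map_bytes_per_osd   # full map is roughly O(N)
    per_epoch = n_osds * map_size           # every OSD gets the new map: O(N^2)
    epochs = n_osds                         # assume ~one map change per starting OSD
    return epochs * per_epoch               # total is then O(N^3)

for n in (500, 1000, 3500):
    print(f"{n:5d} OSDs -> ~{startup_map_traffic(n) / 1e9:.0f} GB total")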
Any other advice on this topic would also be helpful.
Thanks,
Andras