Network design issues

Dear cephers,

I believe we are facing a bottleneck due to an inappropriate overall network design and would like to hear about others' experiences and recommendations. I will start with a description of the urgent problem/question and follow up with more details and questions.

These observations are on our HPC home file system, which is served by Ceph. It has 12 storage servers facing 550+ client servers.

Under high load, I start seeing "slow ping time" warnings with quite incredible latencies. I suspect we have a network bottleneck. On the storage servers we have 6x10G LACP trunks. Clients are on single 10G NICs. We have separate VLANs for the front and back networks, but they both go through all NICs in the same way, so, technically, it's just one cluster network shared with the clients. The aggregated bandwidth is sufficient for a single storage server's load (it roughly matches the disk controllers' IO capacity). However, point-to-point connections are 10G only, and I believe we are starting to see clients saturate a 10G link and starve all other Ceph cluster traffic that needs to go through that link as well. This, in turn, leads to backlog effects with slow ops on unrelated OSDs, affecting the overall user experience. The number of OSDs reporting slow ping times is about the percentage one would expect if one or two 10G links are congested. It's usually just one storage server that acts up.
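
To illustrate why the aggregate trunk capacity does not help here, a minimal sketch in Python (the hash is a toy stand-in for the switch's layer-3+4 LACP hash, and the addresses/ports are invented) of how every individual TCP connection gets pinned to a single trunk member, so a handful of heavy flows can saturate one 10G member while the rest of the bond sits idle:

import hashlib
from collections import Counter

def lacp_member(src, dst, sport, dport, members=6):
    # toy stand-in for a layer-3+4 LACP hash: a given flow always lands on one member link
    key = f"{src}:{sport}->{dst}:{dport}".encode()
    return int(hashlib.md5(key).hexdigest(), 16) % members

# e.g. 100 clients each pulling a bulk read from OSDs on the same storage server
loads = Counter(
    lacp_member(f"10.0.1.{i}", "10.0.0.10", 40000 + i, 6800 + (i % 8))
    for i in range(100)
)
print(loads)  # flows per member link; each member still tops out at 10G

However many members the bond has, any single client<->OSD connection, and any unlucky group of flows hashed onto the same member, sees at most 10G.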

I guess the users with aggressive workloads getting the full bandwidth are happy, but everyone else is complaining. What I observe is that one or two clients can DoS everyone else. I typically see very high read bandwidth from only a few OSDs, and my suspicion is that this is a large job of 50-100 nodes starting the same application at the same time, for example, 50-100 clients reading the same executable simultaneously. I see 5-6GB/s and up to 10K read IOPS, which is really good in principle, except that it is not fairly shared with other users.

Question: I am starting to consider enabling QoS on the switches for traffic between storage servers and would like to know if anyone is doing this and what their experience is. Unfortunately, our network design is probably flawed and now makes this difficult; see below.

More Info.

Our FS data pool is EC 8+2 and I have fast_read enabled. Hence, the network traffic amplification for both reads and writes is quite substantial.
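
Some back-of-the-envelope numbers for the amplification (a rough sketch; it assumes the primary OSD holds one shard locally and ignores metadata and protocol overhead):

def ec_amplification(k=8, m=2):
    shard = 1.0 / k                          # each shard is 1/k of the logical payload
    write_backend = (k + m - 1) * shard      # the primary ships the other 9 shards to peer OSDs
    read_backend = (k - 1) * shard           # plain read: fetch the 7 remote data shards
    fast_read_backend = (k + m - 1) * shard  # fast_read: request all 9 remote shards, use the first 8
    return write_backend, read_backend, fast_read_backend

print(ec_amplification())  # -> (1.125, 0.875, 1.125) x the client-visible payload on the back network

So a job reading 5-6GB/s from the pool puts a comparable amount of shard traffic onto the same shared links, on top of the front-side replies to the clients.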

Our network is a spine-leaf architecture where the Ceph servers and Ceph clients are distributed more or less equally over the leaf switches. I'm afraid this is a first flaw in the design, because storage servers and clients compete for the same switches and the clients greatly outnumber the storage servers. It also makes implementing QoS a real pain, whereas it could be simple traffic shaping on an uplink trunk to the clients if the storage servers were isolated.

This is the first design question: an isolated storage cluster providing service via uplinks/gateways versus an "integrated/hyper-converged" layout where storage servers and clients are distributed equally over a spine-leaf architecture. Pros and cons?

We have a 100G spine VLT pair with ports configured as 40G. Uplinks from the leaf switches are 2x40G; in fact, we have the leaf switches configured as VLT pairs for HA as well. A pair has 2x2x40G uplinks and 2x40G VLT interlinks. There are 2 Ceph servers per VLT leaf-pair and ca. 85+ client servers on the same pair. There are also clients on leaf switches without Ceph servers. I don't think the 40G uplinks are congested, but you never know.
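
For what it's worth, a rough worst-case oversubscription calculation for such a leaf-pair with the numbers above (it assumes every downstream port bursts at line rate simultaneously, which of course never quite happens):

uplink_gbps = 2 * 2 * 40        # 2x2x40G uplinks per VLT leaf-pair
ceph_down_gbps = 2 * 6 * 10     # 2 storage servers with 6x10G LACP each
client_down_gbps = 85 * 10      # ca. 85 clients on single 10G NICs
oversub = (ceph_down_gbps + client_down_gbps) / uplink_gbps
print(f"{oversub:.1f}:1")       # roughly 6:1 if everything bursts at once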

We started with the Ceph servers having 15 HDDs for FS data and 1 SSD for FS metadata each. With this configuration, the disk speed was the bottleneck and I observed slow ops under high load, but everything was more or less stable. I recently changed an MDS setting that greatly improved both client performance and the clients' ability to overload OSDs. In addition, one week ago I added 20 HDDs in a JBOD per host, which more than doubled the HDD throughput. Together, these two performance increases now have the counter-intuitive effect that aggregated performance has tripled compared to 2 months ago, but the user experience is very erratic. My suspicion is, as explained above, that each server can now handle a volume of traffic that easily saturates a 10G link, leading to observations that seem to indicate insufficient network capacity whenever too many client/cluster requests go through the same 10G link.
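
Rough numbers for why I think a single server can now outrun any single 10G path (assuming ~150MB/s of sequential throughput per HDD, which is a guess, not a measured value):

hdds_per_host = 15 + 20          # the original 15 HDDs plus the 20 added via JBOD
mb_per_s_per_hdd = 150           # assumed sequential throughput per HDD
host_disk_gbit = hdds_per_host * mb_per_s_per_hdd * 8 / 1000
print(f"{host_disk_gbit:.0f} Gbit/s of disk bandwidth vs 10 Gbit/s on any single point-to-point path")

That is roughly 42 Gbit/s per host against the trunk's aggregate 60G, but only 10G towards any individual peer.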

In essence, we increased aggregated performance greatly but users complain more than ever.

I suspect that this imbalance between the servers' throughput capability and the 10G point-to-point limitation is a problem. However, I cannot change the networking and would like some advice on how similar set-ups are configured and whether QoS can help. My idea is to enable dot1p (802.1p) layer-2 QoS and give traffic coming from the ports where the storage servers are connected a higher priority than traffic coming from everywhere else. I know it would be a lot simpler if the storage cluster were isolated, but I have to deal with the situation as is for now. Any advice and experience is highly appreciated.
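
To make the intent concrete, a toy model of strict-priority dequeueing on a congested egress port (purely illustrative; the queue names and frame counts are made up): frames tagged with the storage-port priority always drain first, so OSD heartbeats and replication keep moving while client bulk traffic shares whatever capacity is left.

from collections import deque

# two egress queues on a congested port: dot1p-tagged cluster traffic vs everything else
queues = {
    "storage": deque(f"osd-frame-{i}" for i in range(3)),
    "client":  deque(f"client-frame-{i}" for i in range(10)),
}

def dequeue_strict_priority():
    # always serve the high-priority queue first; clients only get leftover transmit slots
    for cls in ("storage", "client"):
        if queues[cls]:
            return cls, queues[cls].popleft()
    return None

while (item := dequeue_strict_priority()) is not None:
    print(*item)

Of course, with strict priority one has to make sure the storage class cannot starve the clients completely; weighted schemes exist for that reason.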

If I do it, should I apply QoS to both the front and back networks, or is QoS on the VLAN for the back network enough? Note that the MONs are only on the front network.

Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14