Hi Anthony and Phil,

since my meltdown case was mentioned and I might have a network capacity issue, here is a question about why having separate VLANs for the private and public network might have its merits.

In the part of our ceph cluster that was overloaded (our cluster consists of 2 logically separate and physically distinct sites), I see a lot of dropped packets on the spine switch, and it looks like they occur on the downlinks to the leafs where the storage servers are connected. I'm still not finished investigating, so a network overload is still only a hypothetical part of our meltdown. The question below should, however, be interesting in any case, as it might help prevent a meltdown in similar setups.

Our network connectivity is as follows: we have 1 storage server and up to 18 clients per leaf. The storage servers have 6x10G connectivity in an LACP bond; front- and back-network share all ports but are separated by VLAN. The clients have 1x10G on the public network. Unfortunately, the up-links from the leaf to the spine switches are currently limited to 2x10G. We are in the process of upgrading to 2x40G, so let's ignore fixing this temporary bottleneck here (Corona got in the way) and focus on workarounds until we can access the site again.

For every write, currently every storage server is hit (10 servers with 8+2 EC). Since we believed the low uplink bandwidth would only be a short-term condition during the network upgrade, we were willing to accept it, assuming that the competition between client and storage traffic would throttle the clients sufficiently to keep the system working, maybe with reduced performance, but without becoming unstable.

The questions relevant to this thread: I kept the separation into public and cluster network because it enables QoS definitions, which are typically per VLAN. In my situation, what if the up-links were saturated by the competing client and storage-server traffic? Both run on the same (public) VLAN, obviously. The only way to make room for the OSD/heartbeat traffic would be to give the cluster-network VLAN higher priority than the public network via QoS settings. This should at least allow the OSDs to keep exchanging heartbeats etc. over a busy line. Is this correct?

This also raises a question I had a long time ago, which was also raised by Anthony: why are the MONs not on the cluster network? If I can make a priority line for the OSDs, why can't I make OSD-MON communication a priority too?

While digging through heartbeat options as a consequence of our meltdown, I found this one:

# ceph daemon osd.0 config show | grep heart
...
    "osd_heartbeat_addr": "-",
...
# ceph daemon mon.ceph-01 config show | grep heart
...
    "osd_heartbeat_addr": "-",
...

Is it actually possible to reserve a dedicated (third) VLAN with high QoS for heartbeat traffic by providing a per-host IP address via this parameter? What does this parameter do?
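For concreteness, here is roughly the per-host setup I have in mind. All interface names, VLAN IDs, addresses and the PCP value are made up for illustration, and the switch-side queueing that would actually honour the priority is vendor-specific and not shown, so please treat this as a sketch of the idea rather than our actual configuration:

# LACP bond over the 6x10G ports (slaves must be down when enslaved; only one shown)
ip link add bond0 type bond mode 802.3ad
ip link set enp65s0f0 down
ip link set enp65s0f0 master bond0
ip link set bond0 up

# public (client-facing) VLAN, default priority
ip link add link bond0 name bond0.101 type vlan id 101
ip addr add 192.168.101.11/24 dev bond0.101
ip link set bond0.101 up

# cluster (replication/heartbeat) VLAN; egress-qos-map tags frames carrying the
# default skb priority 0 with 802.1p PCP 5, so the switches can prefer this VLAN
# on congested uplinks
ip link add link bond0 name bond0.102 type vlan id 102 egress-qos-map 0:5
ip addr add 192.168.102.11/24 dev bond0.102
ip link set bond0.102 up

# matching ceph.conf fragment:
#   [global]
#   public_network  = 192.168.101.0/24
#   cluster_network = 192.168.102.0/24

The hope would be that, with the cluster VLAN mapped to a higher-priority queue, OSD-to-OSD heartbeats and peering traffic still get through even while the client VLAN saturates the 2x10G uplinks.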
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Anthony D'Atri <anthony.datri@xxxxxxxxx>
Sent: 09 May 2020 23:59:49
To: Phil Regnauld
Cc: ceph-users@xxxxxxx
Subject: Re: Cluster network and public network

>> If your public network is saturated, that actually is a problem; the last
>> thing you want is to add recovery traffic or to slow down heartbeats.
>> For most people, it isn't saturated.
>
> See Frank Schilder's post about a meltdown which he believes could have
> been caused by beacon/heartbeat traffic being drowned out by other
> recovery/IO traffic, not at the network level, but at the processing level
> on the OSDs.
>
> If indeed there are cases where the OSDs are too busy to send (or process)
> heartbeat/beacon messaging, it wouldn't help to have a separate network?

Agreed. Many times I've had to argue that CPUs that are nowhere near saturated *aren't* necessarily overkill, especially with fast media, where latency hurts. It would be interesting to consider an architecture where a core/HT is dedicated to the control plane.

That said, I've seen a situation where excess CPU headroom appeared to affect latency by allowing the CPUs to drop into deep C-states; this especially affected network traffic (2x dual 10GE). Curiously, some systems in the same cluster experienced this but some didn't. There was a mix of Sandy Bridge and Ivy Bridge IIRC, as well as different Broadcom chips. Despite an apparent alignment with older vs. newer Broadcom chips, I never fully characterized the situation: replacing one of the Broadcom NICs in an affected system with the model in use on unaffected systems didn't resolve the issue. It's possible that replacing the other would have made a difference.
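For anyone who wants to check whether deep C-states are in play on their own nodes, a quick look along these lines can help (illustrative only; tool availability, option spellings and idle-state behaviour vary by distro and platform):

cpupower idle-info        # list the idle states the CPUs are allowed to enter
turbostat --interval 5    # watch actual C-state residency while the cluster is busy
cpupower idle-set -D 10   # temporarily disable idle states with exit latency above 10 us

And for the core/HT-per-control-plane idea, a crude first cut could be a systemd drop-in that keeps the OSD daemons off a couple of reserved cores. The unit name matches the stock ceph-osd@ template, but the core numbers are arbitrary and CPU ranges in CPUAffinity need a reasonably recent systemd:

mkdir -p /etc/systemd/system/ceph-osd@.service.d
printf '[Service]\nCPUAffinity=2-31\n' > /etc/systemd/system/ceph-osd@.service.d/cpuaffinity.conf
systemctl daemon-reload    # takes effect once the OSD services are restarted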