Re: Create a back network?

> Hello Nicola,
> 
> In my world, any cluster that has a backend network different from the
> public network is misdesigned and broken.

I wouldn’t go quite so far as to describe it as “broken”. I do think it was hatched early on, when both the efficiency of Ceph replication/backfill and commonly deployed networking tech were in earlier states of development. Some of our community must needs make do with leaner resources than we would like, including the false economy of HDDs, but I digress...

Back then my sense is that a significant fraction of deployments were using 1GE networking, and for sure both CRUSH and replication contributed to more data moving around than was strictly necessary. The default backfill/recovery values were also much, much higher than they are today. So one of the motivations for such an architecture was to ensure that client and replication traffic did not DoS each other.
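(For the curious: on a modern release those throttles are easy to inspect yourself. Option names below are current as of recent releases; note that under the mClock scheduler they are largely superseded by profiles such as osd_mclock_profile.)

    # Inspect the current backfill/recovery throttles
    ceph config get osd osd_max_backfills
    ceph config get osd osd_recovery_max_active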

These days I suggest 25/50/100GE networking tech in most cases, but 10GE is still quite common, and in many cases adequate, especially when slow OSD media are in play or the client workload is light. I think I mention elsewhere in the docs that one can actually bond more than two network links, though this is seldom done. It’s important in any bonding scenario to choose the appropriate xmit hash policy to use the additional throughput, but I like to stress that bonding is primarily for availability and that one should plan for a scenario where only one link’s worth of throughput is available.
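For reference, a minimal iproute2 sketch of an LACP bond with a layer3+4 hash; bond0/eth0/eth1 are placeholder names, and the switch side must of course be configured for 802.3ad as well:

    ip link add bond0 type bond mode 802.3ad xmit_hash_policy layer3+4 miimon 100
    ip link set eth0 down && ip link set eth0 master bond0   # slaves must be down to enslave
    ip link set eth1 down && ip link set eth1 master bond0
    ip link set bond0 up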

I think it was the inestimable Wido Hollander who first got me to drink the no-replication-network Kool-Aid, and I softened the guidance in the docs after discussion with Sage Weil. There are legitimate cases for a separate replication network, e.g. when one is stuck with … vintage network infrastructure. Indeed I recall having written, I think in my book but perhaps in the public docs, that if one has a limited public network, deploying an isolated replication network can sometimes be advantageous, especially if one can step up to a faster layer 1 for it without having to spend to revamp the entire enterprise.
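For completeness, the split itself is just two options. The subnets below are made up for illustration, and OSDs need a restart to bind to a newly defined cluster network:

    ceph config set global public_network  10.0.0.0/24     # example subnet
    ceph config set global cluster_network 192.168.0.0/24  # example subnet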



> The reason is clearly spelled out
> here
> https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-osd/#flapping-osds
> 
>> When a private network (or even a single host link) fails or degrades while
>> the public network continues operating normally, OSDs may not handle this
>> situation well. In such situations, OSDs use the public network to report
>> each other down to the monitors, while marking themselves up. The
>> monitors then send out-- again on the public network--an updated cluster
>> map with the affected OSDs marked down. These OSDs reply to the monitors
>> “I’m not dead yet!”, and the cycle repeats. We call this scenario
>> ‘flapping’, and it can be difficult to isolate and remediate. Without a
>> private network, this irksome dynamic is avoided: OSDs are generally either
>> up or down without flapping.

It warms the limpets of my heart when someone quotes something I wrote <3
Especially when it invokes Monty Python as an allegorical tool.

Several difficult-to-troubleshoot instances of the above were the breaking point for me.
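For anyone bitten by this today: the troubleshooting page linked above describes temporarily keeping the monitors from acting on those down reports while you chase the underlying network problem, roughly:

    ceph osd set nodown     # monitors stop marking OSDs down
    # ...find and fix the replication network issue...
    ceph osd unset nodown   # don't forget to clear the flag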

> And when such flapping happens, the cluster essentially halts, from the
> client perspective, because of the uninterrupted stream of pauses for
> peering.

I experienced frequent dips in performance, but not quite “halts”. YMMV.

> 
> The only situations where I would just growl instead of complaining as
> loudly as possible are:
> 
> 1. The user has an LACP bond of two interfaces for the backend network,
> connected to two different switches in a MC-LAG configuration, plus a
> monitoring system that raises even a single link down as a critical
> incident.

This is one reason to eschew active/passive bonding: the passive link can be broken without one knowing until you need it, e.g. during switch maintenance. This is one of the first things I fixed at a prior job.
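A quick way to catch a dead standby link before you need it (bond0 as an example name):

    cat /proc/net/bonding/bond0   # shows per-slave "MII Status: up/down"
    ip -br link show              # one-line carrier state per interface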

> 
> 2. The user proves that the backend network cannot fail without the public
> network also failing. This includes the case where both are implemented
> using VLANs on the same NIC, plus possibly some QoS settings on the switch.
> 
> 
> On Sat, Feb 15, 2025 at 8:58 AM Nicola Mori <nicolamori@xxxxxxx> wrote:
> 
>> Dear Ceph users,
>> 
>> I have a question about the benefits of creating a back network for my
>> cluster. Currently, my machines are all connected to a gigabit switch,
>> each with 2 bonded interfaces. The cluster just has a front network.
>> Recently I got a new gigabit switch, so now I have 2 and I was wondering
>> if there would be any benefit in setting up a back network on one switch
>> and a front network on the other, using one interface per network. In
>> this way I'll lose the redundancy, but other benefits and drawbacks are
>> not clear to me.
>> Would some expert help me in better understanding this issue, please?
>> Thanks in advance,
>> 
>> Nicola
>> 
> 
> 
> -- 
> Alexander Patrakov
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



