Re: Ceph with Clos IP fabric


 



We have space limitations in our DCs and so have to build as densely as possible. These clusters are two racks of 500 OSDs each, though more hardware is en route to start scaling them out. With just two racks, the risk of losing a ToR and taking down the cluster was enough to justify the slight added complexity of extra ToRs to ensure we have HA at that point in the architecture. It's not adding that much complexity, as it's all handled by configuration management once you work out the kinks the first time. We use this architecture throughout our networks, so running it for Ceph is no different than running it for any of our other services. I also find it less complex and easier to debug than an MLAG setup. 

We are currently running hosts with dual 10G NICs, one to each ToR, but are evaluating 25G or 40G for upcoming deploys.

Once we gain confidence in Ceph to expand beyond a couple thousand OSDs in a cluster, I will certainly look to simplify by cutting down to one higher-throughput ToR per rack. 

The logical public/private separation keeps the two kinds of traffic on separate networks and makes monitoring easier. 

Aaron 

On Apr 23, 2017, at 12:45 AM, Richard Hesse <richard.hesse@xxxxxxxxxx> wrote:

Out of curiosity, why are you taking a scale-up approach to building your ceph clusters instead of a scale-out approach? Ceph has traditionally been geared towards a scale-out, simple shared nothing mindset. These dual ToR deploys remind me of something from EMC, not ceph. Really curious as I'd rather have 5-6 racks of single ToR switches as opposed to three racks of dual ToR. Is there a specific application or requirement? It's definitely adding a lot of complexity; just wondering what the payoff is.

Also, why are you putting your "cluster network" on the same physical interfaces but on separate VLANs? Traffic shaping/policing? What's your link speed there on the hosts? (25/40gbps?)

On Sat, Apr 22, 2017 at 12:13 PM, Aaron Bassett <Aaron.Bassett@xxxxxxxxxxxxx> wrote:
FWIW, I use a Clos fabric with layer 3 right down to the hosts and multiple ToRs to enable HA/ECMP to each node. I'm using Cumulus Linux's "redistribute neighbor" feature, which advertises a /32 for any ARP'ed neighbor. I set up the hosts with an IP on each physical interface and on an aliased loopback: lo:0. I handle the separate cluster network by adding a VLAN to each interface and routing those separately on the ToRs, with ACLs to keep the traffic apart. 
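A sketch of the host side of such a setup, in ifupdown syntax (interface names, addresses, and the VLAN ID are invented for illustration, not Aaron's actual config):

```
auto lo lo:0
iface lo inet loopback
iface lo:0 inet static
    address 10.10.100.21/32      # Ceph address, picked up by "redistribute neighbor"

auto enp3s0f0
iface enp3s0f0 inet static
    address 169.254.1.1/31       # point-to-point link to ToR 1

auto enp3s0f0.100
iface enp3s0f0.100 inet static
    address 169.254.2.1/31       # cluster-network VLAN on the same link
```

The same pattern would repeat for the remaining three uplinks.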

Their documentation may help clarify a bit:

Honestly, the trickiest part is getting the routing on the hosts right: you essentially set static routes over each link, and the kernel takes care of the ECMP.
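The host-side routing Aaron describes can be sketched with a single iproute2 multipath route (the gateway addresses are examples); the kernel then hashes flows across the nexthops:

```
ip route add default \
    nexthop via 169.254.1.0 dev enp3s0f0 \
    nexthop via 169.254.1.2 dev enp3s0f1 \
    nexthop via 169.254.1.4 dev enp4s0f0 \
    nexthop via 169.254.1.6 dev enp4s0f1
```

If one link goes down, the kernel keeps forwarding over the remaining nexthops.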

I understand this is a bit different from your setup, but Ceph has no trouble at all with the IPs on multiple interfaces. 

Aaron 

Date: Sat, 22 Apr 2017 17:37:01 +0000
From: Maxime Guyot <Maxime.Guyot@xxxxxxxxx>
To: Richard Hesse <richard.hesse@xxxxxxxxxx>, Jan Marquardt
<jm@xxxxxxxxxxx>
Cc: "ceph-users@xxxxxxxxxxxxxx" <ceph-users@xxxxxxxxxxxxxx>
Subject: Re: Ceph with Clos IP fabric
Message-ID: <919C8615-C50B-4611-9B6B-13B4FBF69C61@xxxxxxxxx>
Content-Type: text/plain; charset="utf-8"

Hi,

That only makes sense if you're running multiple ToR switches per rack for the public leaf network. Multiple public ToR switches per rack is not very common; most Clos crossbar networks run a single ToR switch. Several guides on the topic (including Arista & Cisco) suggest that you use something like MLAG in a layer 2 domain between the switches if you need some sort of switch redundancy inside the rack. This increases complexity, and most people decide that it's not worth it and instead scale out across racks to gain the redundancy and survivability that multiple ToR offer.
If you use MLAG for L2 redundancy, you'll still want 2 BGP sessions for L3 redundancy, so why not skip the MLAG altogether and terminate your BGP session on each ToR?

Judging by the routes (169.254.0.1), you are using BGP unnumbered?

It sounds like the "ip route get" output you get when using dummy0 is caused by a fallback on the default route, presumably via eth0. Can you check the exact routes received on server1 with "show ip bgp neighbors <neighbor> received-routes" (once you enable "neighbor <neighbor> soft-reconfiguration inbound"), and what's installed in the table with "ip route"?
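For reference, the Quagga/FRR commands Maxime mentions would be entered roughly like this (the neighbor address is a placeholder; adjust to your actual session):

```
# in vtysh: enable storing of received routes for the session
configure terminal
router bgp 65101
 neighbor 169.254.0.1 soft-reconfiguration inbound
end

# then inspect what the neighbor actually advertised ...
show ip bgp neighbors 169.254.0.1 received-routes

# ... and compare with what the kernel installed (from a normal shell)
ip route
```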


Intrigued by this problem, I tried to reproduce it in a lab with virtualbox. I ran into the same problem.

Side note: configuring the loopback IP on the physical interfaces is workable if you set it on **all** parallel links. Example with server1:

iface enp3s0f0 inet static
 address 10.10.100.21/32
iface enp3s0f1 inet static
 address 10.10.100.21/32
iface enp4s0f0 inet static
 address 10.10.100.21/32
iface enp4s0f1 inet static
 address 10.10.100.21/32

This should guarantee that the loopback IP is advertised as long as any of the 4 links to switch1 and switch2 is up, but I am not sure whether that's workable for Ceph's listening address.
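If the loopback address is used for Ceph, it can also be pinned explicitly in ceph.conf rather than relying on interface auto-detection. A sketch, not a tested config (the cluster subnet is invented; the public addresses reuse the example above):

```
[global]
public_network = 10.10.100.0/24
cluster_network = 10.10.200.0/24

# optionally pin a daemon to the loopback address directly:
# public_addr = 10.10.100.21
```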


Cheers,
Maxime

From: Richard Hesse <richard.hesse@xxxxxxxxxx>
Date: Thursday 20 April 2017 16:36
To: Maxime Guyot <Maxime.Guyot@xxxxxxxxx>
Cc: Jan Marquardt <jm@xxxxxxxxxxx>, "ceph-users@xxxxxxxxxxxxxx" <ceph-users@xxxxxxxxxxxxxx>
Subject: Re: Ceph with Clos IP fabric

On Thu, Apr 20, 2017 at 2:13 AM, Maxime Guyot <Maxime.Guyot@xxxxxxxxx<mailto:Maxime.Guyot@xxxxxxxxx>> wrote:
2) Why did you choose to run the ceph nodes on loopback interfaces as opposed to the /24 for the "public" interface?
I can't speak for this example, but in a Clos fabric you generally want to assign the routed IPs to a loopback rather than to physical interfaces. This way, if one of the links goes down (e.g. the public interface), the routed IP is still advertised on the other link(s).
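A minimal Quagga/FRR sketch of "assign the routed IP to a loopback and advertise it over every uplink" (the AS number, address, and interface names are invented for illustration):

```
router bgp 65101
 bgp router-id 10.10.100.21
 ! advertise the loopback /32 into the fabric
 network 10.10.100.21/32
 ! one unnumbered session per uplink; losing a link drops only that session
 neighbor enp3s0f0 interface remote-as external
 neighbor enp3s0f1 interface remote-as external
```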

That only makes sense if you're running multiple ToR switches per rack for the public leaf network. Multiple public ToR switches per rack is not very common; most Clos crossbar networks run a single ToR switch. Several guides on the topic (including Arista & Cisco) suggest that you use something like MLAG in a layer 2 domain between the switches if you need some sort of switch redundancy inside the rack. This increases complexity, and most people decide that it's not worth it and instead  scale out across racks to gain the redundancy and survivability that multiple ToR offer.

On Thu, Apr 20, 2017 at 4:04 AM, Jan Marquardt <jm@xxxxxxxxxxx<mailto:jm@artfiles.de>> wrote:

Maxime, thank you for clarifying this. Each server is configured like this:

lo/dummy0: Loopback interface; Holds the ip address used with Ceph,
which is announced by BGP into the fabric.

enp5s0: Management Interface, which is used only for managing the box.
There should not be any Ceph traffic on this one.

enp3s0f0: connected to sw01 and used for BGP
enp3s0f1: connected to sw02 and used for BGP
enp4s0f0: connected to sw01 and used for BGP
enp4s0f1: connected to sw02 and used for BGP

These four interfaces are supposed to transport the Ceph traffic.

See above. Why are you running multiple public ToR switches in this rack? I'd suggest switching them to a single layer 2 domain participating in the Clos fabric as a single unit, or scaling out across racks (preferred). Why bother with multiple switches in a rack when you can just use multiple racks? That's the beauty of Clos: just add more spines if you need more leaf-to-leaf bandwidth.

How many OSD, servers, and racks are planned for this deployment?

-richard



_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



