Re: Ceph with Clos IP fabric

Christian Balzer <chibi@xxxxxxx> · Mon, 24 Apr 2017 14:23:19 +0900

On Sun, 23 Apr 2017 21:59:17 -0700 Richard Hesse wrote:

> It's not a requirement to build out homogeneous racks of ceph gear. Most
> larger places don't do that (it creates weird hot spots).  If you have 5
> racks of gear, you're better off spreading out servers in those 5 than just
> a pair of racks that are really built up. In Aaron's case, he can easily do
> that since he's not using a cluster network.
> 
If you have 5 racks of space, sure.

With our ISP related stuff (no Ceph there though) I'm spreading things out
especially with mailbox cluster servers to avoid... floor loading issues.

Those 60 HDD top loader servers from Supermicro are heavy. ^o^

> Just be sure to dial in your crush map and failure domains with only a pair
> of installed cabinets.
> 
> Thanks for sharing Christian! It's always good to hear about how others are
> using and deploying Ceph, while coming to similar and different conclusions.
> 
> Also, when you say datacenter space is expensive, are you referring to power
> or actual floor space? Datacenter space is almost always sold by power and
> floor space is usually secondary. Are there markets where that's opposite?
> If so, those are ripe for new entrants!
> 
I'm pretty sure no "new entrants" (at least not low-cost ones) will be
showing up to sell DC space in the most coveted location in Tokyo, the
Otemachi Telecom and IX concentration.
Pretty much any building in this map view has DC or IX facilities in it:
https://www.google.com/maps/place/Otemachi+Financial+City+South+Tower/@35.6867366,139.7653056,18z/data=!4m5!3m4!1s0x60188c071080d7cb:0xabdd225cf834a9c8!8m2!3d35.6876625!4d139.7660164

So yeah, while power will also factor in, actual space is at a premium,
unless you opt for the sticks and that's not desirable for a number of
reasons, latency amongst them. 

And as I said, actually getting racks in these places can be difficult. We
luckily reserved 3 racks about 2 rows over from our current "unit" and thus
can build the next unit there w/o much interconnect overhead.
All the directly neighboring racks or rows were already full by the time
we built the first unit...

Christian

> 
> On Apr 23, 2017 7:56 PM, "Christian Balzer" <chibi@xxxxxxx> wrote:
> 
> 
> Hello,
> 
> Aaron pretty much stated most of what I was going to write, but to
> generalize things and make some points more obvious, I shall pipe up as
> well.
> 
> On Sat, 22 Apr 2017 21:45:58 -0700 Richard Hesse wrote:
> 
> > Out of curiosity, why are you taking a scale-up approach to building your
> > ceph clusters instead of a scale-out approach? Ceph has traditionally been
> > geared towards a scale-out, simple shared nothing mindset.  
> 
> While true, scale-out does come at a cost:
> a) rack space, which is mighty expensive where we want/need to be and also
> of limited availability in those locations.
> b) increased costs by having more individual servers, as in having two
> servers with 6 OSDs versus 1 with 12 OSDs will cost you about 30-40% more
> at the least (chassis, MB, PSU, NIC).
> 
> And then there is the whole scale thing in general, I'm getting the
> impression that the majority of Ceph users have small to at best medium
> sized clusters, simply because they don't need all that much capacity (in
> terms of storage space).
> 
> Case in point, our main production Ceph clusters fit into 8-10U with 3-4
> HDD based OSD servers and 2-4 SSD based cache tiers, obviously at this
> size with everything being redundant (switches, PDU, PSU).
> Serving hundreds (nearly 600 atm) of VMs, with a planned peak around
> 800 VMs.
> That Ceph cluster will never have to grow beyond this size.
> For me Ceph (RBD) was/is a more scalable approach than DRBD, allowing for
> n+1 compute node deployments instead of having pairs (where one can't live
> migrate to outside of this pair).
> 
> > These dual ToR
> > deploys remind me of something from EMC, not ceph. Really curious as I'd
> > rather have 5-6 racks of single ToR switches as opposed to three racks of
> > dual ToR. Is there a specific application or requirement? It's definitely
> > adding a lot of complexity; just wondering what the payoff is.
> >  
> 
> If you have plenty of racks, bully for you.
> Though personally I'd try to keep failure domains (especially when they
> are as large as a full rack!) to something like 10% of the cluster.
> We're not using Ethernet for the Ceph network (IPoIB), but if we were it
> would be dual ToRs with MC-LAG (and dual PSU, PDU) all the way.
> Why have a SPOF that WILL impact your system (a rack worth of data
> movement) in the first place?
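
To put rough numbers on the 10% rule of thumb above, here is a minimal
Python sketch. It assumes equally sized racks and a CRUSH failure domain of
one rack; the function names are made up for illustration and the rack
counts are the ones mentioned in this thread.

```python
def share_per_domain(num_domains: int) -> float:
    """Percent of the cluster that sits in one of num_domains equal domains."""
    if num_domains < 1:
        raise ValueError("need at least one failure domain")
    return 100.0 / num_domains


def domains_needed(max_percent: float) -> int:
    """Smallest number of equal domains so that one holds <= max_percent."""
    if max_percent <= 0:
        raise ValueError("max_percent must be positive")
    n = 1
    while 100.0 / n > max_percent:
        n += 1
    return n


# Rack counts discussed in the thread: 3 racks versus 5-6 racks.
for racks in (3, 5, 6):
    print(f"{racks} racks: {share_per_domain(racks):.1f}% of the cluster per rack")

print("racks needed for a 10% failure domain:", domains_needed(10))
```

With rack-level failure domains, neither 3 nor 5-6 racks gets anywhere near
10%, which is the argument for removing the single ToR as a point of
failure instead.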
> 
> Regards,
> 
> Christian
> 
> > Also, why are you putting your "cluster network" on the same physical
> > interfaces but on separate VLANs? Traffic shaping/policing? What's your
> > link speed there on the hosts? (25/40gbps?)
> >
> > On Sat, Apr 22, 2017 at 12:13 PM, Aaron Bassett
> > <Aaron.Bassett@xxxxxxxxxxxxx> wrote:
> >  
> > > FWIW, I use a CLOS fabric with layer 3 right down to the hosts and
> > > multiple ToRs to enable HA/ECMP to each node. I'm using Cumulus Linux's
> > > "redistribute neighbor" feature, which advertises a /32 for any ARP'ed
> > > > neighbor. I set up the hosts with an IP on each physical interface and
> > > > on an aliased loopback: lo:0. I handle the separate cluster network by
> > > > adding a VLAN to each interface and routing those separately on the
> > > > ToRs, with ACLs to keep traffic apart.
> > >
> > > Their documentation may help clarify a bit:
> > > > https://docs.cumulusnetworks.com/display/DOCS/Redistribute+Neighbor#RedistributeNeighbor-ConfiguringtheHost(s)
> > >
> > > > Honestly the trickiest part is getting the routing on the hosts right,
> > > > you essentially set static routes over each link and the kernel takes
> > > > care of the ECMP.
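
For anyone unfamiliar with ECMP, the general idea of hash-based next-hop
selection can be sketched in Python as below. This is only an illustration
of the concept, not the Linux kernel's actual algorithm: the SHA-256 hash,
the function name and the destination address are assumptions, and the
interface names are borrowed from Jan's host description further down.

```python
import hashlib


def pick_next_hop(src_ip: str, dst_ip: str, next_hops: list) -> str:
    """Pick one of several equal-cost next hops for a given flow.

    The same (src, dst) pair always maps to the same next hop, so packets
    of one flow stay on one link while different flows spread out.
    """
    if not next_hops:
        raise ValueError("no usable next hop")
    key = f"{src_ip}->{dst_ip}".encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:4], "big") % len(next_hops)
    return next_hops[index]


# Hypothetical example: four uplinks, all carrying an equal-cost route.
links = ["enp3s0f0", "enp3s0f1", "enp4s0f0", "enp4s0f1"]
print(pick_next_hop("10.10.100.21", "10.10.100.22", links))

# If a link fails it is dropped from the list and flows rehash onto the rest.
print(pick_next_hop("10.10.100.21", "10.10.100.22", links[1:]))
```
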
> > >
> > > I understand this is a bit different from your setup, but Ceph has no
> > > trouble at all with the IPs on multiple interfaces.
> > >
> > > Aaron
> > >
> > > Date: Sat, 22 Apr 2017 17:37:01 +0000
> > > From: Maxime Guyot <Maxime.Guyot@xxxxxxxxx>
> > > To: Richard Hesse <richard.hesse@xxxxxxxxxx>, Jan Marquardt
> > > <jm@xxxxxxxxxxx>
> > > Cc: "ceph-users@xxxxxxxxxxxxxx" <ceph-users@xxxxxxxxxxxxxx>
> > > Subject: Re:  Ceph with Clos IP fabric
> > >
> > > Hi,
> > >
> > > > That only makes sense if you're running multiple ToR switches per rack
> > > > for the public leaf network. Multiple public ToR switches per rack is
> > > > not very common; most Clos crossbar networks run a single ToR switch.
> > > > Several guides on the topic (including Arista & Cisco) suggest that you
> > > > use something like MLAG in a layer 2 domain between the switches if you
> > > > need some sort of switch redundancy inside the rack. This increases
> > > > complexity, and most people decide that it's not worth it and instead
> > > > scale out across racks to gain the redundancy and survivability that
> > > > multiple ToR offer.
> > >
> > > > If you use MLAG for L2 redundancy, you'll still want 2 BGP sessions for
> > > > L3 redundancy, so why not skip the MLAG altogether and terminate your
> > > > BGP session on each ToR?
> > >
> > > > Judging by the routes (169.254.0.1), you are using BGP unnumbered?
> > >
> > > > It sounds like the "ip route get" output you get when using dummy0 is
> > > > caused by a fallback on the default route, supposedly on eth0? Can you
> > > > check the exact routes received on server1 with "show ip bgp neighbors
> > > > <neighbor> received-routes" once you enable "neighbor <neighbor>
> > > > soft-reconfiguration inbound", and what's installed in the table with
> > > > "ip route"?
> > >
> > >
> > > Intrigued by this problem, I tried to reproduce it in a lab with
> > > virtualbox. I ran into the same problem.
> > >
> > > Side note: Configuring the loopback IP on the physical interfaces is
> > > workable if you set it on **all** parallel links. Example with server1:
> > >
> > > > iface enp3s0f0 inet static
> > > >  address 10.10.100.21/32
> > > > iface enp3s0f1 inet static
> > > >  address 10.10.100.21/32
> > > > iface enp4s0f0 inet static
> > > >  address 10.10.100.21/32
> > > > iface enp4s0f1 inet static
> > > >  address 10.10.100.21/32
> > >
> > > > This should guarantee that the loopback ip is advertised if one of the 4
> > > > links to switch1 and switch2 is up, but I am not sure if that's workable
> > > > for ceph's listening address.
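
Since the point of the example above is that every parallel link carries
the identical /32, the stanzas can be generated rather than typed by hand.
A small Python sketch, assuming the ifupdown syntax and the interface names
and address shown above; the helper name is made up for illustration.

```python
import ipaddress
from typing import Iterable


def interface_stanzas(address: str, interfaces: Iterable[str]) -> str:
    """Render one ifupdown stanza per interface, all with the same /32."""
    parsed = ipaddress.ip_interface(address)
    if parsed.network.prefixlen != 32:
        raise ValueError("expected a /32 host address")
    lines = []
    for name in interfaces:
        lines.append(f"iface {name} inet static")
        lines.append(f" address {address}")
    return "\n".join(lines)


# Interface names and address taken from the server1 example above.
LINKS = ["enp3s0f0", "enp3s0f1", "enp4s0f0", "enp4s0f1"]
print(interface_stanzas("10.10.100.21/32", LINKS))
```

Generating the config from one list makes it harder to forget a link, which
matters because the address has to be on all of them for this to work.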
> > >
> > >
> > > Cheers,
> > > Maxime
> > >
> > > From: Richard Hesse <richard.hesse@xxxxxxxxxx>
> > > Date: Thursday 20 April 2017 16:36
> > > To: Maxime Guyot <Maxime.Guyot@xxxxxxxxx>
> > > Cc: Jan Marquardt <jm@xxxxxxxxxxx>, "ceph-users@xxxxxxxxxxxxxx" <
> > > ceph-users@xxxxxxxxxxxxxx>
> > > Subject: Re:  Ceph with Clos IP fabric
> > >
> > > > On Thu, Apr 20, 2017 at 2:13 AM, Maxime Guyot <Maxime.Guyot@xxxxxxxxx>
> > > > wrote:
> > >
> > > 2) Why did you choose to run the ceph nodes on loopback interfaces as
> > > opposed to the /24 for the "public" interface?
> > >
> > > > I can't speak for this example, but in a Clos fabric you generally want
> > > > to assign the routed IPs on loopback rather than physical interfaces.
> > > > This way, if one of the links goes down (e.g. the public interface), the
> > > > routed IP is still advertised on the other link(s).
> > >
> > > > That only makes sense if you're running multiple ToR switches per rack
> > > > for the public leaf network. Multiple public ToR switches per rack is
> > > > not very common; most Clos crossbar networks run a single ToR switch.
> > > > Several guides on the topic (including Arista & Cisco) suggest that you
> > > > use something like MLAG in a layer 2 domain between the switches if you
> > > > need some sort of switch redundancy inside the rack. This increases
> > > > complexity, and most people decide that it's not worth it and instead
> > > > scale out across racks to gain the redundancy and survivability that
> > > > multiple ToR offer.
> > >
> > > > On Thu, Apr 20, 2017 at 4:04 AM, Jan Marquardt <jm@xxxxxxxxxxx> wrote:
> > >
> > > > Maxime, thank you for clarifying this. Each server is configured like
> > > > this:
> > >
> > > lo/dummy0: Loopback interface; Holds the ip address used with Ceph,
> > > which is announced by BGP into the fabric.
> > >
> > > enp5s0: Management Interface, which is used only for managing the box.
> > > There should not be any Ceph traffic on this one.
> > >
> > > enp3s0f0: connected to sw01 and used for BGP
> > > enp3s0f1: connected to sw02 and used for BGP
> > > enp4s0f0: connected to sw01 and used for BGP
> > > enp4s0f1: connected to sw02 and used for BGP
> > >
> > > These four interfaces are supposed to transport the Ceph traffic.
> > >
> > > > See above. Why are you running multiple public ToR switches in this
> > > > rack? I'd suggest switching them to a single layer 2 domain and
> > > > participating in the Clos fabric as a single unit, or scaling out across
> > > > racks (preferred). Why bother with multiple switches in a rack when you
> > > > can just use multiple racks? That's the beauty of Clos: just add more
> > > > spines if you need more leaf to leaf bandwidth.
> > >
> > > > How many OSDs, servers, and racks are planned for this deployment?
> > >
> > > -richard
> > >
> > >
> > > CONFIDENTIALITY NOTICE
> > > This e-mail message and any attachments are only for the use of the
> > > intended recipient and may contain information that is privileged,
> > > confidential or exempt from disclosure under applicable law. If you are  
> not
> > > the intended recipient, any disclosure, distribution or other use of  
> this
> > > e-mail message or attachments is prohibited. If you have received this
> > > e-mail message in error, please delete and notify the sender  
> immediately.
> > > Thank you.
> > >
> > > _______________________________________________
> > > ceph-users mailing list
> > > ceph-users@xxxxxxxxxxxxxx
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >
> > >  
> 
> 
> --
> Christian Balzer        Network/Systems Engineer
> chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
