Re: Switches and latency

Hello,

Gandalf, first read:
https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg29546.html

And this thread by Nick:
https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg29708.html

More comments inline.

On Wed, 15 Jun 2016 22:26:51 +0100 Nick Fisk wrote:

> > -----Original Message-----
> > From: Gandalf Corvotempesta [mailto:gandalf.corvotempesta@xxxxxxxxx]
> > Sent: 15 June 2016 22:13
> > To: nick@xxxxxxxxxx
> > Cc: ceph-users@xxxxxxxx
> > Subject: Re:  Switches and latency
> > 
> > 2016-06-15 22:59 GMT+02:00 Nick Fisk <nick@xxxxxxxxxx>:
> > > Possibly, but by how much? 20Gb/s of bandwidth is a lot to feed
> > > 12x 7.2k disks, particularly if they start doing any sort of
> > > non-sequential IO.
> > 
> > Assuming 100MB/s for each SATA disk, 12 disks give 1200MB/s = 9600Mbit/s.
> > Why are you talking about 20Gb/s? By using VLANs on the same port for
> > both public and cluster traffic, I'll have 10Gb/s to share, but all
> > disks together can saturate the whole NIC (9600Mbit/s on a 10000Mbit/s
> > network).
> 
> So this is probably a very optimistic figure; any sort of non-4MB
> sequential workload will rapidly decrease this number. Are you planning
> on using SSD journals? That will impact the possible bandwidth you can
> achieve.
> 
Overly optimistic. 
In an idle cluster with synthetic tests you might get sequential reads
that are around 150MB/s per HDD.
As for writes, think 80MB/s, again in an idle cluster.

With any realistic, random I/O you're looking at 50MB/s per HDD at most,
either way.

So your storage nodes can't really saturate even a single 10Gb/s link in
real life situations. 
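
To put rough numbers on that (using the ballpark per-disk figures above,
not measurements from your actual hardware), a quick back-of-the-envelope
sketch:

  # Rough throughput budget: 12x 7.2k HDDs vs. one 10Gb/s link.
  # Per-disk rates are the estimates above, not benchmarks.
  DISKS = 12
  RATES_MBS = {"seq read": 150, "seq write": 80, "random": 50}
  LINK_MBS = 10_000 / 8   # one 10Gb/s link in MB/s, ignoring overhead

  for label, per_disk in RATES_MBS.items():
      total = DISKS * per_disk
      print(f"{label:9}: {total:4d} MB/s aggregate, "
            f"{total / LINK_MBS:.0%} of one 10Gb/s link")

Only the synthetic sequential-read case gets past a single 10Gb/s link;
realistic random I/O sits at roughly half of it.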

Journal SSDs can improve things, but mostly for IOPS.
In fact, they easily become the bottleneck bandwidth-wise, and they are
on most of my storage nodes: you'd need at least two 400GB DC S3710 SSDs
to get around 1GB/s of writes, or one link's worth.
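
As a sanity check on that (assuming the spec-sheet sequential write rate
of roughly 470MB/s for a 400GB DC S3710; check the datasheet for your
exact model):

  # Journal SSD write bandwidth vs. one 10Gb/s link.
  SSD_WRITE_MBS = 470      # ~spec-sheet sequential write, 400GB DC S3710
  LINK_MBS = 10_000 / 8    # one 10Gb/s link in MB/s
  ssds = 2
  journal_mbs = ssds * SSD_WRITE_MBS
  print(f"{ssds} journal SSDs: ~{journal_mbs} MB/s, "
        f"{journal_mbs / LINK_MBS:.0%} of one 10Gb/s link")

Two of them land at roughly 940MB/s, which is the "around 1GB/s, or one
link's worth" above.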

Splitting things into cluster and public networks ONLY makes sense when
your storage node can saturate ALL the network bandwidth, which is
usually only the case with very expensive SSD/NVMe-only nodes.

Going back to your original post, with a split network the latency in both
networks counts the same, as a client write will NOT be acknowledged until
it has reached the journals of all replicas, so having a higher-latency
cluster network is counterproductive.
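
A crude way to picture that (this is just a toy model of a replicated
write, not actual Ceph internals; all numbers are made up):

  # Toy model: client-visible latency of one replicated write, in ms.
  def write_latency_ms(public_rtt, cluster_rtts, journal_commit):
      # Client -> primary OSD over the public network, then the primary
      # fans out to the other replicas over the cluster network in
      # parallel; the ack waits for the slowest replica to commit.
      return public_rtt + max(cluster_rtts) + journal_commit

  # Same low-latency gear on both networks:
  print(write_latency_ms(0.25, [0.25, 0.25], 1.0))   # -> 1.5
  # Cheaper, higher-latency cluster network:
  print(write_latency_ms(0.25, [1.0, 1.0], 1.0))     # -> 2.25

The slower cluster network shows up one-for-one in the latency every
client write sees.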

And again, in real life you'll run out of IOPS long before you run out of
bandwidth, be it disk or network.
 
> I was assuming each node has 2 NICs in a bond going to separate
> switches. You get 20Gb/s of bandwidth and redundancy.
> 
> > 
> > I can't aggregate 2 ports, or I have to buy stackable switches with
> > support for LAG across both switches, which are much more expensive.
> > And obviously I can't use only one switch; the network must be fault
> > tolerant.
> 
> As above, check out the Linux bonding options. ALB mode gives both RX
> and TX load balancing, although I think it may have some weird fringe
> cases you need to test before going live with it.
> 
Look at alternative MC-LAG capable switches from Penguin, Quanta, etc.
These tend to be half the price of similar offerings from Brocade or Cisco.

Or if you can start with a clean slate (including the clients), look at
InfiniBand.
All my production clusters are running entirely on IB (IPoIB currently)
and I'm very happy with the performance, latency and cost.

> > 
> > > I think you want to try and keep it as simple as possible and make
> > > the right decision 1st time round. Buy a TOR switch that will
> > > accommodate the number of servers you wish to put in your rack and
> > > you should never have a need to change it.
> > >
> > > I think there are issues when one of the networks is down and not
> > > the other, so stick to keeping each server terminating into the same
> > > switch for all its connections; otherwise you are just inviting
> > > trouble.
> > 
> > This is not good. A network could fail. In an HA cluster, network
> > failure must be taken into consideration.
> > What I would like to do is to unplug the cable from switch 1 and plug
> > it into switch 2; a couple of seconds max. (Obviously switch 2 will be
> > temporarily connected to switch 1.)
> 

You will want 2 switches and 2 ports on each host.
Just use one network, or if you feel ambitious, use VLANs for public and
cluster; your choice.

If at all possible/affordable, run LACP on your host ports to your
MC-LAG-capable switches.

If you can't afford this, spend time (to learn and test) instead of money
on running OSPF equal-cost multi-path on your storage nodes and get the
same benefits: fully redundant and load-balanced links.

Lastly, if you can't do either of these, run your things in ALB (may not
work) or simple fail-over mode. 10Gb/s is going to be fast enough in
nearly all situations you'll encounter with these storage nodes.

Christian
-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


