Re: Switches and latency

Hello,

On Thu, 16 Jun 2016 12:44:51 +0200 Gandalf Corvotempesta wrote:

> 2016-06-16 3:53 GMT+02:00 Christian Balzer <chibi@xxxxxxx>:
> > Gandalf, first read:
> > https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg29546.html
> >
> > And this thread by Nick:
> > https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg29708.html
> 
> Interesting reading. Thanks.
> 
> > Overly optimistic.
> > In an idle cluster with synthetic tests you might get sequential reads
> > that are around 150MB/s per HDD.
> > As for writes, think 80MB/s, again in an idle cluster.
> >
> > Any realistic, random I/O and you're looking at 50MB/s at most either
> > way.
> >
> > So your storage nodes can't really saturate even a single 10Gb/s link
> > in real life situations.
> 
> Ok.
> 
> > Journal SSDs can improve on things, but that's mostly for IOPS.
> > In fact they easily become the bottleneck bandwidth wise and are so on
> > most of my storage nodes.
> > Because you'd need at least 2 400GB DC S3710 SSDs to get around 1GB/s
> > writes, or one link worth.
> 
> I plan to use 1 or 2 SSD journals (probably 1 SSD for every 6 spinning disks)
> 
That's as large as I would make that failure domain. Also make sure to
choose SSDs that work well with Ceph, both in terms of endurance and sync
write speed (there are lots of threads about this).
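
A quick sanity check for the sync write part is a direct, O_DSYNC 4k write
test against the candidate SSD with fio, roughly like this (the device name
is a placeholder and the test overwrites data, so only run it on an empty
disk):

  fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k \
      --numjobs=1 --iodepth=1 --runtime=60 --time_based \
      --group_reporting --name=journal-sync-test

A DC-grade SSD like the S3710 should sustain tens of thousands of IOPS in
that test, while many consumer models drop to a few hundred, which is
exactly why those threads exist.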

> > Splitting things in cluster and public networks ONLY makes sense when
> > your storage node can saturate ALL the network bandwidth, which
> > usually is only the case when it comes to very expensive SSD/NVMe only
> > nodes.
> 
> This is not my case.
> 
> > Going back to your original post, with a split network the latency in
> > both networks counts the same, as a client write will NOT be
> > acknowledged until it has reached the journal of all replicas, so having
> > a higher latency cluster network is counterproductive.
> 
> Ok.
> 
> > Or if you can start with a clean slate (including the clients), look at
> > Infiniband.
> > All my production clusters are running entirely IB (IPoIB currently)
> > and I'm very happy with the performance, latency and cost.
> 
> Yes, i'll start with a brand new network.

Alas, you're not really starting clean; as you say below, your clients
already have 10GigE ports.
So I'll be terse from here.

> Actually I'm testing with some old IB switches (DDR) and I'm not very
> happy, as IPoIB doesn't go over 8-9Gbit/s on DDR.
You should have gotten some insight out of your "RDMA/Infiniband status"
thread, but I never bothered with DDR.

> Additionally, CX4
> cables used by DDR are... HUGE and very "hard" to bend in the rack.
> I don't know if QDR cables are thinner.
> 
Nope, but then again some of the 10GigE cables are rather stiff or
fragile, too.

> Are you using QDR? 
Yes, because it's cheaper, we don't need the bandwidth and the latency is
supposedly lower than FDR.

> I've seen a couple of used Mellanox switches on eBay
> that seem to be OK for me. 36 QDR ports would be awesome but I don't
> have any IB knowledge.
Largest switch we use is 18 ports, our clusters are small.

> Could I keep the IB fabric unconfigured and use only IPoIB ?
Pretty much, a very basic OpenSM config and you're good to go.
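
Roughly speaking (package and service names differ a bit between distros),
it comes down to running a subnet manager somewhere on the fabric and
checking the links:

  apt-get install opensm infiniband-diags  # or the equivalent yum packages
  systemctl enable --now opensm            # stock config is enough for a flat fabric
  ibstat                                   # ports should report State: Active
  iblinkinfo                               # shows every link the SM has brought up

Once the ports are Active you just configure the ib0/ib1 interfaces with
your normal IP tooling.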

> I can create a bonded (failover) IPoIB device on each node and add 2 or
> more IB cables between both switches. In a normal Ethernet network,
> these 2 cables must be joined in a LAG to avoid loops. Is InfiniBand
> able to manage this on its own?
Yes. The basic/default OpenSM routing engine needs to be told to use more
than one path where possible ("lmc 2" in its config); other routing engines
are also available and have more bells and whistles.
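
For reference, a minimal sketch of the relevant opensm.conf bits (the usual
Debian location is /etc/opensm/opensm.conf; recent OpenSM versions can dump
a complete default file with "opensm --create-config"):

  # LMC 2 gives each port 2^2 = 4 LIDs, so the SM can compute
  # up to 4 distinct paths between any pair of ports
  lmc 2
  # the default minhop engine is fine for a couple of flat switches
  routing_engine minhop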


> I've never found a way to aggregate multiple ports.
Re-read that OSPF thread...
IB has the means to do this, but alas IPoIB bonding only supports failover
(active-backup) at this time, yes.
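
For completeness, a failover bond over two IPoIB ports looks roughly like
this in Debian's /etc/network/interfaces (interface names and addresses are
placeholders; the bonding driver only accepts active-backup for IPoIB
slaves):

  auto bond0
  iface bond0 inet static
      address 192.168.100.10
      netmask 255.255.255.0
      bond-slaves ib0 ib1
      bond-mode active-backup
      bond-primary ib0
      bond-miimon 100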

> The real drawback with IB is that I have to add IB cards to each compute
> node, where my current compute nodes have 2 10GBaseT ports onboard.
> 
> This adds some costs....
> 
Then look at the 10GigE options I listed.

Christian
-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


