Hello,

On Thu, 16 Jun 2016 12:44:51 +0200 Gandalf Corvotempesta wrote:

> 2016-06-16 3:53 GMT+02:00 Christian Balzer <chibi@xxxxxxx>:
> > Gandalf, first read:
> > https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg29546.html
> >
> > And this thread by Nick:
> > https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg29708.html
>
> Interesting reading. Thanks.
>
> > Overly optimistic.
> > In an idle cluster with synthetic tests you might get sequential reads
> > of around 150MB/s per HDD.
> > As for writes, think 80MB/s, again in an idle cluster.
> >
> > With any realistic, random I/O you're looking at 50MB/s at most either
> > way.
> >
> > So your storage nodes can't really saturate even a single 10Gb/s link
> > in real-life situations.
>
> Ok.
>
> > Journal SSDs can improve on things, but that's mostly for IOPS.
> > In fact they easily become the bottleneck bandwidth-wise and are so on
> > most of my storage nodes, because you'd need at least 2 400GB DC S3710
> > SSDs to get around 1GB/s of writes, or one link's worth.
>
> I plan to use 1 or 2 SSD journals (probably 1 SSD for every 6 spinning
> disks).

That's as large as I would make that failure domain; also make sure to
choose SSDs that work well with Ceph, endurance- and sync-write-speed
wise (there are lots of threads about this).

> > Splitting things into cluster and public networks ONLY makes sense
> > when your storage node can saturate ALL the network bandwidth, which
> > is usually only the case with very expensive SSD/NVMe-only nodes.
>
> This is not my case.
>
> > Going back to your original post, with a split network the latency of
> > both networks counts the same, as a client write will NOT be
> > acknowledged until it has reached the journal of all replicas, so
> > having a higher-latency cluster network is counterproductive.
>
> Ok.
>
> > Or, if you can start with a clean slate (including the clients), look
> > at Infiniband.
> > All my production clusters run entirely on IB (IPoIB currently) and
> > I'm very happy with the performance, latency and cost.
>
> Yes, I'll start with a brand-new network.

Alas, not really: as you say below, your clients already have 10GigE
ports.

So I'll be terse from here on.

> Actually I'm testing with some old IB switches (DDR) and I'm not very
> happy, as IPoIB doesn't go over 8/9Gbit/s on DDR.

You should have gotten some insight out of your "RDMA/Infiniband status"
thread, but I never bothered with DDR.

> Additionally, the CX4 cables used by DDR are... HUGE and very "hard"
> to bend in the rack.
> I don't know if QDR cables are thinner.

Nope, but then again some of the 10GigE cables are rather stiff or
fragile, too.

> Are you using QDR?

Yes, because it's cheaper, we don't need the bandwidth, and the latency
is supposedly lower than FDR.

> I've seen a couple of used Mellanox switches on eBay that seem to be
> OK for me. 36 QDR ports would be awesome, but I don't have any IB
> knowledge.

The largest switch we use is 18 ports; our clusters are small.

> Could I keep the IB fabric unconfigured and use only IPoIB?

Pretty much, a very basic OpenSM config and you're good to go.

> I can create a bonded (failover) IPoIB device on each node and add 2 or
> more IB cables between both switches. In a normal Ethernet network,
> these 2 cables must be joined in a LAG to avoid loops. Is Infiniband
> able to manage this on its own?

Yes.
The basic/default OpenSM routing engine needs to be told to use more
than one path where possible ("lmc 2"); other routing engines are also
available and have more bells and whistles.

> I've never found a way to aggregate multiple ports.

Re-read that OSPF thread...
IB has means to do this, but IPoIB bonding only supports failover at
this time, yes.
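For reference, the LMC bit is a one-line change for OpenSM. A rough
sketch of what that looks like (the exact file name and option syntax
vary between OpenSM versions and distros, so check man opensm for
yours):

    # /etc/opensm/opensm.conf (or opensm.opts, depending on the distro)
    # LMC of 2 gives every port 2^2 = 4 LIDs, so the routing engine can
    # spread traffic over up to 4 paths between any pair of ports.
    lmc 2

    # One-shot equivalent while testing:
    #   opensm --lmc 2

    # A different routing engine is selected the same way, e.g.:
    #   routing_engine ftree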
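And the failover bond itself is plain active-backup bonding, nothing
IB-specific in the config. A sketch of a Debian-style
/etc/network/interfaces stanza (interface names and addresses are made
up, ifenslave assumed installed; adapt to your distro):

    # Failover-only IPoIB bond; mode must be active-backup, there is
    # no LACP/LAG equivalent for IPoIB slaves.
    auto bond0
    iface bond0 inet static
        address 10.0.0.11
        netmask 255.255.255.0
        bond-slaves ib0 ib1
        bond-mode active-backup
        bond-miimon 100
        # Keep the bond IB-only; don't mix Ethernet slaves into it.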
> The real drawback with IB is that I have to add IB cards to each
> compute node, whereas my current compute nodes have 2 10GBaseT ports
> onboard.
>
> This adds some cost....

Then look at the 10GigE options I listed.

Christian
--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com