Joseph,

I've downloaded and read the presentation from 'Sean Hefty / Intel
Corporation' about rsockets, which sounds very promising to me. Can you
please tell me how to get access to the rsockets source?

Thanks,
-Dieter

On Thu, Nov 08, 2012 at 09:12:45PM +0100, Joseph Glanville wrote:
> On 9 November 2012 02:00, Atchley, Scott <atchleyes@xxxxxxxx> wrote:
> > On Nov 8, 2012, at 9:39 AM, Mark Nelson <mark.nelson@xxxxxxxxxxx> wrote:
> >
> >> On 11/08/2012 07:55 AM, Atchley, Scott wrote:
> >>> On Nov 8, 2012, at 3:22 AM, Gandalf Corvotempesta <gandalf.corvotempesta@xxxxxxxxx> wrote:
> >>>
> >>>> 2012/11/8 Mark Nelson <mark.nelson@xxxxxxxxxxx>:
> >>>>> I haven't done much with IPoIB (just RDMA), but my understanding is
> >>>>> that it tends to top out at around 15 Gb/s. Some others on this
> >>>>> mailing list can probably speak more authoritatively. Even with RDMA
> >>>>> you are going to top out at around 3.1-3.2 GB/s.
> >>>>
> >>>> 15 Gb/s is still faster than 10 GbE. But this speed limit seems to be
> >>>> kernel-related and should be the same even in a 10 GbE environment,
> >>>> or not?
> >>>
> >>> We have a test cluster with Mellanox QDR HCAs (i.e. NICs). When using
> >>> Verbs (the native IB API), I see ~27 Gb/s between two hosts. When
> >>> running sockets over these devices using IPoIB, I see 13-22 Gb/s
> >>> depending on whether I use interrupt affinity and process binding.
> >>>
> >>> For our Ceph testing, we will set the affinity of two of the mlx4
> >>> interrupt handlers to cores 0 and 1, and we will not use process
> >>> binding. For single-stream Netperf, we do use process binding and bind
> >>> it to the same core (i.e. 0), and we see ~22 Gb/s. For multiple
> >>> concurrent Netperf runs, we do not use process binding, but we still
> >>> see ~22 Gb/s.
> >>
> >> Scott, this is very interesting! Does setting the interrupt affinity
> >> make the biggest difference, then, when you have concurrent netperf
> >> processes going? For some reason I thought that setting interrupt
> >> affinity wasn't even guaranteed in Linux any more, but that is just a
> >> half-remembered recollection from a year or two ago.
> >
> > We are using RHEL6 with a 3.5.1 kernel. I tested single-stream Netperf
> > with and without affinity:
> >
> >   Default (irqbalance running)    12.8 Gb/s
> >   IRQ balance off                 13.0 Gb/s
> >   Set IRQ affinity to socket 0    17.3 Gb/s   # using the Mellanox script
> >
> > When I set the affinity to cores 0-1 _and_ I bind Netperf to core 0,
> > I get ~22 Gb/s for a single stream.
> >
> >>> We used all of the Mellanox tuning recommendations for IPoIB available
> >>> in their tuning PDF:
> >>>
> >>> http://www.mellanox.com/related-docs/prod_software/Performance_Tuning_Guide_for_Mellanox_Network_Adapters.pdf
> >>>
> >>> We looked at their interrupt affinity setting scripts and then wrote
> >>> our own.
> >>>
> >>> Our testing is with IPoIB in "connected" mode, not "datagram" mode.
> >>> Connected mode is less scalable, but currently I only get ~3 Gb/s with
> >>> datagram mode. Mellanox claims that we should get identical performance
> >>> with both modes and we are looking into it.
> >>>
> >>> We are getting a new test cluster with FDR HCAs and I will look into
> >>> those as well.
> >>
> >> Nice! At some point I'll probably try to justify getting some FDR cards
> >> in house. I'd definitely like to hear how FDR ends up working for you.
> >
> > I'll post the numbers when I get access after they are set up.
> >
> > Scott
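As a side note on the process-binding half of those numbers: binding
Netperf to core 0 conceptually boils down to a sched_setaffinity() call,
which is what a wrapper such as taskset performs before launching the
benchmark, while the IRQ half is usually handled separately by writing a
CPU mask into /proc/irq/<N>/smp_affinity (the part the Mellanox scripts
automate). The sketch below is only an illustration of that mechanism
under those assumptions, not the actual script used in the tests above:

/* Sketch: pin the calling process to core 0, roughly what
 * "taskset -c 0 netperf ..." arranges before starting the benchmark.
 * Core 0 is taken from the single-stream test described above. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(0, &set);                       /* allow only core 0 */

    /* pid 0 means "the calling process" */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return EXIT_FAILURE;
    }

    /* The scheduler now keeps this process on core 0; a real wrapper
     * would execvp() the benchmark binary here so it inherits the mask. */
    printf("pid %d pinned to core 0\n", getpid());
    return EXIT_SUCCESS;
}

Built with e.g. "cc -o pin0 pin0.c" (a hypothetical file name), a wrapper
like this would exec netperf after setting the mask so the benchmark
inherits the core-0 binding.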
> If you are running Ceph purely in userspace, you could try using
> rsockets. rsockets is a pure userspace implementation of sockets over
> RDMA. It has much, much lower latency and close to native throughput.
> My guess is that rsockets will probably work perfectly and should give
> you 95% of the theoretical maximum performance.
>
> I would like to see a somewhat native implementation of RDMA in Ceph
> one day. I was doing some preliminary work on it 1.5 years ago, when
> Ceph was first gaining traction, but we didn't end up putting our focus
> on Ceph and as such I never got anywhere with it. In theory one only
> needs to use RDMA for the fast path to gain a lot of benefit. This can
> be done even in the RBD kernel module with the RDMA-CM, which will
> interact nicely across kernelspace and userspace (they actually share
> the same API, thankfully).
>
> Joseph.
>
> --
> CTO | Orion Virtualisation Solutions | www.orionvm.com.au
> Phone: 1300 56 99 52 | Mobile: 0428 754 846
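For reference on the rsockets source question above: as far as I know,
rsockets is not a standalone project but ships as part of Sean Hefty's
librdmacm, which provides the <rdma/rsocket.h> header (and, if memory
serves, a preload library for redirecting unmodified binaries). The
application-facing calls mirror the BSD socket API with an "r" prefix.
Below is a minimal client sketch under that assumption; the port number
and IPoIB address are placeholders, not values from this thread:

/* Minimal rsockets client sketch: the same flow as a TCP client, with
 * each socket call replaced by its r-prefixed counterpart from
 * <rdma/rsocket.h>. Address and port are placeholders. */
#include <rdma/rsocket.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    const char msg[] = "hello over rsockets";
    struct sockaddr_in addr;
    int fd;

    fd = rsocket(AF_INET, SOCK_STREAM, 0);             /* instead of socket() */
    if (fd < 0) {
        perror("rsocket");
        return 1;
    }

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(7471);                        /* placeholder port */
    inet_pton(AF_INET, "192.168.0.2", &addr.sin_addr);  /* placeholder IPoIB address */

    if (rconnect(fd, (struct sockaddr *)&addr, sizeof(addr)) != 0) {  /* instead of connect() */
        perror("rconnect");
        rclose(fd);
        return 1;
    }

    if (rsend(fd, msg, sizeof(msg), 0) < 0)             /* instead of send(); data moves over RDMA */
        perror("rsend");

    rclose(fd);                                         /* instead of close() */
    return 0;
}

Linking is against librdmacm (e.g. "cc rs_client.c -lrdmacm"); a server
would use rbind()/rlisten()/raccept() in the same drop-in fashion.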