Re: SSD journal suggestion / rsockets

Joseph,

I've downloaded and read the presentation by Sean Hefty (Intel Corporation)
about rsockets, which sounds very promising to me.
Could you please tell me how to get access to the rsockets source?

Thanks,
-Dieter


On Thu, Nov 08, 2012 at 09:12:45PM +0100, Joseph Glanville wrote:
> On 9 November 2012 02:00, Atchley, Scott <atchleyes@xxxxxxxx> wrote:
> > On Nov 8, 2012, at 9:39 AM, Mark Nelson <mark.nelson@xxxxxxxxxxx> wrote:
> >
> >> On 11/08/2012 07:55 AM, Atchley, Scott wrote:
> >>> On Nov 8, 2012, at 3:22 AM, Gandalf Corvotempesta <gandalf.corvotempesta@xxxxxxxxx> wrote:
> >>>
> >>>> 2012/11/8 Mark Nelson <mark.nelson@xxxxxxxxxxx>:
> >>>>> I haven't done much with IPoIB (just RDMA), but my understanding is that it
> >>>>> tends to top out at like 15Gb/s.  Some others on this mailing list can
> >>>>> probably speak more authoritatively.  Even with RDMA you are going to top
> >>>>> out at around 3.1-3.2GB/s.
> >>>>
> >>>> 15 Gb/s is still faster than 10 GbE.
> >>>> But this speed limit seems to be kernel-related, so shouldn't it be the
> >>>> same even in a 10 GbE environment?
> >>>
> >>> We have a test cluster with Mellanox QDR HCAs (i.e. NICs). When using Verbs (the native IB API), I see ~27 Gb/s between two hosts. When running Sockets over these devices using IPoIB, I see 13-22 Gb/s depending on whether I use interrupt affinity and process binding.
> >>>
> >>> For our Ceph testing, we will set the affinity of two of the mlx4 interrupt handlers to cores 0 and 1, and we will not use process binding. For single-stream Netperf, we do use process binding, binding it to the same core (i.e. 0), and we see ~22 Gb/s. For multiple concurrent Netperf runs, we do not use process binding, but we still see ~22 Gb/s.
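For reference, a minimal sketch of that kind of pinning. The IRQ numbers (52/53 here) and the core choices are placeholders; the real mlx4 vectors have to be looked up in /proc/interrupts on the host in question, and the program needs root.

    /* pin_irqs_and_self.c -- sketch of IRQ affinity plus process binding.
     * IRQ numbers 52/53 are placeholders; read them from /proc/interrupts. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>

    static void pin_irq(int irq, unsigned int cpu_mask)
    {
        char path[64];
        FILE *f;

        snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
        f = fopen(path, "w");
        if (!f) { perror(path); exit(1); }
        fprintf(f, "%x\n", cpu_mask);   /* hex CPU bitmask: 1 = core 0, 2 = core 1 */
        fclose(f);
    }

    int main(void)
    {
        cpu_set_t set;

        pin_irq(52, 0x1);               /* first mlx4 completion vector -> core 0 */
        pin_irq(53, 0x2);               /* second vector                -> core 1 */

        CPU_ZERO(&set);                 /* then bind this process to core 0, the  */
        CPU_SET(0, &set);               /* same core that services the first IRQ  */
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return 1;
        }
        return 0;
    }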
> >>
> >> Scott, this is very interesting!  Does setting the interrupt affinity
> >> make the biggest difference when you have concurrent Netperf
> >> processes going?  For some reason I thought that setting interrupt
> >> affinity wasn't even guaranteed in Linux any more, but that is just a
> >> half-remembered recollection from a year or two ago.
> >
> > We are using RHEL6 with a 3.5.1 kernel. I tested single stream Netperf with and without affinity:
> >
> > Default (irqbalance running)   12.8 Gb/s
> > IRQ balance off                13.0 Gb/s
> > Set IRQ affinity to socket 0   17.3 Gb/s   # using the Mellanox script
> >
> > When I set the affinity to cores 0-1 _and_ I bind Netperf to core 0, I get ~22 Gb/s for a single stream.
> >
> >>> We used all of the Mellanox tuning recommendations for IPoIB available in their tuning pdf:
> >>>
> >>> http://www.mellanox.com/related-docs/prod_software/Performance_Tuning_Guide_for_Mellanox_Network_Adapters.pdf
> >>>
> >>> We looked at their interrupt affinity setting scripts and then wrote our own.
> >>>
> >>> Our testing is with IPoIB in "connected" mode, not "datagram" mode. Connected mode is less scalable, but currently I only get ~3 Gb/s with datagram mode. Mellanox claims that we should get identical performance with both modes and we are looking into it.
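As far as I know, the connected/datagram switch is just a per-interface sysfs attribute; a small sketch, assuming the interface is called ib0:

    /* set_ipoib_mode.c -- toggle an IPoIB interface between "connected" and
     * "datagram" mode; "ib0" is an assumed interface name. */
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        const char *mode = (argc > 1) ? argv[1] : "connected";  /* or "datagram" */
        FILE *f = fopen("/sys/class/net/ib0/mode", "w");

        if (!f) { perror("/sys/class/net/ib0/mode"); return 1; }
        fprintf(f, "%s\n", mode);
        fclose(f);
        return 0;
    }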
> >>>
> >>> We are getting a new test cluster with FDR HCAs and I will look into those as well.
> >>
> >> Nice!  At some point I'll probably try to justify getting some FDR cards
> >> in house.  I'd definitely like to hear how FDR ends up working for you.
> >
> > I'll post the numbers when I get access after they are set up.
> >
> > Scott
> >
> 
> If you are running Ceph purely in userspace, you could try using rsockets.
> rsockets is a pure userspace implementation of sockets over RDMA. It
> has much lower latency and close-to-native throughput.
> My guess is that rsockets will probably work perfectly and should give you
> ~95% of the theoretical maximum performance.
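For anyone who wants to experiment before a native integration exists, here is a minimal client sketch against the rsockets API as documented in the librdmacm man pages; the peer address and port are made up, and the sketch assumes a matching rsockets server on the other end. Build with -lrdmacm.

    /* rs_client.c -- minimal rsockets client sketch; a drop-in analogue of a
     * TCP client, but the data path goes over RDMA. */
    #include <rdma/rsocket.h>
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/types.h>
    #include <stdio.h>

    int main(void)
    {
        struct sockaddr_in dst = { .sin_family = AF_INET,
                                   .sin_port   = htons(7471) };   /* example port */
        inet_pton(AF_INET, "192.168.1.10", &dst.sin_addr);        /* example peer */

        int fd = rsocket(AF_INET, SOCK_STREAM, 0);                /* instead of socket() */
        if (fd < 0) { perror("rsocket"); return 1; }

        if (rconnect(fd, (struct sockaddr *)&dst, sizeof(dst)) < 0) {
            perror("rconnect");
            rclose(fd);
            return 1;
        }

        const char msg[] = "ping";
        rsend(fd, msg, sizeof(msg), 0);                           /* instead of send() */

        char buf[64];
        ssize_t n = rrecv(fd, buf, sizeof(buf), 0);               /* instead of recv() */
        printf("got %zd bytes back\n", n);

        rclose(fd);
        return 0;
    }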
> 
> I would like to see a more or less native implementation of RDMA in Ceph one day.
> I was doing some preliminary work on it 1.5 years ago, when Ceph was
> first gaining traction, but we didn't end up putting our focus on Ceph,
> so I never got anywhere with it.
> In theory, one only needs to use RDMA for the fast path to gain a lot of
> benefit. This could be done even in the RBD kernel module with the
> RDMA-CM, which interacts nicely across kernel space and userspace
> (they actually share the same API, thankfully).
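A minimal sketch of that userspace side, using the librdmacm connection-manager calls; the peer address, port, and queue depths are placeholders, and error handling is reduced to the bare minimum. Build with -lrdmacm.

    /* cm_connect.c -- sketch of an RDMA-CM connection setup in userspace. */
    #include <rdma/rdma_cma.h>
    #include <stdio.h>

    int main(void)
    {
        struct rdma_addrinfo hints = { .ai_port_space = RDMA_PS_TCP }, *res;
        struct ibv_qp_init_attr qp_attr = { .cap = { .max_send_wr  = 8,
                                                     .max_recv_wr  = 8,
                                                     .max_send_sge = 1,
                                                     .max_recv_sge = 1 } };
        struct rdma_cm_id *id;

        /* Resolve the peer and create a connected queue pair endpoint. */
        if (rdma_getaddrinfo("192.168.1.10", "7471", &hints, &res)) {
            perror("rdma_getaddrinfo");
            return 1;
        }
        if (rdma_create_ep(&id, res, NULL, &qp_attr)) {
            perror("rdma_create_ep");
            return 1;
        }
        if (rdma_connect(id, NULL)) {   /* CM handshake with default parameters */
            perror("rdma_connect");
            return 1;
        }
        printf("connected; fast-path transfers would now use the verbs QP\n");

        rdma_disconnect(id);
        rdma_destroy_ep(id);
        rdma_freeaddrinfo(res);
        return 0;
    }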
> 
> Joseph.
> 
> -- 
> CTO | Orion Virtualisation Solutions | www.orionvm.com.au
> Phone: 1300 56 99 52 | Mobile: 0428 754 846
