Re: SSD journal suggestion

On 9 November 2012 02:00, Atchley, Scott <atchleyes@xxxxxxxx> wrote:
> On Nov 8, 2012, at 9:39 AM, Mark Nelson <mark.nelson@xxxxxxxxxxx> wrote:
>
>> On 11/08/2012 07:55 AM, Atchley, Scott wrote:
>>> On Nov 8, 2012, at 3:22 AM, Gandalf Corvotempesta <gandalf.corvotempesta@xxxxxxxxx> wrote:
>>>
>>>> 2012/11/8 Mark Nelson <mark.nelson@xxxxxxxxxxx>:
>>>>> I haven't done much with IPoIB (just RDMA), but my understanding is that it
>>>>> tends to top out at like 15Gb/s.  Some others on this mailing list can
>>>>> probably speak more authoritatively.  Even with RDMA you are going to top
>>>>> out at around 3.1-3.2GB/s.
>>>>
>>>> 15 Gb/s is still faster than 10 GbE.
>>>> But this speed limit seems to be kernel-related, so shouldn't it be
>>>> the same in a 10 GbE environment as well?
>>>
>>> We have a test cluster with Mellanox QDR HCAs (i.e. NICs). When using Verbs (the native IB API), I see ~27 Gb/s between two hosts. When running Sockets over these devices using IPoIB, I see 13-22 Gb/s depending on whether I use interrupt affinity and process binding.
>>>
>>> For our Ceph testing, we will set the affinity of two of the mlx4 interrupt handlers to cores 0 and 1 and we will not use process binding. For single-stream Netperf, we do use process binding, bind it to the same core (i.e. 0), and we see ~22 Gb/s. For multiple concurrent Netperf runs, we do not use process binding but we still see ~22 Gb/s.
>>
>> Scott, this is very interesting!  Does setting the interrupt affinity
>> make the biggest difference, then, when you have concurrent netperf
>> processes going?  For some reason I thought that setting interrupt
>> affinity wasn't even guaranteed in Linux any more, but that is just a
>> half-remembered recollection from a year or two ago.
>
> We are using RHEL6 with a 3.5.1 kernel. I tested single stream Netperf with and without affinity:
>
> Default (irqbalance running)   12.8 Gb/s
> IRQ balance off                13.0 Gb/s
> Set IRQ affinity to socket 0   17.3 Gb/s   # using the Mellanox script
>
> When I set the affinity to cores 0-1 _and_ I bind Netperf to core 0, I get ~22 Gb/s for a single stream.
>
>>> We used all of the Mellanox tuning recommendations for IPoIB available in their tuning pdf:
>>>
>>> http://www.mellanox.com/related-docs/prod_software/Performance_Tuning_Guide_for_Mellanox_Network_Adapters.pdf
>>>
>>> We looked at their interrupt affinity setting scripts and then wrote our own.
>>>
>>> Our testing is with IPoIB in "connected" mode, not "datagram" mode. Connected mode is less scalable, but currently I only get ~3 Gb/s with datagram mode. Mellanox claims that we should get identical performance with both modes and we are looking into it.
>>>
>>> We are getting a new test cluster with FDR HCAs and I will look into those as well.
>>
>> Nice!  At some point I'll probably try to justify getting some FDR cards
>> in house.  I'd definitely like to hear how FDR ends up working for you.
>
> I'll post the numbers when I get access after they are set up.
>
> Scott
>
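
For anyone wanting to reproduce the pinning Scott describes, the mechanics
are just the standard Linux interfaces rather than anything Mellanox-specific:
write a CPU mask to /proc/irq/<N>/smp_affinity and bind the benchmark process
with sched_setaffinity(). A minimal sketch, assuming the mlx4 IRQ number has
already been looked up in /proc/interrupts (the value below is hypothetical):

/* Pin a (hypothetical) IRQ to CPU 0 and bind the current process to CPU 0.
 * Run as root; error handling kept minimal. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    const int irq_num = 42;   /* hypothetical: take the real mlx4 IRQ from /proc/interrupts */
    char path[64];
    snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq_num);

    FILE *f = fopen(path, "w");
    if (!f) { perror("fopen"); return 1; }
    fprintf(f, "1\n");        /* bitmask 0x1 == CPU 0 */
    fclose(f);

    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);         /* bind this process to core 0 */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    /* ... launch netperf or the benchmark from here ... */
    return 0;
}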

If you are running Ceph purely in userspace, you could try rsockets.
rsockets is a pure userspace implementation of sockets over RDMA. It
has much lower latency and close to native throughput. My guess is
that rsockets would work largely unmodified and should give you around
95% of the theoretical maximum performance.
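
Something like this is all it takes on the client side. This is only a rough
sketch against librdmacm's rsocket API (host and port below are placeholders),
and if I remember correctly librdmacm also ships a preload library so that
unmodified binaries can be pointed at rsockets via LD_PRELOAD:

/* rsockets client sketch: the r*() calls mirror the BSD socket calls.
 * Build with -lrdmacm. Host/port are placeholders. */
#include <rdma/rsocket.h>
#include <sys/socket.h>
#include <netdb.h>
#include <stdio.h>

int main(void)
{
    struct addrinfo hints = { .ai_socktype = SOCK_STREAM }, *res;
    if (getaddrinfo("192.168.0.10", "7471", &hints, &res))   /* placeholder peer */
        return 1;

    int fd = rsocket(res->ai_family, res->ai_socktype, res->ai_protocol);
    if (fd < 0) { perror("rsocket"); return 1; }

    if (rconnect(fd, res->ai_addr, res->ai_addrlen) < 0) {
        perror("rconnect"); return 1;
    }

    const char msg[] = "ping";
    rsend(fd, msg, sizeof(msg), 0);     /* same semantics as send(2) */

    char buf[64];
    rrecv(fd, buf, sizeof(buf), 0);     /* same semantics as recv(2) */

    rclose(fd);
    freeaddrinfo(res);
    return 0;
}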

I would like to see a native RDMA implementation in Ceph one day.
I was doing some preliminary work on it about 1.5 years ago, when Ceph
was first gaining traction, but we didn't end up focusing on Ceph and
I never got anywhere with it.
In theory, you only need to use RDMA on the fast path to gain a lot of
the benefit. This could even be done in the RBD kernel module with the
RDMA-CM, which interacts nicely across kernelspace and userspace
(thankfully, they share the same API).
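
For reference, the userspace half of that looks roughly like the following.
This is only a sketch of the librdmacm calls (the peer address is a
placeholder, and the event loop and queue-pair setup that a real
implementation needs are elided into comments):

/* RDMA-CM connection setup sketch (build with -lrdmacm -libverbs). */
#include <rdma/rdma_cma.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <stdio.h>

int main(void)
{
    struct rdma_event_channel *ec = rdma_create_event_channel();
    struct rdma_cm_id *id;
    if (rdma_create_id(ec, &id, NULL, RDMA_PS_TCP)) {
        perror("rdma_create_id"); return 1;
    }

    struct sockaddr_in dst = { .sin_family = AF_INET, .sin_port = htons(7471) };
    inet_pton(AF_INET, "192.168.0.10", &dst.sin_addr);    /* placeholder peer */

    /* Each step below generates an event on 'ec'; a real program waits for it
     * with rdma_get_cm_event() before moving on. */
    rdma_resolve_addr(id, NULL, (struct sockaddr *)&dst, 2000);
    /* ... wait for RDMA_CM_EVENT_ADDR_RESOLVED ... */
    rdma_resolve_route(id, 2000);
    /* ... wait for RDMA_CM_EVENT_ROUTE_RESOLVED, then allocate PD/CQ and
     *     create the queue pair with rdma_create_qp() ... */

    struct rdma_conn_param param = { 0 };
    rdma_connect(id, &param);
    /* ... wait for RDMA_CM_EVENT_ESTABLISHED, then post verbs work requests
     *     (or RDMA reads/writes) on the queue pair ... */

    rdma_destroy_id(id);
    rdma_destroy_event_channel(ec);
    return 0;
}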

Joseph.

-- 
CTO | Orion Virtualisation Solutions | www.orionvm.com.au
Phone: 1300 56 99 52 | Mobile: 0428 754 846