On 16:30 Mon 21 Oct, Doug Ledford wrote:
> On Mon, 2019-09-16 at 17:42 +0800, Liu, Changcheng wrote:
> > Hi all,
> > I'm working on using rdma to improve message transaction
>
> qperf is nice because it will do both the tcp and rdma testing, so the
> same set of options will make it behave the same way under both tests.

@Doug Ledford: I'll check how to use it to compare RDMA & TCP.

> I think you are mis-reading the instructions on ib_send_bw. First of
> all, IB RC queue pairs are, when used in send/recv mode, message
> passing devices, not stream devices. When you specified the -s
> parameter of

@Doug Ledford: What's the difference between a "message passing device" and a
"stream device"?

> 1GB, you were telling it to use messages of 1GB in size, not to pass a
> total of 1GB of messages. And the default number of messages to pass is
> 1,000 iterations (the -n or --iters options), so you were actually

@Doug Ledford: Thanks for your information. It helps me a lot.

> testing a 1,000GB transfer. You would be better off to use a smaller
> message size and then set the iters to the proper value. This is what I
> got with 1000 iters and 1GB message size:
>
> #bytes      #iterations  BW peak[MB/sec]  BW average[MB/sec]  MsgRate[Mpps]
> 1073741824  1000         6159.64          6159.46             0.000006
> ---------------------------------------------------------------------------------------
>
> real    3m3.101s
> user    3m2.430s
> sys     0m0.450s
>
> I tried dropping it to 1 iteration to make a comparison, but that's not
> even allowed by ib_send_bw, it wants a minimum of 5 iterations. So I
> did 8 iterations at 1/8th GB in size and this is what I got:
>
> #bytes      #iterations  BW peak[MB/sec]  BW average[MB/sec]  MsgRate[Mpps]
> 134217728   8            6157.54          6157.54             0.000048
> ---------------------------------------------------------------------------------------
>
> real    0m2.506s
> user    0m2.411s
> sys     0m0.059s
>
> When I adjust that down to 1MB and 1024 iters, I get:
>
> #bytes      #iterations  BW peak[MB/sec]  BW average[MB/sec]  MsgRate[Mpps]
> 1048576     1024         6157.74          6157.74             0.006158
> ---------------------------------------------------------------------------------------
>
> real    0m0.427s
> user    0m0.408s
> sys     0m0.002s
>
> The large difference in time between these last two tests, given that
> the measured bandwidth is so close to identical, explains the problem
> you are seeing below.
>
> The ib_send_bw test is a simple test. It sets up a buffer by
> registering its memory, then just slams that buffer over the wire. With
> a 128MB buffer, you pay a heavy memory registration penalty. That's
> factored into the 2s time difference between the two runs. When you use
> a 1GB buffer, the delay is noticeable to the human eye. There is a very
> visible pause as the server and client start their memory registrations.

@Doug Ledford: Do you mean that every RDMA SGE (scatter/gather element) uses a
separate MR (memory region)? If all the SGEs shared a single pre-allocated 1GB
MR, the two tests shouldn't differ so much in elapsed time. (The short
registration-timing sketch at the end of this mail is meant to make that cost
concrete.)

> > In Ceph, the result shows that RDMA performance (RC transport type,
> > SEND operation) is worse than, or not much better than, the TCP
> > implementation.
> > Test A:
> >   1 client thread sends 20GB of data to 1 server thread (marked as 1C:1S)
> >   Result:
> >   1) implementation based on RDMA
> >      Takes 171.921294s to finish sending 20GB of data.
> >   2) implementation based on TCP
> >      Takes 62.444163s to finish sending 20GB of data.
> >
> > Test B:
> >   16 client threads send 16x20GB of data to 1 server thread (marked as 16C:1S)
> >   Result:
> >   1) implementation based on RDMA
> >      Takes 261.285612s to finish sending 16x20GB of data.
> >   2) implementation based on TCP
> >      Takes 318.949126s to finish sending 16x20GB of data.
>
> I suspect your performance problems here are memory registrations. As
> noted by Chuck Lever in some of his recent postings, memory
> registrations can end up killing performance for small messages, and as
> the tests I've shown here indicate, they're also a killer for huge
> memory blocks if they are repeatedly registered/deregistered. TCP has no

@Doug Ledford: I think we could pre-register a 1GB MR and have all the SGEs
share that same MR; that should mitigate the register/deregister penalty.

> memory registration overhead, so in the single client case, it is
> outperforming the RDMA case. But in the parallel case with lots of
> clients, the memory registration overhead is spread out among many
> clients, so we are able to perform better overall.

@Doug Ledford: In the Ceph implementation, all the threads in the same process
already share one pre-registered 1GB MR. That MR is divided into lots of
chunks which are used as SGEs (roughly what the pooled-MR sketch at the end of
this mail does). Given that scheme, how do we explain the results of Test A
vs. Test B?

> In a nutshell, it sounds like the Ceph transfer engine over RDMA is not
> optimized at all, and is hitting problems with memory registration
> overhead.

Ceph/RDMA does not seem to be widely used yet, and parts of the implementation
need to be optimized. I'm going to work on that in the future.

> --
> Doug Ledford <dledford@xxxxxxxxxx>
>     GPG KeyID: B826A3330E572FDD
>     Fingerprint = AE6B 1BDA 122B 23B4 265B 1274 B826 A333 0E57 2FDD
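P.S. To make the registration penalty discussed above concrete, here is a
minimal sketch of my own (not taken from ib_send_bw) that times a single
ibv_reg_mr()/ibv_dereg_mr() cycle for a given buffer size. It assumes "pd" is
a protection domain allocated elsewhere, and error handling is kept to a
minimum:

/*
 * Time one register/deregister cycle for a buffer of "len" bytes.
 * Registration pins and maps every page of the buffer, so the cost
 * grows with the buffer size.
 */
#include <infiniband/verbs.h>
#include <stdlib.h>
#include <time.h>

static double time_reg_cycle(struct ibv_pd *pd, size_t len)
{
	void *buf = malloc(len);
	struct timespec t0, t1;
	struct ibv_mr *mr;

	if (!buf)
		return -1.0;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	mr = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);
	if (mr)
		ibv_dereg_mr(mr);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	free(buf);
	return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

Comparing time_reg_cycle(pd, 1UL << 30) against time_reg_cycle(pd, 1UL << 20)
should show the size dependence directly: the 1GB registration has to pin
roughly 262144 4KB pages, the 1MB one only 256.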
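And here is roughly how I understand the pooled-MR scheme (a sketch of my own,
not copied from the Ceph source): register one large buffer once at startup,
carve fixed-size chunks out of it, and let every SGE reuse the single MR's
lkey so the data path never calls ibv_reg_mr()/ibv_dereg_mr(). "pd", "qp" and
the chunk bookkeeping are assumed to be set up elsewhere:

#include <infiniband/verbs.h>
#include <stdint.h>
#include <stdlib.h>

#define POOL_SIZE  (1UL << 30)	/* 1GB pool, registered once at startup */
#define CHUNK_SIZE (1UL << 20)	/* 1MB chunk handed out per message */

static void *pool;
static struct ibv_mr *pool_mr;

/* Register the whole pool a single time; every later send reuses pool_mr. */
static int pool_init(struct ibv_pd *pd)
{
	pool = aligned_alloc(4096, POOL_SIZE);
	if (!pool)
		return -1;
	pool_mr = ibv_reg_mr(pd, pool, POOL_SIZE, IBV_ACCESS_LOCAL_WRITE);
	return pool_mr ? 0 : -1;
}

/* Post chunk "idx" of the pool as one SEND; the SGE only borrows the lkey
 * of the pre-registered MR, so no per-message registration happens. */
static int post_chunk(struct ibv_qp *qp, unsigned int idx)
{
	struct ibv_sge sge = {
		.addr   = (uintptr_t)pool + (uint64_t)idx * CHUNK_SIZE,
		.length = CHUNK_SIZE,
		.lkey   = pool_mr->lkey,
	};
	struct ibv_send_wr wr = {
		.wr_id      = idx,
		.sg_list    = &sge,
		.num_sge    = 1,
		.opcode     = IBV_WR_SEND,
		.send_flags = IBV_SEND_SIGNALED,
	};
	struct ibv_send_wr *bad = NULL;

	return ibv_post_send(qp, &wr, &bad);
}

With this scheme the only registration cost is the one-time pool_init(), which
is why I'm still puzzled that Test A is so much slower than the TCP path.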