On Thu, May 3, 2018 at 3:51 PM, Alex Rosenbaum <rosenbaumalex@xxxxxxxxx> wrote:
> Rohit,
>
> Do you have additional benchmark results for each patch stage of the
> locking and bfreg changes you submitted?

I don't have incremental numbers yet. I will have the kernel with the
dynamic UAR allocation support only next week. I will report numbers then.

> How many CQs did you use? 1 per thread, or 1 per QP?

For the experiment in this thread, I used 1 "active" QP per thread. By
"active" I mean the QP the thread was using in method (3). In methods (1)
and (2), all QPs are "active" QPs. I used 1 CQ per thread, which
translates to 1 CQ per "active" QP.

-Rohit

> Alex
>
>
> On Wed, Apr 18, 2018 at 4:11 PM, Rohit Zambre <rzambre@xxxxxxx> wrote:
>> Hi,
>>
>> I am measuring 2-byte RDMA-write message rates with 16 threads from a
>> sender to a receiver. I am posting 1 message per ibv_post_send with
>> inlining, so mlx5_bf_copy() is used. I am measuring message rate in
>> two configurations: (1) 16 Contexts with 1 QP per Context; (2) 8
>> Contexts with 2 QPs per Context. The message rate in (2) is 15% lower
>> than that in (1), and I am trying to understand why this is the case.
>>
>> In (2), I learned that I can eliminate the 15% drop by creating 4 QPs
>> per Context but using only QP_0 and QP_2 within each Context. Yes,
>> this is hacky, but the purpose is to understand the behavior. This is
>> method (3).
>>
>> The difference between (2) and (3) is that the QPs being used in (3)
>> are on different UAR pages, as in (1). In (2), the QPs are on the
>> same UAR page.
>>
>> The number of sfence barriers is the same in all cases. In (2), the
>> threads are calling sfence on memory that lies on the same UAR page,
>> while in (1)/(3) they are calling sfence on memory that lies on
>> different UAR pages. mlx5_bf_copy() writes 64 bytes, the size of a
>> WC buffer.
>>
>> One theory to explain the 15% drop is that there is only 1 WC buffer
>> per UAR page: since WC buffers maintain state like caches do, if the
>> 1 WC buffer is being flushed, it cannot be modified by another thread
>> writing to the same UAR page. So each thread's mlx5_bf_copy() in (2)
>> is serialized by the sfence flush. But my understanding is that
>> multiple WC buffers exist per core, and I am not sure which system
>> layer maps WC buffers to pages. Could someone confirm the number of
>> WC buffers per UAR page, or point me to where I should be looking to
>> find out?
>>
>> Thanks,
>> -Rohit Zambre
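
P.S. For anyone who wants to reproduce the setup in the quoted mail, below
is a minimal sketch of the per-Context resource creation in methods
(1)-(3). It is illustrative, not the actual benchmark code: the CQ depth,
QP capabilities, and the setup_ctx()/thread_res names are my own
placeholders, and error handling is omitted.

    #include <infiniband/verbs.h>

    struct thread_res {
        struct ibv_context *ctx;
        struct ibv_cq      *cq;
        struct ibv_qp      *qp;   /* the one "active" QP this thread drives */
    };

    /* method (1): qps_per_ctx = 1, active QP index  {0}   -> 16 Contexts
     * method (2): qps_per_ctx = 2, active QP indices {0,1} ->  8 Contexts
     * method (3): qps_per_ctx = 4, active QP indices {0,2} ->  8 Contexts;
     *             QP_1 and QP_3 are created only so that QP_0 and QP_2
     *             land on different UAR pages. */
    static void setup_ctx(struct ibv_device *dev, int qps_per_ctx,
                          const int *active_idx, int threads_per_ctx,
                          struct thread_res *res)
    {
        struct ibv_context *ctx = ibv_open_device(dev);
        struct ibv_pd *pd = ibv_alloc_pd(ctx);
        struct ibv_qp *qp[4];

        for (int i = 0; i < qps_per_ctx; i++) {
            /* 1 CQ per QP; only the CQs of "active" QPs get polled */
            struct ibv_cq *cq = ibv_create_cq(ctx, 256, NULL, NULL, 0);
            struct ibv_qp_init_attr attr = {
                .send_cq = cq,
                .recv_cq = cq,
                .cap = {
                    .max_send_wr     = 256,
                    .max_recv_wr     = 1,
                    .max_send_sge    = 1,
                    .max_recv_sge    = 1,
                    .max_inline_data = 2,   /* 2-byte inlined payload */
                },
                .qp_type = IBV_QPT_RC,
            };
            qp[i] = ibv_create_qp(pd, &attr);
        }

        for (int t = 0; t < threads_per_ctx; t++) {
            res[t].ctx = ctx;
            res[t].qp  = qp[active_idx[t]];
            res[t].cq  = res[t].qp->send_cq;
        }
    }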
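
And to make the serialization theory concrete, here is a simplified view
of what the inline post boils down to on the provider side. This is
paraphrased from my reading of the mlx5 provider, not the literal
rdma-core source; on x86 the WC flush compiles down to an sfence.

    #include <stdint.h>

    /* Simplified sketch of the BlueFlame doorbell on x86 (illustrative;
     * the real path is mlx5_bf_copy() plus the WC flush in the provider's
     * post-send code). Two QPs that share a UAR page have their BF
     * registers on the same WC-mapped page. */
    static void bf_doorbell(volatile uint64_t *bf_reg, /* WC-mapped BF register */
                            const uint64_t *wqe)       /* 64-byte inline WQE */
    {
        /* Copy the 64-byte WQE -- one WC buffer's worth -- into the
         * BlueFlame register. */
        for (int i = 0; i < 8; i++)
            bf_reg[i] = wqe[i];

        /* Flush the WC buffer so the device sees all 64 bytes at once.
         * The theory above: if only 1 WC buffer can be in flight per UAR
         * page, a concurrent bf_doorbell() on the other BF register of
         * the same page stalls until this flush completes, serializing
         * the two threads in method (2). */
        __asm__ __volatile__("sfence" ::: "memory");
    }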