I forgot to mention that in both cases the IB_SEND_INLINE (or IBV_SEND_INLINE) flag is cleared.

Below are the results from perf top; they closely match the results from perf record. The function that consumes almost 50% of the cycles is my own, test_send. Essentially, all it does is call ib_post_send with already predefined ib_send_wr and ib_sge structs. I still have no idea why mlx4_ib_post_send consumes such a large share of the cycles.

 48.93%  /proc/kcore  0x7fffa078774c  k [k] test_send
 35.20%  /proc/kcore  0x7fffa0317a99  k [k] mlx4_ib_post_send
  6.39%  /proc/kcore  0x7fff8150cd3f  k [k] _raw_spin_lock_irqsave
  3.85%  /proc/kcore  0x7fffa0313e0f  k [k] stamp_send_wqe
  2.03%  /proc/kcore  0x7fff8150c9de  k [k] _raw_spin_unlock_irqrestore
  1.80%  /proc/kcore  0x7fffa05fc7a9  k [k] client_send
  1.13%  /proc/kcore  0x7fff81086a75  k [k] kthread_should_stop
  0.18%  /proc/kcore  0x7fffa03080d8  k [k] mlx4_ib_poll_cq
  0.17%  /proc/kcore  0x7fffa0787b80  k [k] process_wc
  0.14%  /proc/kcore  0x7fffa0307403  k [k] get_sw_cqe
  0.02%  /proc/kcore  0x7fff8150dcd0  k [k] irq_entries_start
  0.02%  /proc/kcore  0x7fffa017bf37  k [k] eq_set_ci.isra.14
  0.02%  /proc/kcore  0x7fffa017c4d6  k [k] mlx4_eq_int
  0.01%  /proc/kcore  0x7fff8150cd7e  k [k] _raw_spin_lock
  0.01%  /proc/kcore  0x7fffa017b106  k [k] mlx4_cq_completion
  0.01%  /proc/kcore  0x7fff8150e1f0  k [k] apic_timer_interrupt
  0.01%  /proc/kcore  0x7fffa0308a5d  k [k] mlx4_ib_arm_cq
  0.01%  /proc/kcore  0x7fff81051a86  k [k] native_read_msr_safe
  0.01%  /proc/kcore  0x7fffa017d07b  k [k] mlx4_msi_x_interrupt
  0.01%  /proc/kcore  0x7fff81051aa6  k [k] native_write_msr_safe
  0.01%  /proc/kcore  0x7fff8138c83e  k [k] add_interrupt_randomness
  0.00%  /proc/kcore  0x7fff8101c312  k [k] native_read_tsc
  0.00%  /proc/kcore  0x7fff810bc6d0  k [k] handle_edge_irq
  0.00%  /proc/kcore  0x7fff8150de92  k [k] common_interrupt
  0.00%  /proc/kcore  0x7fff8106ba3c  k [k] raise_softirq
  0.00%  /proc/kcore  0x7fff812accd0  k [k] __radix_tree_lookup
  0.00%  /proc/kcore  0x7fff81072ebc  k [k] run_timer_softirq
  0.00%  /proc/kcore  0x7fff81094721  k [k] idle_cpu
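
For reference, the hot path boils down to something like the sketch below (simplified, not my exact code; buf_dma and lkey are assumed to come from ib_dma_map_single() and a registered MR set up elsewhere, and qp is an already-connected RC QP). In the real code the wr/sge structs are prepared once up front, so the loop is essentially just the ib_post_send() call:

#include <rdma/ib_verbs.h>

/*
 * Simplified sketch of posting one 2-byte send on a connected RC QP.
 * buf_dma is a DMA address from ib_dma_map_single(); lkey comes from a
 * registered MR (or the device's local DMA lkey).
 */
static int post_tiny_send(struct ib_qp *qp, u64 buf_dma, u32 lkey, u64 wr_id)
{
	struct ib_sge sge = {
		.addr   = buf_dma,
		.length = 2,			/* 2-byte payload */
		.lkey   = lkey,
	};
	struct ib_send_wr wr = {
		.wr_id      = wr_id,
		.sg_list    = &sge,
		.num_sge    = 1,
		.opcode     = IB_WR_SEND,
		.send_flags = IB_SEND_SIGNALED,	/* IB_SEND_INLINE deliberately cleared */
	};
	struct ib_send_wr *bad_wr;

	/* Dispatches to the provider, i.e. mlx4_ib_post_send() here. */
	return ib_post_send(qp, &wr, &bad_wr);
}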
On Fri, Oct 24, 2014 at 5:52 PM, Steve Wise <swise@xxxxxxxxxxxxxxxxxxxxx> wrote:
> On 10/24/2014 6:30 AM, Or Gerlitz wrote:
>>
>> On Fri, Oct 24, 2014 at 3:39 AM, Eli Cohen <eli@xxxxxxxxxxxxxxxxxx> wrote:
>>>
>>> On Thu, Oct 23, 2014 at 11:45:05AM -0700, Roland Dreier wrote:
>>>>
>>>> On Thu, Oct 23, 2014 at 10:21 AM, Evgenii Smirnov
>>>> <evgenii.smirnov@xxxxxxxxxxxxxxxx> wrote:
>>>>>
>>>>> I am trying to achieve high packet-per-second throughput with 2-byte
>>>>> messages over InfiniBand from the kernel using the IB_SEND verb. The most
>>>>> I can get so far is 3.5 Mpps. However, the ib_send_bw utility from the
>>>>> perftest package is able to send 2-byte packets at a rate of 9 Mpps.
>>>>> After some profiling I found that execution of the ib_post_send function
>>>>> in the kernel takes about 213 ns on average, while the user-space function
>>>>> ibv_post_send takes only about 57 ns.
>>>>> As I understand it, these functions do almost the same operations. The work
>>>>> request fields and queue pair parameters are also the same. Why do
>>>>> they have such a big difference in execution times?
>>>>
>>>>
>>>> Interesting. I guess it would be useful to look at perf top and/or
>>>> get a perf report with "perf report -a -g" when running your high-PPS
>>>> workload, and see where the time is wasted.
>>>>
>>> I assume ib_send_bw uses inline with blueflame, so that may be part of
>>> the explanation for the differences you see.
>>
>> I think it should be the other way around... when we use inline we
>> consume more CPU cycles, and here we see a notable difference (213 ns
>> kernel vs. 57 ns user) in favor of libmlx4.
>>
>
> Inline may consume more CPU cycles but should reduce latency, because the IO
> is completed with only 1 DMA transaction, the WR fetch, which includes the
> data. Non-inline requires 2 DMA transactions, the WR fetch and the data
> fetch.
>
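
For comparison, below is a rough sketch of how the inline path looks from user space with libibverbs (an illustration only, not ib_send_bw's actual code; the QP is assumed to have been created with cap.max_inline_data >= len). The single WQE fetch that already carries the payload is exactly the 1-DMA-transaction case described above:

#include <stdint.h>
#include <infiniband/verbs.h>

/*
 * Illustrative only: post a tiny send with the payload inlined into the
 * WQE, so the HCA fetches WR and data in one DMA read.
 * Assumes qp_init_attr.cap.max_inline_data >= len at QP creation time.
 */
static int post_inline_send(struct ibv_qp *qp, void *buf, uint32_t len,
			    uint64_t wr_id)
{
	struct ibv_sge sge = {
		.addr   = (uintptr_t)buf,
		.length = len,
		.lkey   = 0,	/* lkey is not used when the data is inlined */
	};
	struct ibv_send_wr wr = {
		.wr_id      = wr_id,
		.sg_list    = &sge,
		.num_sge    = 1,
		.opcode     = IBV_WR_SEND,
		.send_flags = IBV_SEND_SIGNALED | IBV_SEND_INLINE,
	};
	struct ibv_send_wr *bad_wr;

	return ibv_post_send(qp, &wr, &bad_wr);
}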