Re: [RFC/PATCH net-next 0/9] net/dim: Support for multiple implementations

On 3/14/2019 1:45 PM, Max Gurtovoy wrote:
>
> On 3/7/2019 3:56 AM, Sagi Grimberg wrote:
>>
>>>> net_dim.h lib exposes an implementation of the DIM algorithm for
>>>> dynamically-tuned interrupt moderation for networking interfaces.
>>>>
>>>> We need the same behavior for any block CQ. The main motivation is
>>>> to benefit from the maximized completion rate and reduced interrupt
>>>> overhead that DIM may provide.
>>>
>>> What is a "block CQ"?
>>
>> There is no such thing... Also, it makes no difference
>> whether a block/file/whatever consumer is using the rdma cq.
>>
>> The naming should really be something like rdma_dim as it accounts
>> for completions and not bytes/packets.
>
> Sagi,
>
> I think that in the future we could use it in nvme, since the NVMe
> spec provides an option to set interrupt coalescing.
>
> This might improve performance for NVMe driver.
>
> We already see some performance bottlenecks (maybe driver ones)
> while developing the NVMe SNAP feature in BlueField (NVMe emulation
> using a SmartNIC).
>
> We're trying to get 2.5-2.7 MIOPS out of the box from a single
> controller, and that's not trivial with today's driver.
>
> So let's take this into consideration when we set the naming.
>
>
I agree that blk is not the most successful name; we were trying to find
something that would work for general storage applications. I think
rdma_dim would work, as it is completion based, but when we later want
to use it for nvme it will probably require code duplication.
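
To make the completion-based accounting concrete, something along these
lines is what I have in mind (just a sketch; the struct and function
names here are placeholders, not an existing API):

#include <linux/ktime.h>
#include <linux/types.h>

/* a DIM sample that counts completions rather than packets/bytes */
struct rdma_dim_sample {
	ktime_t	time;		/* when the sample was taken */
	u32	event_ctr;	/* interrupt events observed */
	u32	comp_ctr;	/* completions processed */
};

/* called from the CQ polling path with the number of CQEs reaped */
static inline void rdma_dim_sample_update(struct rdma_dim_sample *s,
					  u32 comps)
{
	s->time = ktime_get();
	s->event_ctr++;
	s->comp_ctr += comps;
}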

>>
>>
>>> How does net_dim compare to lib/irq_poll?
>>
>> It's orthogonal; it's basically adaptive interrupt moderation for
>> RDMA devices, sitting below the irq_poll code. It configures the
>> device's interrupt moderation based on stats collected by the
>> rdma driver.
>>
>>> Which approach results in the best performance and lowest latency?
>>
>> I guess it depends on the test case. This approach adaptively applies
>> a time or completion-count limit to when the HW should fire an
>> interrupt, based on the observed load.
>>
>> The scheme is to try and detect the load characteristics and come up
>> with moderation parameters that fit. For a high interrupt rate
>> (usually seen with small-size, high queue-depth workloads) it
>> configures the device to aggregate a bit more before firing an
>> interrupt - so fewer interrupts and better efficiency per interrupt
>> (each one finds more completions). For a low interrupt rate (low
>> queue depth) the load is probably low to moderate, and aggregating
>> before firing an interrupt is just added latency for no benefit. So
>> the algorithm tries to transition between a number of pre-defined
>> levels according to the load it samples.
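
For reference, the level-stepping described above boils down to
something like this (the profile values and the step rule here are
simplified for illustration; the real tables and decision logic live
in net_dim.h):

#include <linux/kernel.h>
#include <linux/types.h>

/* one moderation level: fire an interrupt after 'usec' usecs or
 * 'comps' completions, whichever comes first (illustrative values) */
struct dim_profile {
	u16 usec;
	u16 comps;
};

static const struct dim_profile profiles[] = {
	{  1,  1 },	/* lowest latency, most interrupts */
	{  8,  8 },
	{ 16, 16 },
	{ 32, 32 },
	{ 64, 64 },	/* most aggregation, fewest interrupts */
};

/* Step the level once per sample period: if the last move improved
 * the completion rate, keep going the same way; otherwise reverse.
 * This is a simplified version of the net_dim decision logic. */
static int dim_step(int level, bool *up, u64 prev_comps, u64 curr_comps)
{
	if (curr_comps < prev_comps)
		*up = !*up;

	if (*up && level < (int)ARRAY_SIZE(profiles) - 1)
		level++;
	else if (!*up && level > 0)
		level--;

	return level;
}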
>>
>> This has been widely used by the network drivers for the past decade.
>>
>> Now, this algorithm, while trying to adjust itself by learning the
>> load, also adds entropy to the overall system performance and latency.
>> So this is not a trivial trade-off for every workload.
>>
>> I took a stab at this once (came up with something very similar).
>> For large queue-depth workloads I got up to 2x the IOPs, as the
>> algorithm chose aggressive moderation parameters which improved
>> efficiency a lot; but when the workload varied, the algorithm wasn't
>> very successful at detecting the load and the step direction (I used
>> a variation of the same basic algorithm from the mlx5 driver that
>> net_dim is based on).
>>
>> Also, QD=1 resulted in higher latency, as the algorithm kept
>> oscillating between the two lowest levels. So I guess this needs to
>> undergo a thorough performance evaluation for both steady and varying
>> workloads before we can consider this.
>>
>> Overall, I think it's a great idea to add this to the rdma subsystem,
>> but we cannot make it the default, especially without being able
>> to turn it off. So this needs to be opt-in with a sysctl option.
>
> We can add a flag in the create_cq command, e.g.
> try_coalescing_is_possible, instead of a module parameter of course.
>
> Storage ULPs can set it to true.
>
> Also, in the internal review Yamin added a table that summarizes all
> the testing that was done using NVMeoF (I guess it somehow didn't
> make it into this RFC).
>
> I guess we can do the same for iSER to get more confidence, and then
> set both to create a modifiable cq (if the HCA supports it, of course).
>
> Agreed?
>
I think that adding a flag in create_cq will be less clean, as it will
require extra work from anyone writing applications who should not have
to consider this feature.

Based on the results I saw during testing, I would set it to work by
default, as I could not find a use case where it significantly reduces
performance, and in many cases it is a large improvement. It should be
more of an opt-out situation.

Performance improvement (ConnectX-5 100GbE, x86) running a FIO benchmark
over NVMf between two equal end-hosts with 56 cores across a Mellanox
switch, using a null_blk device:

IO READS before:
blk size | BW        | IOPS | 99th percentile latency
512B     | 3.2GiB/s  | 6.6M | 1549  usec
4k       | 7.2GiB/s  | 1.8M | 7177  usec
64k      | 10.7GiB/s | 176k | 82314 usec

IO READS after:
blk size | BW        | IOPS | 99th percentile latency
512B     | 4.2GiB/s  | 8.6M | 1729   usec
4k       | 10.5GiB/s | 2.7M | 5669   usec
64k      | 10.7GiB/s | 176k | 102000 usec

IO WRITES before:
blk size | BW        | IOPS | 99th percentile latency
512B     | 3GiB/s    | 6.2M | 2573  usec
4k       | 7.2GiB/s  | 1.8M | 5342  usec
64k      | 10.7GiB/s | 176k | 62129 usec

IO WRITES after:
blk size | BW        | IOPS  | 99th percentile latency
512B     | 4.2GiB/s  | 8.6M  | 938   usec
4k       | 10.2GiB/s | 2.68M | 2769  usec
64k      | 10.6GiB/s | 173k  | 87557 usec

It doesn't really make a difference to me how the option is implemented,
but I think it makes more sense to have it handled on our side, such as
with a module parameter, rather than something like a flag that has a
larger radius of effect.
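
Something as simple as this on the core side would be enough (a sketch
only; the parameter name and placement are made up):

#include <linux/module.h>
#include <linux/moduleparam.h>

/* hypothetical opt-out knob, default on as argued above */
static bool use_cq_dim = true;
module_param(use_cq_dim, bool, 0444);
MODULE_PARM_DESC(use_cq_dim,
		 "Use adaptive interrupt moderation (DIM) on CQs (default: on)");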

>>
>>
>> Moreover, not every device supports CQ moderation, so you need to
>> check the device capabilities before applying any of this.
>
> for sure.
>
>
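
For what it's worth, the verbs layer already has the entry point for
applying the parameters (ib_modify_cq()), so the capability check could
look roughly like this - a sketch only, and using ops.modify_cq as the
capability test is my assumption, it may differ per kernel version:

#include <rdma/ib_verbs.h>

/* apply a chosen moderation profile only when the device actually
 * implements CQ moderation */
static void apply_cq_moderation(struct ib_cq *cq, u16 comps, u16 usec)
{
	if (!cq->device->ops.modify_cq)
		return;	/* no CQ moderation support on this device */

	ib_modify_cq(cq, comps, usec);
}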



