Re: zero-copy between interfaces

Maxim Mikityanskiy <maximmi@xxxxxxxxxxxx> · Mon, 27 Jan 2020 14:01:48 +0000

On 2020-01-22 23:43, Ryan Goodfellow wrote:
> On Tue, Jan 21, 2020 at 01:40:50PM +0000, Maxim Mikityanskiy wrote:
>>>> I've posted output from the program in debugging mode here
>>>>
>>>> - https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930375
>>>>
>>>> Yes, you are correct in that forwarding works for a brief period and then stops.
>>>> I've noticed that the number of packets that are forwarded is equal to the size
>>>> of the producer/consumer descriptor rings. I've posted two ping traces from a
>>>> client ping that shows this.
>>>>
>>>> - https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930376
>>>> - https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930377
>>
>> These snippets are not available.
> 
> Apologies, I had the wrong permissions set. They should be available now.
> 
>>
>>>>
>>>> I've also noticed that when the forwarding stops, the CPU usage for the proc
>>>> running the program is pegged, which is not the norm for this program as it uses
>>>> a poll call with a timeout on the xsk fd.
>>
>> This information led me to a guess at what may be happening. On the RX
>> side, mlx5e allocates pages in bulk for performance reasons and to
>> leverage hardware features targeted at performance. In AF_XDP mode,
>> bulking of frames is also used (on x86, the bulk size is 64 with
>> striding RQ enabled, and 8 otherwise; however, these are implementation
>> details that might change later). If you don't put enough frames into
>> the XSK Fill Ring, the driver will keep demanding more frames and return
>> from poll() immediately. Basically, in the application, you should put
>> as many frames into the Fill Ring as you can. Please check if that could
>> be the root cause of your issue.
> 
> The code in this application makes an effort to replenish the fill ring as fast
> as possible. The basic loop of the application is to first check if there are
> any descriptors to be consumed from the completion queue or any descriptors that
> can be added to the fill queue, and only then to move on to moving packets
> through the rx and tx rings.
> 
> https://gitlab.com/mergetb/tech/network-emulation/kernel/blob/v5.5-moa/samples/bpf/xdpsock_multidev.c#L452-474

I reproduced your issue and investigated it. Here is what I found:

1. Commit df0ae6f78a45 (the one you found during bisect) introduces an
important behavioral change, which I had thought was not that important.
xskq_nb_avail used to return min(entries, dcnt), but after the change it
just returns entries, which may be as big as the ring size.

2. xskq_peek_addr updates q->ring->consumer only when q->cons_tail 
catches up with q->cons_head. So, before that patch and one previous 
patch, cons_head - cons_tail was not more than 16, so the consumer index 
was updated periodically. Now the consumer index is updated only when
the whole ring is exhausted.

3. The application can't replenish the fill ring if the consumer index
doesn't move. As a consequence, the application can't refill descriptors
while the driver is still using the ones already posted. This should
carry a performance penalty and may even lead to packet drops, because
the driver uses up all the descriptors before it advances the consumer
index, and then it has to wait until the application refills the ring,
busy-looping and losing packets in the meantime.

4. As mlx5e allocates frames in batches, the consequences are even more
severe: it's a deadlock where the driver waits for the application, and
vice versa. The driver never reaches the point where cons_tail becomes
equal to cons_head. E.g., if cons_tail + 3 == cons_head, and the batch
size requested by the driver is 8, the driver won't peek anything from
the fill ring; it waits for the difference between cons_tail and
cons_head to grow to at least 8. On the other hand, the application
can't put anything into the ring, because it still thinks that the
consumer index is 0. As cons_tail never reaches cons_head, the consumer
index doesn't get updated, hence the deadlock.
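
To make points 1 to 3 concrete, here is a toy user-space model of the index
bookkeeping. It is not the kernel code: struct toy_queue, the toy_* functions
and the ring size of 64 are made up for illustration, and the real
xskq_nb_avail works on a cached producer index rather than reading the
producer directly. The cap of 16 is taken from point 2.

```c
#include <assert.h>
#include <stdint.h>

#define RING_SIZE 64u /* hypothetical fill ring size */
#define OLD_BATCH 16u /* cap on the cached window before the change (point 2) */

/* Toy stand-in for the fill ring indices; not the kernel's struct. */
struct toy_queue {
	uint32_t producer;  /* shared index, advanced by the application */
	uint32_t consumer;  /* shared index, published by the kernel side */
	uint32_t cons_head; /* kernel-private: end of the cached window */
	uint32_t cons_tail; /* kernel-private: next entry to hand out */
};

/* capped != 0: old behaviour, min(entries, dcnt). capped == 0: just entries. */
uint32_t toy_nb_avail(const struct toy_queue *q, uint32_t dcnt, int capped)
{
	uint32_t entries = q->producer - q->cons_tail;

	return (capped && entries > dcnt) ? dcnt : entries;
}

/*
 * Hand out one entry. The shared consumer index is published only when
 * the cached window has been used up (cons_tail == cons_head).
 */
int toy_take_one(struct toy_queue *q, int capped)
{
	if (q->cons_tail == q->cons_head) {
		q->consumer = q->cons_tail;
		q->cons_head = q->cons_tail + toy_nb_avail(q, OLD_BATCH, capped);
	}
	if (q->cons_tail == q->cons_head)
		return 0; /* nothing posted */
	q->cons_tail++;
	return 1;
}

/* Free slots as the application computes them from the shared indices. */
uint32_t toy_app_free(const struct toy_queue *q)
{
	return RING_SIZE - (q->producer - q->consumer);
}

/*
 * Start with a completely full ring, let the kernel side take `taken`
 * entries, and report how many slots the application thinks it may refill.
 */
uint32_t toy_free_after(uint32_t taken, int capped)
{
	struct toy_queue q = { .producer = RING_SIZE };
	uint32_t i;

	assert(taken <= RING_SIZE);
	for (i = 0; i < taken; i++)
		toy_take_one(&q, capped);
	return toy_app_free(&q);
}
```

In this model, after 40 of 64 entries have been taken, toy_free_after(40, 1)
reports 32 refillable slots, while toy_free_after(40, 0) reports 0: with the
uncapped window the application sees no progress until the whole ring has
been drained.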

So, in my view, the decision to remove RX_BATCH_SIZE and periodic
updates of the consumer index was wrong. It totally breaks mlx5e, which
does batching, and it will affect the performance of any driver, because
the application can't refill the ring until it gets completely empty and
the driver starts waiting for frames. I suggest that periodic updates of
the consumer index should be re-added to xskq_cons_peek_addr.
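
As a rough illustration of why periodic updates help with the state from
point 4, here is a toy user-space model. It is only a sketch under
assumptions, not a proposed patch: struct toy_queue, the toy_* functions,
the ring size of 16, the publish period of 4, the availability check against
the producer index, and the initial odd-sized take of 5 (used only to reach
cons_tail + 3 == cons_head) are all hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define RING_SIZE     16u /* hypothetical fill ring size */
#define DRV_BATCH      8u /* batch size from the example in point 4 */
#define PUBLISH_EVERY  4u /* hypothetical period for publishing consumer */

/* Toy stand-in for the fill ring indices; not the kernel's struct. */
struct toy_queue {
	uint32_t producer;  /* shared index, advanced by the application */
	uint32_t consumer;  /* shared index, published by the kernel side */
	uint32_t cons_head; /* kernel-private: end of the cached window */
	uint32_t cons_tail; /* kernel-private: next entry to hand out */
};

/* Hand out one entry; with `periodic`, also publish consumer every few entries. */
int toy_take_one(struct toy_queue *q, int periodic)
{
	if (q->cons_tail == q->cons_head) {
		q->consumer = q->cons_tail;
		q->cons_head = q->producer; /* uncapped window, as after the change */
	}
	if (q->cons_tail == q->cons_head)
		return 0; /* nothing posted */
	q->cons_tail++;
	if (periodic && q->cons_tail - q->consumer >= PUBLISH_EVERY)
		q->consumer = q->cons_tail;
	return 1;
}

/* Driver side: take exactly n entries, or nothing if fewer are posted. */
int toy_drv_take(struct toy_queue *q, uint32_t n, int periodic)
{
	uint32_t i;

	if (q->producer - q->cons_tail < n)
		return 0;
	for (i = 0; i < n; i++)
		toy_take_one(q, periodic);
	return 1;
}

/* Application side: post as many entries as the shared indices allow. */
uint32_t toy_app_refill(struct toy_queue *q)
{
	uint32_t n = RING_SIZE - (q->producer - q->consumer);

	q->producer += n;
	return n;
}

/* Count how many driver batches succeed over `rounds` refill/take rounds. */
uint32_t toy_run(uint32_t rounds, int periodic)
{
	struct toy_queue q = { 0 };
	uint32_t batches = 0, r;

	toy_app_refill(&q);                    /* ring full, producer = 16 */
	toy_drv_take(&q, 5, periodic);         /* hypothetical misaligning take */
	toy_drv_take(&q, DRV_BATCH, periodic); /* cons_tail is now 13 */
	for (r = 0; r < rounds; r++) {
		toy_app_refill(&q);
		batches += (uint32_t)toy_drv_take(&q, DRV_BATCH, periodic);
	}
	return batches;
}
```

Without periodic publishing, toy_run(10, 0) returns 0: the application sees
consumer == 0 and a full ring, while the driver side sees only 3 entries and
keeps waiting for 8. With periodic publishing, toy_run(10, 1) returns 10,
because the application learns about consumed entries early enough to refill.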

Magnus, what do you think of the suggestion above?

Thanks,
Max

>>
>> I tracked this issue in our internal bug tracker in case we need to
>> perform actual debugging of mlx5e. I'm looking forward to your feedback
>> on my assumption above.
>>
>>>> The hardware I am using is a Mellanox ConnectX4 2x100G card (MCX416A-CCAT)
>>>> running the mlx5 driver.
>>
>> This one should run without striding RQ, please verify it with ethtool
>> --show-priv-flags (the flag name is rx_striding_rq).
> 
> I do not remember changing this option, so whatever the default is, is what it
> was running with. I am traveling this week and do not have access to these
> systems, but will ensure that this flag is set properly when I get back.
>