On Tue, Feb 4, 2020 at 5:10 PM Magnus Karlsson <magnus.karlsson@xxxxxxxxx> wrote:
>
> On Thu, Jan 30, 2020 at 12:40 PM Magnus Karlsson <magnus.karlsson@xxxxxxxxx> wrote:
> >
> > On Thu, Jan 30, 2020 at 10:59 AM Magnus Karlsson <magnus.karlsson@xxxxxxxxx> wrote:
> > >
> > > On Thu, Jan 30, 2020 at 10:37 AM Maxim Mikityanskiy <maximmi@xxxxxxxxxxxx> wrote:
> > > >
> > > > > -----Original Message-----
> > > > > From: Magnus Karlsson <magnus.karlsson@xxxxxxxxx>
> > > > > Sent: 27 January, 2020 17:55
> > > > > To: Maxim Mikityanskiy <maximmi@xxxxxxxxxxxx>
> > > > > Cc: Ryan Goodfellow <rgoodfel@xxxxxxx>; xdp-newbies@xxxxxxxxxxxxxxx; Tariq Toukan <tariqt@xxxxxxxxxxxx>; Saeed Mahameed <saeedm@xxxxxxxxxxxx>; Moshe Shemesh <moshe@xxxxxxxxxxxx>
> > > > > Subject: Re: zero-copy between interfaces
> > > > >
> > > > > On Mon, Jan 27, 2020 at 3:01 PM Maxim Mikityanskiy <maximmi@xxxxxxxxxxxx> wrote:
> > > > > >
> > > > > > On 2020-01-22 23:43, Ryan Goodfellow wrote:
> > > > > > > On Tue, Jan 21, 2020 at 01:40:50PM +0000, Maxim Mikityanskiy wrote:
> > > > > > >>>> I've posted output from the program in debugging mode here
> > > > > > >>>>
> > > > > > >>>> - https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930375
> > > > > > >>>>
> > > > > > >>>> Yes, you are correct in that forwarding works for a brief period and then stops. I've noticed that the number of packets that are forwarded is equal to the size of the producer/consumer descriptor rings. I've posted two ping traces from a client ping that show this.
> > > > > > >>>>
> > > > > > >>>> - https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930376
> > > > > > >>>> - https://gitlab.com/mergetb/tech/network-emulation/kernel/snippets/1930377
> > > > > > >>
> > > > > > >> These snippets are not available.
> > > > > > >
> > > > > > > Apologies, I had the wrong permissions set. They should be available now.
> > > > > > >
> > > > > > >>>> I've also noticed that when the forwarding stops, the CPU usage for the proc running the program is pegged, which is not the norm for this program as it uses a poll call with a timeout on the xsk fd.
> > > > > > >>
> > > > > > >> This information led me to a guess at what may be happening. On the RX side, mlx5e allocates pages in bulks for performance reasons and to leverage hardware features targeted at performance. In AF_XDP mode, bulking of frames is also used (on x86, the bulk size is 64 with striding RQ enabled, and 8 otherwise; however, these are implementation details that might change later). If you don't put enough frames on the XSK Fill Ring, the driver will keep demanding more frames and return from poll() immediately. Basically, in the application, you should put as many frames on the Fill Ring as you can. Please check if that could be the root cause of your issue.
> > > > > > >
> > > > > > > The code in this application makes an effort to replenish the fill ring as fast as possible.
> > > > > > > The basic loop of the application is to first check if there are any descriptors to be consumed from the completion queue or any descriptors that can be added to the fill queue, and only then to move on to moving packets through the rx and tx rings.
> > > > > > >
> > > > > > > https://gitlab.com/mergetb/tech/network-emulation/kernel/blob/v5.5-moa/samples/bpf/xdpsock_multidev.c#L452-474
> > > > > >
> > > > > > I reproduced your issue and did my investigation, and here is what I found:
> > > > > >
> > > > > > 1. Commit df0ae6f78a45 (that you found during bisect) introduces an important behavioral change (which I thought was not that important). xskq_nb_avail used to return min(entries, dcnt), but after the change it just returns entries, which may be as big as the ring size.
> > > > > >
> > > > > > 2. xskq_peek_addr updates q->ring->consumer only when q->cons_tail catches up with q->cons_head. So, before that patch and one previous patch, cons_head - cons_tail was not more than 16, so the consumer index was updated periodically. Now the consumer index is updated only when the whole ring is exhausted.
> > > > > >
> > > > > > 3. The application can't replenish the fill ring if the consumer index doesn't move. As a consequence, refilling the descriptors by the application can't happen in parallel with the driver using them. This should have some performance penalty and possibly even lead to packet drops, because the driver uses all the descriptors and only then advances the consumer index, and then it has to wait until the application refills the ring, busy-looping and losing packets.
> > > > >
> > > > > This will happen if user space always fills up the whole ring, which might or might not happen, depending on the app.
> > > >
> > > > Yes, that's right, and as far as I know, it's common to fill as many frames as the application can (there was no reason to do it any other way till now).
> > > > >
> > > > > With that said, it might provide better performance to update it once in a while. It might also be the case that we will get better performance with the new scheme if we only fill half the fill ring.
> > > >
> > > > Yes, it may improve performance. However, I don't think it's correct to set such a limitation on the app.
> >
> > Actually, a much worse limitation to put on an application is to say that you have to have a certain number of buffers on some ring for the zero-copy feature to work. For example, that we need at least 64 buffers on the fill ring for all the NIC cards to work in zero-copy mode. That would be a bad thing to have to put in the documentation. An OS is supposed to abstract away HW differences, and with this current limitation in your driver, it shines through for sure. What we would like to put in the documentation is a statement along the lines of: "for best performance, make sure you have plenty of buffers on the fill ring so that the NIC can work as efficiently as possible". Not a statement that it does not work on Mellanox unless you put enough buffers on the fill ring. So my advice (and wish) is that you fix this in your driver.
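To make the "plenty of buffers on the fill ring" advice above concrete, here is a minimal user-space sketch of the kind of replenish step being discussed, written against the libbpf xsk.h helpers. The frame-pool helpers (alloc_frame(), free_frame(), free_frames()) are hypothetical placeholders, not code taken from Ryan's application or from the xdpsock sample:

#include <linux/types.h>
#include <bpf/xsk.h>

/* Hypothetical frame pool owned by the application. */
extern __u64 alloc_frame(void);          /* pop a free umem address */
extern void free_frame(__u64 addr);      /* push an address back */
extern unsigned int free_frames(void);   /* number of free addresses */

/* Recycle completed Tx frames, then top up the fill ring with as many
 * frames as will fit, so the driver never runs dry.
 */
static void replenish(struct xsk_ring_cons *cq, struct xsk_ring_prod *fq)
{
	__u32 idx, i, n;

	/* 1. Drain the completion ring so those frames become reusable. */
	n = xsk_ring_cons__peek(cq, 64, &idx);
	for (i = 0; i < n; i++)
		free_frame(*xsk_ring_cons__comp_addr(cq, idx + i));
	if (n)
		xsk_ring_cons__release(cq, n);

	/* 2. Put as many frames as possible back on the fill ring.
	 * xsk_ring_prod__reserve() is all-or-nothing, so halve the
	 * request until it fits (or the fill ring is already full).
	 */
	for (n = free_frames(); n; n /= 2) {
		if (xsk_ring_prod__reserve(fq, n, &idx) != n)
			continue;
		for (i = 0; i < n; i++)
			*xsk_ring_prod__fill_addr(fq, idx + i) = alloc_frame();
		xsk_ring_prod__submit(fq, n);
		break;
	}
}

The xdpsock sample effectively does step 2 with a fixed batch of 64 per iteration, which is relevant to the pacing observation further down.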
> > With that said, I will still look into what is the best way to get at least the sample to work for you. But there is no way to make sure every single app works for you in zero-copy mode, unless you support an arbitrary number of buffers on the fill ring. I guess that sooner or later, a customer of yours will get into this situation one way or the other, so why not fix it now.
> >
> > /Magnus
> >
> > > I will examine what provides the best performance. On one hand there is the number of updates to shared ring state (minimized by the current scheme), on the other the ability for the user app to put buffers on the fill ring. Stating that putting e.g. half the packets on the fill ring provides better performance (for some application) is not a limitation. It is performance tuning advice :-).
>
> I have now made a set of measurements. First I just made a variation study using the xdpsock app, varying the number of packets the kernel can withdraw from a consumer ring (fill and Tx) before updating global state. For the 1 core case (app and driver on the same core), the more frequently you do this update, the better. The reason for this is that it costs very little to update the state since the application is not running. And it is beneficial for the app to have a freshly updated state when it starts to execute, as it can operate on more packets. For the 2 core case (app on one core, driver on another) it is the complete opposite: the fewer updates to global state, the better. The reason for this is that it costs a lot to update global state as it triggers cache coherency actions between the two cores.
>
> What I did then was to compare the current scheme, update only when grabbing new packets, to a new scheme where we also update the global consumer pointer when we are exiting Rx or Tx processing in the NAPI context. On two cores the current scheme gets 0.5 to 1 Mpps more in throughput than also updating the pointer at the end of NAPI. But for the 1 core case, the new scheme is the best and generates between 0.2 and 0.3 Mpps more in throughput than the current one. But all in all, the current scheme is more beneficial than the proposed one if we say that both the 1 core and the 2 core case are equally important.
>
> Something to note is that the xdpsock application only puts a batch size (64) of packets in the fill ring in every iteration, and this might lead to some good pacing for the current scheme and the 2 core case. I.e., we do not get into the case of the fill ring only being full or empty. But I will run this on some real apps to get some more results, and I know that Björn has an optimized xdpsock application that puts many more packets into the fill queue than 64. This optimization might actually make the new proposal (also updating at the end of NAPI) better and make the current scheme suffer. We will examine this further and get back.

Actually, after some more thought and discussions I think we should optimize for the 1 core case, since that is what gives the whole system the best performance, provided that you can scale your application by instantiation, that is. For a 4 core system, 4 x the 1 core performance > 2 x the 2 core performance by a lot. I think that the 1 core case is the one that is going to be used out there. At least that is what I hear and see. So, when the merge window opens I am going to submit a patch that updates the consumer pointer when we exit NAPI too, roughly along the lines of the sketch below.
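To illustrate the idea only (a kernel-style sketch with simplified, made-up names; this is not the actual net/xdp/xsk_queue.h code and not the patch itself):

/* Toy model of the fill/Tx ring consumer side in the kernel. */
struct xsk_cq {
	u32 cons_tail;       /* entries the driver has consumed locally */
	u32 cons_head;       /* last entry known to be available */
	u32 *ring_consumer;  /* consumer index shared with user space */
};

/* Today the shared consumer index is only published once the driver has
 * used up everything it grabbed, i.e. when cons_tail catches cons_head.
 */
static void cq_release_if_exhausted(struct xsk_cq *q)
{
	if (q->cons_tail == q->cons_head)
		WRITE_ONCE(*q->ring_consumer, q->cons_tail);
}

/* The proposal: additionally publish it when leaving NAPI, so user space
 * can start refilling while the driver still holds unconsumed entries.
 */
static void cq_release_on_napi_exit(struct xsk_cq *q)
{
	WRITE_ONCE(*q->ring_consumer, q->cons_tail);
}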
That will increase the performance of the 1 core case.

/Magnus

> /Magnus
>
> > > > > I will look into this and see what I get.
> > > > > >
> > > > > > 4. As mlx5e allocates frames in batches, the consequences are even more severe: it's a deadlock where the driver waits for the application, and vice versa. The driver never reaches the point where cons_tail gets equal to cons_head. E.g., if cons_tail + 3 == cons_head, and the batch size requested by the driver is 8, the driver won't peek anything from the fill ring, waiting for the difference between cons_tail and cons_head to increase to at least 8. On the other hand, the application can't put anything on the ring, because it still thinks that the consumer index is 0. As cons_tail never reaches cons_head, the consumer index doesn't get updated, hence the deadlock.
> > > > >
> > > > > Good thing that you detected this. Maybe I should get a Mellanox card :-). This is different from how we implemented Intel's drivers, which just work on any batch size. If they get 3 packets back, they will use those. How do you deal with the case when the application just puts a single buffer in the fill ring and wants to receive a single packet?
> > > >
> > > > mlx5e will wait until the full batch is available. As AF_XDP is intended for high-performance apps, this scenario is less expected. We prefer to leverage our performance features.
> > >
> > > That you cannot support all applications on top of AF_XDP with your zero-copy support seems broken to me. But I will give you that you might support all the ones you care about with your current batching support. Like you say, most apps will put plenty of buffers on the fill ring, so this should not be a problem. Can you not implement some slow path for these cases? You must have one for the skb case.
> > >
> > > > > Why does the Mellanox driver need a specific batch size? This is just for my understanding so we can find a good solution.
> > > >
> > > > The main reason is our performance feature called striding RQ. Skipping all irrelevant details, a batch of 64 pages is posted to the NIC with a single request, and the NIC fills them progressively. This feature is turned on by default on modern NICs, and it's really good for performance.
> > > >
> > > > It might be possible to post a smaller batch, though; I'm not sure about it, it needs to be checked, but anyway it's not something that we will likely do, because it's a complication of the data path, and if we know more frames are coming, it's much better for performance to wait for them, rather than to post several incomplete batches.
> > > > > >
> > > > > > So, in my vision, the decision to remove RX_BATCH_SIZE and periodic updates of the consumer index was wrong. It totally breaks mlx5e, which does batching, and it will affect the performance of any driver, because the application can't refill the ring until it gets completely empty and the driver starts waiting for frames. I suggest that periodic updates of the consumer index should be re-added to xskq_cons_peek_addr.
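Just to spell out what that suggestion would amount to, roughly (again an illustrative sketch on the toy struct xsk_cq from above, not Max's actual proposal and not the real xskq_cons_peek_addr code):

#define RX_BATCH_SIZE 16	/* the old constant, debated just below */

/* Publish the shared consumer index whenever the driver has locally
 * consumed at least RX_BATCH_SIZE entries since the last publish, or
 * when its cached range is exhausted, instead of only in the latter case.
 */
static void cq_release_periodic(struct xsk_cq *q)
{
	u32 published = READ_ONCE(*q->ring_consumer);

	if (q->cons_tail == q->cons_head ||
	    (u32)(q->cons_tail - published) >= RX_BATCH_SIZE)
		WRITE_ONCE(*q->ring_consumer, q->cons_tail);
}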
> > > > > The reason I wanted to remove RX_BATCH_SIZE is that application developers complained about it giving rise to counter-intuitive behavior in user space. I will try to dig out the complaints and the explanations Björn and I had to send about things that, it seemed, users really should not have to care about. It should just work.
> > > >
> > > > I think the counter that doesn't update till the very last moment and then advances by the ring size will also be something to complain about (and I am the first one to complain :D). Such bursts are not good in any case.
> > >
> > > Do you have any performance data that shows this scheme is bad for performance? The good thing about this scheme is that global state is updated less frequently. And the bad thing is what you mentioned. But which one has the greater effect is the question.
> > >
> > > > > I also do not like to have arbitrary constants like this. Why 16?
> > > >
> > > > I believe any batching mechanism has a constant that looks arbitrary. This constant should be the golden mean: if it's too small, there is little effect from batching; if it's too big, it gets too bursty.
> > > >
> > > > Basically, after your patch it just changed from 16 to the ring size. Maybe we can tie that constant to the ring size? Make it ring_size / another_arbitrary_constant? :)
> > >
> > > Agreed, I thought about this too. Something tied to ring_size might be a good compromise. Will examine this. But I want to base this on performance data, not idle speculation, so I need to run some experiments first.
> > >
> > > /Magnus
> > >
> > > > > Would much prefer not having to deal with this, unless of course it horribly breaks something or gives rise to worse performance. Might still be the case here, but if not, I would like to remove it.
> > > > >
> > > > > Thanks: Magnus
> > > > > >
> > > > > > Magnus, what do you think of the suggestion above?
> > > > > >
> > > > > > Thanks,
> > > > > > Max
> > > > > >
> > > > > > >> I tracked this issue in our internal bug tracker in case we need to perform actual debugging of mlx5e. I'm looking forward to your feedback on my assumption above.
> > > > > > >>
> > > > > > >>>> The hardware I am using is a Mellanox ConnectX4 2x100G card (MCX416A-CCAT) running the mlx5 driver.
> > > > > > >>
> > > > > > >> This one should run without striding RQ; please verify it with ethtool --show-priv-flags (the flag name is rx_striding_rq).
> > > > > > >
> > > > > > > I do not remember changing this option, so whatever the default is, is what it was running with. I am traveling this week and do not have access to these systems, but will ensure that this flag is set properly when I get back.
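Coming back to the ring_size / constant idea above, the variant to benchmark would look roughly like this (again only a sketch on the toy model from earlier; the divisor 8 is a placeholder, not a measured or agreed value):

static void cq_release_ring_scaled(struct xsk_cq *q, u32 ring_size)
{
	u32 threshold = ring_size / 8;	/* placeholder divisor */
	u32 published = READ_ONCE(*q->ring_consumer);

	/* Publish every "threshold" consumed entries, or when the cached
	 * range is exhausted, so the burstiness scales with the ring size.
	 */
	if (q->cons_tail == q->cons_head ||
	    (u32)(q->cons_tail - published) >= threshold)
		WRITE_ONCE(*q->ring_consumer, q->cons_tail);
}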