On Tue, Nov 24, 2020 at 10:01 AM Magnus Karlsson <magnus.karlsson@xxxxxxxxx> wrote:
>
> On Mon, Nov 23, 2020 at 4:21 PM Xuan Zhuo <xuanzhuo@xxxxxxxxxxxxxxxxx> wrote:
> >
> > On Mon, 23 Nov 2020 15:00:48 +0100, Magnus Karlsson <magnus.karlsson@xxxxxxxxx> wrote:
> > > On Wed, Nov 18, 2020 at 9:25 AM Xuan Zhuo <xuanzhuo@xxxxxxxxxxxxxxxxx> wrote:
> > > >
> > > > I tried to combine cq available and tx writeable into one event, but I
> > > > found it very difficult. Sometimes we care about the "available"
> > > > status of both, but sometimes we may only care about one, such as tx
> > > > writeable, because we can use an item from the fq to write to tx. This
> > > > kind of requirement may keep changing, and it may be necessary to set
> > > > it every time before entering xsk_poll, so setsockopt is not very
> > > > convenient. I increasingly feel that a new event may be a better
> > > > solution, such as EPOLLPRI; I think it can be used here, since after
> > > > all xsk should not have OOB data ^_^.
> > > >
> > > > However, two other problems were discovered during testing:
> > > >
> > > > * The mask returned by datagram_poll always contains EPOLLOUT
> > > > * It is not particularly reasonable to return EPOLLOUT based on tx
> > > >   not being full
> > > >
> > > > After fixing these two problems, I found that when the process is
> > > > woken up by EPOLLOUT, it can always get an item from the cq.
> > > >
> > > > The number of packets that the NIC can send at a time is actually
> > > > limited; suppose this value is "nic_num". Once the number of consumed
> > > > items in the tx queue is greater than nic_num, there must also be
> > > > newly recycled items in the cq from the NIC.
> > > >
> > > > In this way, as long as the tx ring configured by the user is large
> > > > enough, we never hit the situation where tx is already writeable but
> > > > no item can be obtained from the cq.
> > >
> > > I think the overall approach of tying this into poll() instead of
> > > setsockopt() is the right way to go. But we need a more robust
> > > solution. Your patch #3 also breaks backwards compatibility, and that
> > > is not allowed. Could you please post a simple code example of what
> > > you would like to do in user space? So you would like to wake up when
> > > there are entries in the cq that can be retrieved, and the reason you
> > > would like to do this is that you then know you can put some more
> > > entries into the Tx ring and they will get sent, as there now are free
> > > slots in the cq. Correct me if wrong. Would an event that wakes you up
> > > when there is both space in the Tx ring and space in the cq work? Is
> > > there a case in which we would like to be woken up when only the Tx
> > > ring is non-full? Maybe there is, as it might be beneficial to fill
> > > the Tx ring and, while doing that, some entries in the cq get
> > > completed and away the packets go. But it would be great if you could
> > > post some simple example code; it does not need to compile or
> > > anything. It can be pseudo code.
> > >
> > > It would also be good to know if your goal is max throughput, max
> > > burst size, or something else.
> > >
> > > Thanks: Magnus
> > >
> >
> > My goal is max pps. If possible, I would also increase the buffer size
> > appropriately to improve throughput, like pktgen.
> >
> > The code looks like this (tx and umem cq are both 1024 entries, and
> > this runs with zero copy):
> >
> > ```
> > void send_handler(xsk)
> > {
> >         char buf[22];
> >
> >         while (true) {
> >                 while (true) {
> >                         /* Break when there is no cq entry or tx is full. */
> >                         if (send_buf_to_tx_ring(xsk, buf, sizeof(buf)))
> >                                 break;
> >                 }
> >
> >                 if (sendto(xsk->fd))
> >                         break;
> >         }
> > }
> >
> > static int loop(int efd, xsk)
> > {
> >         struct epoll_event e[1024];
> >         struct epoll_event ee;
> >         int n, i;
> >
> >         ee.events = EPOLLOUT;
> >         ee.data.ptr = NULL;
> >
> >         epoll_ctl(efd, EPOLL_CTL_ADD, xsk->fd, &ee);
> >
> >         while (1) {
> >                 n = epoll_wait(efd, e, sizeof(e)/sizeof(e[0]), -1);
> >                 if (n <= 0)
> >                         continue;
> >
> >                 for (i = 0; i < n; ++i)
> >                         send_handler(xsk);
> >         }
> > }
> > ```
> >
> > 1. Now, since datagram_poll (which determines writeability with the
> >    sock_writeable() function) returns EPOLLOUT every time, epoll_wait
> >    always returns immediately (this drives the CPU to 100%).
>
> We should keep patch #1. We just need to make sure we do not break
> anything, as I am not familiar with the path after xsk_poll returns.
>
> > 2. After removing datagram_poll, since tx being full is a very short
> >    moment, "tx is not full" is almost always true, so epoll_wait still
> >    returns immediately.
> > 3. After epoll_wait returns, the app tries to get cq entries and write
> >    to tx again, but this time getting cq entries basically fails. My
> >    analysis is that the cq items have not yet come back from the NIC.
> >
> > Under normal circumstances, the criterion for the writeable event is
> > not whether the queue or buffer is completely full. TCP's criterion,
> > for example, is whether the free space is more than half.
> > This is the origin of my patch #2, and I found that after adding it,
> > the problems above no longer appear:
> > 1. epoll_wait no longer returns immediately
> > 2. Every time EPOLLOUT is received, cq entries can always be obtained
>
> Got it. Makes sense. And it is good that there is some precedent that
> you are not supposed to wake up when there is just one free slot.
> Instead you should wake up when a lot of them are free so you can
> insert a batch. So let us also keep patch #2, though I might have some
> comments on it; if so, I will reply to that patch.
>
> But patch #3 needs to go. How about you instead make the Tx ring
> double the size of the completion ring? Let us assume patches #1 and
> #2 are in place. You will get woken up when at least half the entries
> in the Tx ring are available. At this point, fill the Tx ring
> completely and after that start cleaning the completion ring.
> Hopefully by this time there will be a number of entries in there that
> can be cleaned up. Then you call sendto(). It might even be a good
> idea to do cq, Tx, cq in that order.
>
> I consider #1 and #2 bug fixes, so please base them on the bpf tree
> and note this in your mail header like this: "[PATCH bpf 0/3] xsk: fix
> for xsk_poll writeable".
>
> >
> > In addition:
> > What is the purpose of TX_BATCH_SIZE, and why should this
> > "restriction" exist? It causes a lot of trouble when programming
> > without zero copy.
>
> You are right, this is likely too low. I never thought of this as
> something that would be used as a fast path; it was only a generic
> fallback. But it need not be. Please produce a patch #3 that sets this
> to a higher value. We do need the limit though. How about 512?

Please produce one patch set with #1 and #2 based on the bpf tree and
keep the TX_BATCH_SIZE patch separate. It is only a performance
optimization and should be based on bpf-next and sent as a sole patch
on its own. Thanks!
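To make the cq, Tx, cq order concrete, a rough sketch is below. Treat
it purely as an illustration: it assumes the libbpf AF_XDP helpers
from tools/lib/bpf/xsk.h, and recycle_buffer() and fill_tx_ring() are
hypothetical application helpers, not part of any real API.

```
#include <sys/socket.h>
#include <bpf/xsk.h>            /* libbpf AF_XDP helpers (pre-libxdp) */

#define BATCH_SIZE 64

/* Minimal per-socket state for this sketch; a real program carries
 * this over from its setup code. */
struct xsk_info {
	struct xsk_ring_prod tx;  /* Tx ring, double the size of cq */
	struct xsk_ring_cons cq;  /* umem completion ring */
	int fd;                   /* from xsk_socket__fd() */
};

/* Hypothetical application helpers, assumed for this sketch: */
void recycle_buffer(__u64 addr);       /* return frame to free list */
void fill_tx_ring(struct xsk_info *x); /* fill Tx from free list    */

/* Reclaim completed buffers from the completion ring. */
static unsigned int clean_cq(struct xsk_info *x)
{
	unsigned int rcvd, i;
	__u32 idx = 0;

	rcvd = xsk_ring_cons__peek(&x->cq, BATCH_SIZE, &idx);
	for (i = 0; i < rcvd; i++)
		recycle_buffer(*xsk_ring_cons__comp_addr(&x->cq, idx + i));
	if (rcvd)
		xsk_ring_cons__release(&x->cq, rcvd);
	return rcvd;
}

/* One Tx round in the cq, Tx, cq order, followed by the kick. */
static void tx_round(struct xsk_info *x)
{
	clean_cq(x);     /* 1. reclaim what the NIC has completed    */
	fill_tx_ring(x); /* 2. fill the Tx ring completely           */
	clean_cq(x);     /* 3. more entries may have completed since */

	/* 4. kick the kernel so it starts sending */
	sendto(x->fd, NULL, 0, MSG_DONTWAIT, NULL, 0);
}
```

With the Tx ring twice the size of the cq, step 2 can always make
progress after a half-full wakeup, and step 3 often finds completions
that arrived while the ring was being filled.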
> If you are interested in improving the performance of the Tx SKB
> path, there might be other avenues to try. Here are some examples:
>
> * Batch dev_direct_xmit. Maybe skb lists can be used.
> * Do not unlock and lock for every single packet in dev_direct_xmit().
>   This can be combined with the above.
> * Use fragments instead of copying packets into the skb itself.
> * Can the bool "more" in netdev_start_xmit() be used to increase
>   performance?
>
> >
> > Thanks.
> >
> > >
> > > > Xuan Zhuo (3):
> > > >   xsk: replace datagram_poll by sock_poll_wait
> > > >   xsk: change the tx writeable condition
> > > >   xsk: set tx/rx the min entries
> > > >
> > > >  include/uapi/linux/if_xdp.h |  2 ++
> > > >  net/xdp/xsk.c               | 26 ++++++++++++++++++++++----
> > > >  net/xdp/xsk_queue.h         |  6 ++++++
> > > >  3 files changed, 30 insertions(+), 4 deletions(-)
> > > >
> > > > --
> > > > 1.8.3.1
> > > >
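For reference, the "free space more than half" writeable criterion
that patch #2 argues for could be sketched on the kernel side roughly
as below. This is a sketch of the idea only, assuming the
xskq_cons_present_entries() helper and the ring's nentries field from
net/xdp/xsk_queue.h; the actual patch may differ.

```
/* Sketch only: mirror TCP's "more than half free" criterion for the
 * AF_XDP Tx ring instead of "ring not completely full". Assumes
 * xskq_cons_present_entries() and xs->tx->nentries as found in
 * net/xdp/xsk_queue.h. */
static bool xsk_tx_writeable(struct xdp_sock *xs)
{
	/* Writeable only when at least half of the Tx ring is free, so
	 * a woken-up application can submit a whole batch at once. */
	if (xskq_cons_present_entries(xs->tx) > xs->tx->nentries / 2)
		return false;

	return true;
}
```

xsk_poll() would then report EPOLLOUT based on this test rather than
on the ring merely not being full, so with a 1024-entry Tx ring the
wakeup fires only once at least 512 slots are free.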