On Tue, Nov 24, 2020 at 10:01 AM Magnus Karlsson <magnus.karlsson@xxxxxxxxx> wrote:
>
> On Mon, Nov 23, 2020 at 4:21 PM Xuan Zhuo <xuanzhuo@xxxxxxxxxxxxxxxxx> wrote:
> >
> > On Mon, 23 Nov 2020 15:00:48 +0100, Magnus Karlsson <magnus.karlsson@xxxxxxxxx> wrote:
> > > On Wed, Nov 18, 2020 at 9:25 AM Xuan Zhuo <xuanzhuo@xxxxxxxxxxxxxxxxx> wrote:
> > > >
> > > > I tried to combine cq available and tx writeable into one event, but I
> > > > found it very difficult. Sometimes we care about the "available"
> > > > status of both, but sometimes we may only care about one, such as tx
> > > > writeable, because we can use an item from the fq to write to tx. This
> > > > kind of requirement may keep changing, and it may be necessary to set
> > > > it every time before entering xsk_poll, so setsockopt is not very
> > > > convenient. I increasingly feel that a new event may be a better
> > > > solution, such as EPOLLPRI; I think it can be used here, since after
> > > > all xsk should not have OOB data ^_^.
> > > >
> > > > However, two other problems were discovered during testing:
> > > >
> > > > * The mask returned by datagram_poll always contains EPOLLOUT
> > > > * It is not particularly reasonable to return EPOLLOUT based on tx
> > > >   not being full
> > > >
> > > > After fixing these two problems, I found that when the process is
> > > > woken up by EPOLLOUT, it can always get an item from the cq.
> > > >
> > > > The number of packets that the NIC can send at a time is actually
> > > > limited; suppose this value is "nic_num". Once the number of consumed
> > > > items in the tx queue is greater than nic_num, there must also be
> > > > newly recycled items in the cq from the NIC.
> > > >
> > > > In this way, as long as the tx ring configured by the user is large
> > > > enough, we never hit the situation where tx is already writeable but
> > > > no item can be obtained from the cq.
> > >
> > > I think the overall approach of tying this into poll() instead of
> > > setsockopt() is the right way to go. But we need a more robust
> > > solution. Your patch #3 also breaks backwards compatibility, and that
> > > is not allowed. Could you please post a simple code example of what
> > > you would like to do in user space? So you would like to wake up when
> > > there are entries in the cq that can be retrieved, and the reason you
> > > would like to do this is that you then know you can put some more
> > > entries into the Tx ring and they will get sent, as there now are free
> > > slots in the cq. Correct me if wrong. Would an event that wakes you up
> > > when there is both space in the Tx ring and space in the cq work? Is
> > > there a case in which we would like to be woken up when only the Tx
> > > ring is non-full? Maybe there is, as it might be beneficial to fill
> > > the Tx ring and, while doing that, some entries in the cq get
> > > completed and away the packets go. But it would be great if you could
> > > post some simple example code; it does not need to compile or
> > > anything. It can be pseudo code.
> > >
> > > It would also be good to know if your goal is max throughput, max
> > > burst size, or something else.
> > >
> > > Thanks: Magnus
> > >
> >
> > My goal is max pps. If possible, I would also increase the buffer size
> > appropriately to improve throughput, like pktgen.
> >
> > The code looks like this (tx and umem cq are both 1024 entries, and
> > this runs with zero copy):
> >
> > ```
> > void send_handler(xsk)
> > {
> >         char buf[22];
> >
> >         while (true) {
> >                 while (true) {
> >                         /* Break when there is no cq entry or tx is full. */
> >                         if (send_buf_to_tx_ring(xsk, buf, sizeof(buf)))
> >                                 break;
> >                 }
> >
> >                 if (sendto(xsk->fd))
> >                         break;
> >         }
> > }
> >
> > static int loop(int efd, xsk)
> > {
> >         struct epoll_event e[1024];
> >         struct epoll_event ee;
> >         int n, i;
> >
> >         ee.events = EPOLLOUT;
> >         ee.data.ptr = NULL;
> >
> >         epoll_ctl(efd, EPOLL_CTL_ADD, xsk->fd, &ee);
> >
> >         while (1) {
> >                 n = epoll_wait(efd, e, sizeof(e)/sizeof(e[0]), -1);
> >                 if (n <= 0)
> >                         continue;
> >
> >                 for (i = 0; i < n; ++i)
> >                         send_handler(xsk);
> >         }
> > }
> > ```
> >
> > 1. Now, since datagram_poll (which determines writeability with the
> >    sock_writeable() function) returns EPOLLOUT every time, epoll_wait
> >    always returns immediately (this drives the CPU to 100%).
>
> We should keep patch #1. We just need to make sure we do not break
> anything, as I am not familiar with the path after xsk_poll returns.
>
> > 2. After removing datagram_poll, since tx being full is a very short
> >    moment, "tx is not full" is almost always true, so epoll_wait still
> >    returns immediately.
> > 3. After epoll_wait returns, the app tries to get cq entries and write
> >    to tx again, but this time getting cq entries basically fails. My
> >    analysis is that the cq items have not yet come back from the NIC.
> >
> > Under normal circumstances, the criterion for the writeable event is
> > not whether the queue or buffer is completely full. TCP's criterion,
> > for example, is whether the free space is more than half.
> > This is the origin of my patch #2, and I found that after adding it,
> > the problems above no longer appear:
> > 1. epoll_wait no longer returns immediately
> > 2. Every time EPOLLOUT is received, cq entries can always be obtained
>
> Got it. Makes sense. And it is good that there is some precedent that
> you are not supposed to wake up when there is just one free slot.
> Instead you should wake up when a lot of them are free so you can
> insert a batch. So let us also keep patch #2, though I might have some
> comments on it; if so, I will reply to that patch.
>
> But patch #3 needs to go. How about you instead make the Tx ring
> double the size of the completion ring? Let us assume patches #1 and
> #2 are in place. You will get woken up when at least half the entries
> in the Tx ring are available. At this point, fill the Tx ring
> completely and after that start cleaning the completion ring.
> Hopefully by this time there will be a number of entries in there that
> can be cleaned up. Then you call sendto(). It might even be a good
> idea to do cq, Tx, cq in that order.
>
> I consider #1 and #2 bug fixes, so please base them on the bpf tree
> and note this in your mail header like this: "[PATCH bpf 0/3] xsk: fix
> for xsk_poll writeable".
>
> >
> > In addition:
> > What is the purpose of TX_BATCH_SIZE, and why should this
> > "restriction" exist? It causes a lot of trouble when programming
> > without zero copy.
>
> You are right, this is likely too low. I never thought of this as
> something that would be used as a fast path; it was only a generic
> fallback. But it need not be. Please produce a patch #3 that sets this
> to a higher value. We do need the limit though. How about 512?

Please produce one patch set with #1 and #2 based on the bpf tree and
keep the TX_BATCH_SIZE patch separate. It is only a performance
optimization and should be based on bpf-next and sent as a sole patch
on its own. Thanks!
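To make the cq, Tx, cq order concrete, a rough sketch is below. Treat
it purely as an illustration: it assumes the libbpf AF_XDP helpers
from tools/lib/bpf/xsk.h, and recycle_buffer() and fill_tx_ring() are
hypothetical application helpers, not part of any real API.

```
#include <sys/socket.h>
#include <bpf/xsk.h>            /* libbpf AF_XDP helpers (pre-libxdp) */

#define BATCH_SIZE 64

/* Minimal per-socket state for this sketch; a real program carries
 * this over from its setup code. */
struct xsk_info {
	struct xsk_ring_prod tx;  /* Tx ring, double the size of cq */
	struct xsk_ring_cons cq;  /* umem completion ring */
	int fd;                   /* from xsk_socket__fd() */
};

/* Hypothetical application helpers, assumed for this sketch: */
void recycle_buffer(__u64 addr);       /* return frame to free list */
void fill_tx_ring(struct xsk_info *x); /* fill Tx from free list    */

/* Reclaim completed buffers from the completion ring. */
static unsigned int clean_cq(struct xsk_info *x)
{
	unsigned int rcvd, i;
	__u32 idx = 0;

	rcvd = xsk_ring_cons__peek(&x->cq, BATCH_SIZE, &idx);
	for (i = 0; i < rcvd; i++)
		recycle_buffer(*xsk_ring_cons__comp_addr(&x->cq, idx + i));
	if (rcvd)
		xsk_ring_cons__release(&x->cq, rcvd);
	return rcvd;
}

/* One Tx round in the cq, Tx, cq order, followed by the kick. */
static void tx_round(struct xsk_info *x)
{
	clean_cq(x);     /* 1. reclaim what the NIC has completed    */
	fill_tx_ring(x); /* 2. fill the Tx ring completely           */
	clean_cq(x);     /* 3. more entries may have completed since */

	/* 4. kick the kernel so it starts sending */
	sendto(x->fd, NULL, 0, MSG_DONTWAIT, NULL, 0);
}
```

With the Tx ring twice the size of the cq, step 2 can always make
progress after a half-full wakeup, and step 3 often finds completions
that arrived while the ring was being filled.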
> If you are interested in improving the performance of the Tx SKB
> path, there might be other avenues to try. Here are some examples:
>
> * Batch dev_direct_xmit. Maybe skb lists can be used.
> * Do not unlock and lock for every single packet in dev_direct_xmit().
>   This can be combined with the above.
> * Use fragments instead of copying packets into the skb itself.
> * Can the bool "more" in netdev_start_xmit() be used to increase
>   performance?
>
> >
> > Thanks.
> >
> > >
> > > > Xuan Zhuo (3):
> > > >   xsk: replace datagram_poll by sock_poll_wait
> > > >   xsk: change the tx writeable condition
> > > >   xsk: set tx/rx the min entries
> > > >
> > > >  include/uapi/linux/if_xdp.h |  2 ++
> > > >  net/xdp/xsk.c               | 26 ++++++++++++++++++++++----
> > > >  net/xdp/xsk_queue.h         |  6 ++++++
> > > >  3 files changed, 30 insertions(+), 4 deletions(-)
> > > >
> > > > --
> > > > 1.8.3.1
> > > >
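For reference, the "free space more than half" writeable criterion
that patch #2 argues for could be sketched on the kernel side roughly
as below. This is a sketch of the idea only, assuming the
xskq_cons_present_entries() helper and the ring's nentries field from
net/xdp/xsk_queue.h; the actual patch may differ.

```
/* Sketch only: mirror TCP's "more than half free" criterion for the
 * AF_XDP Tx ring instead of "ring not completely full". Assumes
 * xskq_cons_present_entries() and xs->tx->nentries as found in
 * net/xdp/xsk_queue.h. */
static bool xsk_tx_writeable(struct xdp_sock *xs)
{
	/* Writeable only when at least half of the Tx ring is free, so
	 * a woken-up application can submit a whole batch at once. */
	if (xskq_cons_present_entries(xs->tx) > xs->tx->nentries / 2)
		return false;

	return true;
}
```

xsk_poll() would then report EPOLLOUT based on this test rather than
on the ring merely not being full, so with a 1024-entry Tx ring the
wakeup fires only once at least 512 slots are free.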