On Mon, Aug 28, 2017 at 09:14:20AM -0700, John Fastabend wrote:
> On 08/28/2017 09:02 AM, Andy Gospodarek wrote:
> > On Fri, Aug 25, 2017 at 08:28:55AM -0700, Michael Chan wrote:
> >> On Fri, Aug 25, 2017 at 8:10 AM, John Fastabend
> >> <john.fastabend@xxxxxxxxx> wrote:
> >>> On 08/25/2017 05:45 AM, Jesper Dangaard Brouer wrote:
> >>>> On Thu, 24 Aug 2017 20:36:28 -0700
> >>>> Michael Chan <michael.chan@xxxxxxxxxxxx> wrote:
> >>>>
> >>>>> On Wed, Aug 23, 2017 at 1:29 AM, Jesper Dangaard Brouer
> >>>>> <brouer@xxxxxxxxxx> wrote:
> >>>>>> On Tue, 22 Aug 2017 23:59:05 -0700
> >>>>>> Michael Chan <michael.chan@xxxxxxxxxxxx> wrote:
> >>>>>>
> >>>>>>> On Tue, Aug 22, 2017 at 6:06 PM, Alexander Duyck
> >>>>>>> <alexander.duyck@xxxxxxxxx> wrote:
> >>>>>>>> On Tue, Aug 22, 2017 at 1:04 PM, Michael Chan <michael.chan@xxxxxxxxxxxx> wrote:
> >>>>>>>>>
> >>>>>>>>> Right, but it's conceivable to add an API to "return" the buffer to
> >>>>>>>>> the input device, right?
> >>>>>>
> >>>>>> Yes, I would really like to see an API like this.
> >>>>>>
> >>>>>>>>
> >>>>>>>> You could, it is just added complexity. "just free the buffer" in
> >>>>>>>> ixgbe usually just amounts to one atomic operation to decrement the
> >>>>>>>> total page count since page recycling is already implemented in the
> >>>>>>>> driver. You still would have to unmap the buffer regardless of if you
> >>>>>>>> were recycling it or not so all you would save is 1.000015259 atomic
> >>>>>>>> operations per packet. The fraction is because once every 64K uses we
> >>>>>>>> have to bulk update the count on the page.
> >>>>>>>>
> >>>>>>>
> >>>>>>> If the buffer is returned to the input device, the input device can
> >>>>>>> keep the DMA mapping. All it needs to do is to dma_sync it back to
> >>>>>>> the input device when the buffer is returned.
> >>>>>>
> >>>>>> Yes, exactly, return to the input device. I really think we should
> >>>>>> work on a solution where we can keep the DMA mapping around.
> >>>>>> We have an opportunity here to make ndo_xdp_xmit TX queues use a specialized
> >>>>>> page return call, to achieve this. (I imagine other arch's have a high
> >>>>>> DMA overhead than Intel)
> >>>>>>
> >>>>>> I'm not sure how the API should look. The ixgbe recycle mechanism and
> >>>>>> splitting the page (into two packets) actually complicates things, and
> >>>>>> tie us into a page-refcnt based model. We could get around this by
> >>>>>> each driver implementing a page-return-callback, that allow us to
> >>>>>> return the page to the input device? Then, drivers implementing the
> >>>>>> 1-packet-per-page can simply check/read the page-refcnt, and if it is
> >>>>>> "1" DMA-sync and reuse it in the RX queue.
> >>>>>>
> >>>>>
> >>>>> Yeah, based on Alex' description, it's not clear to me whether ixgbe
> >>>>> redirecting to a non-intel NIC or vice versa will actually work. It
> >>>>> sounds like the output device has to make some assumptions about how
> >>>>> the page was allocated by the input device.
> >>>>
> >>>> Yes, exactly. We are tied into a page refcnt based scheme.
> >>>>
> >>>> Besides the ixgbe page recycle scheme (which keeps the DMA RX-mapping)
> >>>> is also tied to the RX queue size, plus how fast the pages are returned.
> >>>> This makes it very hard to tune. As I demonstrated, default ixgbe
> >>>> settings does not work well with XDP_REDIRECT. I needed to increase
> >>>> TX-ring size, but it broke page recycling (dropping perf from 13Mpps to
> >>>> 10Mpps) so I also needed it increase RX-ring size. But perf is best if
> >>>> RX-ring size is smaller, thus two contradicting tuning needed.
> >>>>
> >>>
> >>> The changes to decouple the ixgbe page recycle scheme (1pg per descriptor
> >>> split into two halves being the default) from the number of descriptors
> >>> doesn't look too bad IMO. It seems like it could be done by having some
> >>> extra pages allocated upfront and pulling those in when we need another
> >>> page.
> >>>
> >>> This would be a nice iterative step we could take on the existing API.
> >>>
> >>>>
> >>>>> With buffer return API,
> >>>>> each driver can cleanly recycle or free its own buffers properly.
> >>>>
> >>>> Yes, exactly. And RX-driver can implement a special memory model for
> >>>> this queue. E.g. RX-driver can know this is a dedicated XDP RX-queue
> >>>> which is never used for SKBs, thus opening for new RX memory models.
> >>>>
> >>>> Another advantage of a return API. There is also an opportunity for
> >>>> avoiding the DMA map on TX. As we need to know the from-device. Thus,
> >>>> we can add a DMA API, where we can query if the two devices uses the
> >>>> same DMA engine, and can reuse the same DMA address the RX-side already
> >>>> knows.
> >>>>
> >>>>
> >>>>> Let me discuss this further with Andy to see if we can come up with a
> >>>>> good scheme.
> >>>>
> >>>> Sound good, looking forward to hear what you come-up with :-)
> >>>>
> >>>
> >>> I guess by this thread we will see a broadcom nic with redirect support
> >>> soon ;)
> >>
> >> Yes, Andy actually has finished the coding for XDP_REDIRECT, but the
> >> buffer recycling scheme has some problems. We can make it work for
> >> Broadcom to Broadcom only, but we want a better solution.
> >
> > (Sorry for the radio silence I was AFK last week...)
> >
> > I finished it a little while ago, but Michael and I both have concerns
> > that in a heterogenous hardware setup one can quickly run into issues
> > and haven't had time to work-up a few solutions before bringing this up
> > formally. It also isn't a major problem until the second
> > optimized/native XDP driver appears on the scene.
> >
> > I can run a test where XDP redirects from an ixgbe <-> bnxt_en based
> > device I get OOM kills after only a few seconds, due to the lack of
> > feedback between the different drivers that the pointer to xdp->data can
> > be freed/reused/etc and the different buffer allocation schemes used.
> >
>
> hmm so how do you get OOM here, I expect the number of in-flight xdp
> bufs should be limited by the number of xdps that can be posted to the
> outgoing interface. If we are hitting OOM that _should_ mean the size of
> the tx queue is too large. Ixgbe should be free'ing the buffer if an error
> is returned from xdp xmit routines (will check this today). And bnxt should
> return an error if we hit some high water mark on xmit.

I reconfigured the hardware after I was done with the bnxt_en devel, but I
should be able to set it up and provide some more detail. Let me repro it
and debug a bit more.

>
> > Initially I did not think this was an issue and that xdp_do_flush_map()
> > would handle this, but I think there is still a need to be able to
> > signal back to the receiving device that the buffer allocated has been
> > xmitted by the transmitter and can be freed. Since there is really no
> > guarantee that completion of an XDP_REDIRECT action means that it is
> > safe to free area pointed to by xdp->data area that contains the packet
> > to be xmitted. Since the packet done interrupt handler in a driver
> > cannot signal back to the receiving driver that the buffer is now safe
> > to reuse/free there is a chance for trouble.
>
> There should be some high water mark on how many outstanding packets
> can be in-flight. At the moment I assumed this was something related to
> queue lengths a more explicit high water mark could be added to the xmit path
> and tracked in xdp infrastructure.
>
> >
> > I was hoping to spend some time this week cooking up a patch that just
> > did not allow use of XDP_REDIRECT when the ifindex of the outgoing
> > device did not match that of the device to which the XDP prog was
> > attached, but that probably is not worth the trouble when we would just
> > fix it for real.
> > (It would also require some really terrible hacks to
> > enforce this in the kernel when all that is being done is setting up a
> > map that contains the redirect table, so it is probably not useful.)
> >
>
> I would prefer to solve the problem vs limiting the implementation
>

Agreed.

> > The basic prototype would be something like this:
> >
> > (rx packet interrupt on eth0, leads to napi_poll)
> > napi_poll (eth0)
> >   call xdp_prog (eth0)
> >     xdp_do_redirect (eth0)
> >       ndo_xdp_xmit (eth1)
> >         mark buffer with information netdev/ring/etc
> >         place buffer on tx ring for eth1
> >
> > (tx done interrupt on eth1, leads to napi_poll)
> > napi_poll (eth1)
> >   process tx interrupt (eth1)
> >     look up information about netdev/ring/etc
> >     ndo_xdp_data_free (eth0, ring, etc)
> >
> > Thoughts?
> >
>
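
For anyone who wants to poke at the prototype flow quoted above, here is a
minimal user-space sketch of it in plain C. It is only a model under stated
assumptions, not driver code: every sim_* name is hypothetical, the data_free
hook merely stands in for the proposed ndo_xdp_data_free (which is an idea
from this thread, not an existing kernel callback), and DMA, NAPI and real
descriptor rings are left out entirely. The TX ring is kept deliberately
small so the "xmit returns an error, owner frees the buffer" path is easy to
hit.

```c
#include <assert.h>
#include <stddef.h>

#define SIM_POOL_SIZE 4   /* RX buffers owned by each simulated device */
#define SIM_TX_RING   2   /* TX slots per simulated device */

struct sim_netdev;

/* One RX buffer; owner and rx_ring are the "netdev/ring/etc" marks. */
struct sim_buf {
    struct sim_netdev *owner; /* input device that allocated the buffer */
    int rx_ring;              /* RX ring the buffer was taken from */
    int in_use;               /* nonzero while outside the owner's pool */
};

struct sim_netdev {
    struct sim_buf pool[SIM_POOL_SIZE];
    struct sim_buf *tx_ring[SIM_TX_RING];
    int tx_pending;           /* buffers waiting for TX completion */
    int returned;             /* buffers handed back through data_free */
    /* Hypothetical hook modelled on the proposed ndo_xdp_data_free. */
    void (*data_free)(struct sim_netdev *dev, int ring, struct sim_buf *buf);
};

/* Owner-side return path; a real driver would dma_sync and repost here. */
void sim_data_free(struct sim_netdev *dev, int ring, struct sim_buf *buf)
{
    (void)ring;               /* single ring in this model */
    buf->in_use = 0;
    dev->returned++;
}

void sim_netdev_init(struct sim_netdev *dev)
{
    for (int i = 0; i < SIM_POOL_SIZE; i++) {
        dev->pool[i].owner = dev;
        dev->pool[i].rx_ring = 0;
        dev->pool[i].in_use = 0;
    }
    for (int i = 0; i < SIM_TX_RING; i++)
        dev->tx_ring[i] = NULL;
    dev->tx_pending = 0;
    dev->returned = 0;
    dev->data_free = sim_data_free;
}

/* RX side: take a free buffer from the device's own pool, or NULL. */
struct sim_buf *sim_rx_get(struct sim_netdev *dev, int ring)
{
    for (int i = 0; i < SIM_POOL_SIZE; i++) {
        if (!dev->pool[i].in_use) {
            dev->pool[i].in_use = 1;
            dev->pool[i].rx_ring = ring;
            return &dev->pool[i];
        }
    }
    return NULL;
}

/* Stand-in for ndo_xdp_xmit: returns -1 when the TX ring is full. */
int sim_xdp_xmit(struct sim_netdev *out, struct sim_buf *buf)
{
    if (out->tx_pending == SIM_TX_RING)
        return -1;
    out->tx_ring[out->tx_pending++] = buf;
    return 0;
}

/* TX-done processing: hand every completed buffer back to its owner. */
int sim_tx_complete(struct sim_netdev *out)
{
    int done = out->tx_pending;

    for (int i = 0; i < done; i++) {
        struct sim_buf *buf = out->tx_ring[i];

        assert(buf && buf->owner);
        buf->owner->data_free(buf->owner, buf->rx_ring, buf);
        out->tx_ring[i] = NULL;
    }
    out->tx_pending = 0;
    return done;
}

/* Redirect one packet from in to out; 0 on success, -1 if dropped. */
int sim_redirect(struct sim_netdev *in, struct sim_netdev *out)
{
    struct sim_buf *buf = sim_rx_get(in, 0);

    if (!buf)
        return -1;                            /* RX pool exhausted */
    if (sim_xdp_xmit(out, buf) < 0) {
        in->data_free(in, buf->rx_ring, buf); /* xmit failed: owner frees */
        return -1;
    }
    return 0;
}

/* Three redirects into a two-slot ring, then one TX completion pass. */
int sim_demo(void)
{
    struct sim_netdev eth0, eth1;

    sim_netdev_init(&eth0);
    sim_netdev_init(&eth1);
    sim_redirect(&eth0, &eth1);   /* queued on eth1 */
    sim_redirect(&eth0, &eth1);   /* queued on eth1 */
    sim_redirect(&eth0, &eth1);   /* eth1 ring full, eth0 frees at once */
    sim_tx_complete(&eth1);       /* both queued buffers go back to eth0 */
    return eth0.returned;
}
```

Calling sim_demo() from a main() walks the whole path once. The point of the
model is that the transmitting side never frees or recycles a buffer itself:
it only looks up the owner recorded at xmit time and calls back into it, so
the two devices are free to use different allocation schemes.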