> On Sep 20, 2018, at 4:22 PM, Eric Dumazet <eric.dumazet@xxxxxxxxx> wrote:
>
>
> On 09/20/2018 03:42 PM, Song Liu wrote:
>>
>>
>>> On Sep 20, 2018, at 2:01 PM, Jeff Kirsher <jeffrey.t.kirsher@xxxxxxxxx> wrote:
>>>
>>> On Thu, 2018-09-20 at 13:35 -0700, Eric Dumazet wrote:
>>>> On 09/20/2018 12:01 PM, Song Liu wrote:
>>>>> The NIC driver should only enable interrupts when napi_complete_done()
>>>>> returns true. This patch adds the check for ixgbe.
>>>>>
>>>>> Cc: stable@xxxxxxxxxxxxxxx # 4.10+
>>>>> Cc: Jeff Kirsher <jeffrey.t.kirsher@xxxxxxxxx>
>>>>> Suggested-by: Eric Dumazet <edumazet@xxxxxxxxxx>
>>>>> Signed-off-by: Song Liu <songliubraving@xxxxxx>
>>>>> ---
>>>>
>>>>
>>>> Well, unfortunately we do not know why this is needed,
>>>> which is why I have not yet sent this patch formally.
>>>>
>>>> netpoll has correct synchronization:
>>>>
>>>> poll_napi() places the current cpu number into napi->poll_owner before
>>>> calling poll_one_napi().
>>>>
>>>> netpoll_poll_lock() also uses napi->poll_owner.
>>>>
>>>> When netpoll calls the ixgbe poll() method, it passes a budget of 0,
>>>> meaning napi_complete_done() is not called.
>>>>
>>>> As long as we cannot explain the problem properly in the changelog,
>>>> we should investigate; otherwise we will probably see dozens of patches
>>>> coming that try to fix a 'potential hazard'.
>>>
>>> Agreed, which is why I have our validation and developers looking into it,
>>> while we test the current patch from Song.
>>
>> I figured out what the issue is here, and I have a proposal to fix it. I
>> have verified that it fixes the issue in our tests. But Alexei suggests
>> that it may not be the right way to fix it.
>>
>> Here is what happened:
>>
>> netpoll tries to send an skb with netpoll_start_xmit(). If that fails, it
>> calls netpoll_poll_dev(), which calls ndo_poll_controller(). Then, in
>> the driver, ndo_poll_controller() calls napi_schedule() for ALL NAPIs
>> within the same NIC.
>>
>> This is problematic, because in the end napi_schedule() calls:
>>
>>     ____napi_schedule(this_cpu_ptr(&softnet_data), n);
>>
>> which attaches these NAPIs to softnet_data on THIS CPU. This is done
>> via napi->poll_list.
>>
>> Then suddenly ksoftirqd on this CPU owns multiple NAPIs, and it will
>> not give up ownership until it calls napi_complete_done(). However,
>> on a very busy server we usually use 16 CPUs to poll NAPI, so this one
>> CPU can easily be overloaded. As a result, each call of napi->poll()
>> hits the budget (of 64), napi_complete_done() is never called, and the
>> NAPIs stay on the poll_list of this CPU.
>>
>> When this happens, the host usually cannot get out of this state until
>> we throttle/stop client traffic.
>>
>>
>> I am pretty confident this is what happened. Please let me know if
>> anything above doesn't make sense.
>>
>>
>> Here is my proposal to fix it: instead of polling all NAPIs within one
>> NIC, I would have netpoll poll only the NAPI that will free space
>> for netpoll_start_xmit(). I attached my two RFC patches to the end of
>> this email.
>>
>> I chatted with Alexei about this. He thinks polling only one NAPI may
>> not guarantee that netpoll makes progress on the TX queue we are aiming
>> for. Also, the bigger problem may be the fact that NAPIs can get
>> pinned to one CPU and never get released.
>>
>> At this point, I really don't know what the best way to fix this is.
>>
>> I will also work on a repro with netperf.
>
> Thanks !
>
>>
>> Please let me know your suggestions.
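
To make my description above concrete: the ndo_poll_controller() pattern in
a multi-queue NAPI driver looks roughly like the sketch below. The adapter
struct and the "sketch_" names are made up for illustration; this is not the
exact ixgbe code.

/* Illustrative sketch only (made-up names, not the real ixgbe code):
 * a typical ndo_poll_controller() schedules the NAPI of every queue
 * vector of the NIC, and napi_schedule() appends each of them to the
 * poll_list of the CPU that happens to be running netpoll.
 */
static void sketch_ndo_poll_controller(struct net_device *netdev)
{
	struct sketch_adapter *adapter = netdev_priv(netdev);
	int i;

	for (i = 0; i < adapter->num_q_vectors; i++)
		napi_schedule(&adapter->q_vector[i]->napi);
}
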
>>
>
> Yeah, maybe NICs using NAPI should not provide an ndo_poll_controller() method at all,
> since it is very risky (it can potentially grab many NAPIs and end up in this locked situation).
>
> poll_napi() could attempt to free skbs one napi at a time,
> without the current cpu stealing all the NAPIs.
>
>
> diff --git a/net/core/netpoll.c b/net/core/netpoll.c
> index 57557a6a950cc9cdff959391576a03381d328c1a..a992971d366090ba69d5c1af32eadd554d6880cf 100644
> --- a/net/core/netpoll.c
> +++ b/net/core/netpoll.c
> @@ -205,13 +205,8 @@ static void netpoll_poll_dev(struct net_device *dev)
>  	}
>
>  	ops = dev->netdev_ops;
> -	if (!ops->ndo_poll_controller) {
> -		up(&ni->dev_lock);
> -		return;
> -	}
> -
> -	/* Process pending work on NIC */
> -	ops->ndo_poll_controller(dev);
> +	if (ops->ndo_poll_controller)
> +		ops->ndo_poll_controller(dev);
>
>  	poll_napi(dev);
>

I tried to totally skip ndo_poll_controller() here. It did avoid hitting the
issue. However, netpoll will then drop (fail to send) more packets.

Thanks,
Song
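
P.S. For completeness, what I tested is roughly the excerpt below from the
tail of netpoll_poll_dev(), on top of Eric's hunk above: skip
ndo_poll_controller() entirely, even when the driver provides one, and let
poll_napi() do the work. This is only a sketch of the experiment, not one of
the RFC patches I mentioned earlier.

	ops = dev->netdev_ops;

	/* Experiment: do not call ops->ndo_poll_controller() at all, so
	 * netpoll never schedules all of the NIC's NAPIs onto this CPU.
	 * poll_napi() below still reaps TX completions one napi at a time,
	 * but in our tests netpoll dropped (failed to send) more packets
	 * this way.
	 */
	poll_napi(dev);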