Re: [PATCH net-next] docs: networking: document NAPI

On Thu, 16 Mar 2023 16:16:39 -0700 Florian Fainelli wrote:
> Did it not stand for New API?

I think it did. But we've had an extra 20 years of software development
experience since then, and we now agree that naming things "new" or
"next" is a bad idea. So let's pretend it stands for nothing. Or NAPI API ;)

> > +NAPI processing usually happens in the software interrupt context,
> > +but user may choose to use separate kernel threads for NAPI processing.  
> 
> (called threaded NAPI)

I added a cross link:

but user may choose to use :ref:`separate kernel threads<threaded>`

(and same for the busy poll sentence).

> > +Control API
> > +-----------
> > +
> > +netif_napi_add() and netif_napi_del() add/remove a NAPI instance
> > +from the system. The instances are attached to the netdevice passed
> > +as argument (and will be deleted automatically when netdevice is
> > +unregistered). Instances are added in a disabled state.
> > +
> > +napi_enable() and napi_disable() manage the disabled state.
> > +A disabled NAPI can't be scheduled and its poll method is guaranteed
> > +to not be invoked. napi_disable() waits for ownership of the NAPI
> > +instance to be released.  
> 
> Might add a word that calling napi_disable() twice will deadlock? This 
> seems to be a frequent trap driver authors fall into.

Good point. I'll say that the APIs are not idempotent:

The control APIs are not idempotent. Control API calls are safe against
concurrent use of datapath APIs, but an incorrect sequence of control API
calls may result in crashes, deadlocks, or race conditions. For example,
calling napi_disable() multiple times in a row will deadlock.
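
To make it concrete, the expected lifecycle in a driver looks roughly
like below - untested sketch, the mydrv_* names are made up, and the
exact netif_napi_add() signature has varied between kernel versions:

    static int mydrv_poll(struct napi_struct *napi, int budget);

    struct mydrv_priv {
            struct napi_struct napi;
            /* ... */
    };

    static int mydrv_open(struct net_device *dev)
    {
            struct mydrv_priv *priv = netdev_priv(dev);

            /* Instances start out disabled. */
            netif_napi_add(dev, &priv->napi, mydrv_poll);
            napi_enable(&priv->napi);
            /* ... request the IRQ, start the queues ... */
            return 0;
    }

    static int mydrv_stop(struct net_device *dev)
    {
            struct mydrv_priv *priv = netdev_priv(dev);

            /* Call napi_disable() exactly once - it is not idempotent,
             * a second call will deadlock waiting for ownership. */
            napi_disable(&priv->napi);
            netif_napi_del(&priv->napi);
            return 0;
    }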

> > +Datapath API
> > +------------
> > +
> > +napi_schedule() is the basic method of scheduling a NAPI poll.
> > +Drivers should call this function in their interrupt handler
> > +(see :ref:`drv_sched` for more info). Successful call to napi_schedule()
> > +will take ownership of the NAPI instance.
> > +
> > +Some time after NAPI is scheduled driver's poll method will be
> > +called to process the events/packets. The method takes a ``budget``
> > +argument - drivers can process completions for any number of Tx
> > +packets but should only process up to ``budget`` number of
> > +Rx packets. Rx processing is usually much more expensive.  
> 
> In other words, it is recommended to ignore the budget argument when 
> performing TX buffer reclamation to ensure that the reclamation is not 
> arbitrarily bounded, however it is required to honor the budget argument 
> for RX processing.

Added verbatim.
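
To spell it out, the Tx/Rx split inside the poll method would look
roughly like below (rough sketch, the mydrv_* helpers are hypothetical;
completion handling is shown further down):

    static int mydrv_poll(struct napi_struct *napi, int budget)
    {
            struct mydrv_queue *q =
                    container_of(napi, struct mydrv_queue, napi);
            int work_done;

            /* Tx reclamation is not bounded by the budget. */
            mydrv_clean_tx(q);

            /* Rx processing must not exceed the budget
             * (which may be 0). */
            work_done = mydrv_clean_rx(q, budget);

            /* ... completion handling, see below ... */
            return work_done;
    }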

> > +.. warning::
> > +
> > +   ``budget`` may be 0 if core tries to only process Tx completions
> > +   and no Rx packets.
> > +
> > +The poll method returns amount of work performed.  
> 
> returns the amount of work.

Hm. Reads to me like we need an attributive(?) in this sentence.
"amount of work done" maybe? No?

> > +has outstanding work to do (e.g. ``budget`` was exhausted)
> > +the poll method should return exactly ``budget``. In that case
> > +the NAPI instance will be serviced/polled again (without the
> > +need to be scheduled).
> > +
> > +If event processing has been completed (all outstanding packets
> > +processed) the poll method should call napi_complete_done()
> > +before returning. napi_complete_done() releases the ownership
> > +of the instance.
> > +
> > +.. warning::
> > +
> > +   The case of finishing all events and using exactly ``budget``
> > +   must be handled carefully. There is no way to report this
> > +   (rare) condition to the stack, so the driver must either
> > +   not call napi_complete_done() and wait to be called again,
> > +   or return ``budget - 1``.
> > +
> > +   If ``budget`` is 0 napi_complete_done() should never be called.  
> 
> Can we detail when budget may be 0?

I was trying to avoid enshrining implementation details.
budget == 0 -> don't process Rx, don't ask why.
In practice AFAIK it's only done by netpoll. I don't think AF_XDP 
does it.
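
FWIW the usual completion idiom covers both the budget-exhausted and
the budget == 0 cases; the tail of the mydrv_poll() sketch from above
would be something like (mydrv_unmask_irq() is again made up):

            /* Only complete if we did less work than the budget.
             * With budget == 0, work_done is also 0, so the condition
             * is false and napi_complete_done() is never called.
             * With work_done == budget we return budget and simply
             * wait to be polled again. */
            if (work_done < budget && napi_complete_done(napi, work_done))
                    mydrv_unmask_irq(q);    /* re-arm the interrupt */

            return work_done;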

> > +Call sequence
> > +-------------
> > +
> > +Drivers should not make assumptions about the exact sequencing
> > +of calls. The poll method may be called without driver scheduling
> > +the instance (unless the instance is disabled). Similarly if

s/if//

> > +it's not guaranteed that the poll method will be called, even
> > +if napi_schedule() succeeded (e.g. if the instance gets disabled).  
> 
> You lost me there, it seems to me that what you mean to say is that:
> 
> - drivers should ensure that past the point where they call 
> netif_napi_add(), any software context referenced by the NAPI poll 
> function should be fully set-up
> 
> - it is not guaranteed that the NAPI poll function will not be called 
> once netif_napi_disable() returns

That is guaranteed. What's not guaranteed is a 1:1 relationship between
napi_schedule() and napi->poll(). With busy polling we'll see
napi->poll() without there ever being an interrupt. And the inverse may
also be true, where napi_schedule() is done but the polling never
happens.

I'm trying to make sure nobody tries to split the logic between the IRQ
handler and napi->poll(), expecting 1:1.
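
IOW the interrupt handler should do little more than mask the IRQ and
schedule, along these lines (again a sketch, mydrv_* helpers made up):

    static irqreturn_t mydrv_irq(int irq, void *data)
    {
            struct mydrv_queue *q = data;

            mydrv_mask_irq(q);
            /* The poll method may run without this call (busy polling)
             * and this call does not guarantee a poll will follow. */
            napi_schedule(&q->napi);
            return IRQ_HANDLED;
    }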

> > +NAPI instances most often correspond 1:1:1 to interrupts and queue pairs
> > +(queue pair is a set of a single Rx and single Tx queue).  
> 
> correspond to.

1:1:1 is meant as an attributive(?), describing the relationship
as 3-way 1:1.

> > As is the case with any busy polling it trades
> > +off CPU cycles for lower latency (in fact production uses of NAPI busy
> > +polling are not well known).  
> 
> Did not this originate via Intel at the request of financial companies 
> doing high speed trading? Have they moved entirely away from busy 
> polling nowadays?

No idea. If someone knows of prod use please speak up? 🤷️
I feel like the theoretical excitement about this feature does 
not match its impact :S

> > The configuration is per netdevice and will affect all
> > +NAPI instances of that device. Each NAPI instance will spawn a separate
> > +thread (called ``napi/${ifc-name}-${napi-id}``).
> > +
> > +It is recommended to pin each kernel thread to a single CPU, the same
> > +CPU as services the interrupt. Note that the mapping between IRQs and
> > +NAPI instances may not be trivial (and is driver dependent).
> > +The NAPI instance IDs will be assigned in the opposite order
> > +than the process IDs of the kernel threads.  
> 
> Device drivers may opt for threaded NAPI behavior by default by calling 
> dev_set_threaded(.., true)

Let's not advertise it too widely... We'd need to describe under what
conditions it's okay to opt in by default.

Fixed all the points which I'm not quoting. Thanks!!



