On 11/7/2023 7:31 AM, Toke Høiland-Jørgensen wrote:
"Nelson, Shannon" <shannon.nelson@xxxxxxx> writes:
While testing new code to support XDP in the ionic driver, we found
that we could panic the kernel by running a bind/unbind loop on the
target interface of an xdp_redirect action. Obviously this is a
stress test that abuses the system, but it does point to a race
window in bq_enqueue() and bq_xmit_all(). I believe that although the
target interface's validity is checked in __xdp_enqueue(), the
interface can be unbound by the time bq_enqueue() or bq_xmit_all()
tries to use it. No lock is held and no reference is taken on the
interface to keep it in place before the target's ndo_xdp_xmit() is
called.
Below is a stack trace that our tester captured while running our test
code on a RHEL 9.2 kernel – yes, I know, unpublished driver code on a
non-upstream kernel. But if you look at the current upstream code in
kernel/bpf/devmap.c I think you can see what we ran into.
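To spell out the path I'm looking at, bq_xmit_all() is roughly shaped
like this as I read the upstream code (heavily trimmed paraphrase,
not verbatim; types as in devmap.c, error handling and tracepoints
dropped):

/* Trimmed paraphrase of bq_xmit_all() from kernel/bpf/devmap.c, not
 * the verbatim upstream code.  The devmap layer dereferences the
 * target netdev here without holding its own reference on it.
 */
static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
{
	struct net_device *dev = bq->dev;
	int cnt = bq->count;
	int sent;

	if (unlikely(!cnt))
		return;

	/* If the target has been unbound between __xdp_enqueue()
	 * validating it and this flush running, this is where I think
	 * we blow up.
	 */
	sent = dev->netdev_ops->ndo_xdp_xmit(dev, cnt, bq->q, flags);

	/* ... completion accounting and error handling trimmed ... */
	bq->count = 0;
}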
Other than telling users to not abuse the system with a bind/unbind
loop, is there something we can do to limit the potential pain here?
Without knowing what interfaces might be targeted by the users’ XDP
programs, is there a step the originating driver can take as a
precaution? Did we simply miss a step in the driver, or is this an
actual problem in the devmap code?
Sounds like a driver bug :)
Entirely possible, wouldn't be our first ... :-)
The XDP redirect flow guarantees that all outstanding packets are
flushed within a single NAPI cycle, as documented here:
https://docs.kernel.org/bpf/redirect.html
So basically, the driver should be doing a two-step teardown: remove
global visibility of the resource in question, wait for all concurrent
users to finish, and *then* free the data structure. This corresponds to
the usual RCU protection: resources should be kept alive until all
concurrent RCU critical sections have exited on all CPUs. So if your
driver is removing an interface's data structure without waiting for
concurrent NAPI cycles to finish, that's a bug in the driver.
This kind of thing is what the synchronize_net() function is for; for a
usage example, see veth_napi_del_range(). My guess would be that you're
missing this as part of your driver teardown flow?
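Schematically, the ordering looks something like this; the example_*
names are made up and just stand in for whatever your driver uses, so
this is only meant to illustrate the pattern, not to be your actual
code:

#include <linux/netdevice.h>

struct example_priv {
	struct net_device *netdev;
	struct napi_struct napi;
};

/* stand-in for whatever frees the TX resources ndo_xdp_xmit uses */
static void example_free_tx_ring(struct example_priv *priv)
{
	/* driver-specific ring teardown would go here */
}

static void example_queue_teardown(struct example_priv *priv)
{
	/* 1) remove global visibility: no new TX, no new NAPI polls */
	netif_tx_disable(priv->netdev);
	napi_disable(&priv->napi);

	/* 2) wait for concurrent NAPI cycles / RCU readers on all
	 * CPUs; anything redirected to this device before this point
	 * has been flushed by the source driver's xdp_do_flush()
	 */
	synchronize_net();

	/* 3) only now free what ndo_xdp_xmit was using */
	example_free_tx_ring(priv);
}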
Essentially, the first thing we do in the remove function is to call
unregister_netdev(), which has synchronize_net() in its path, so I
don't think this is missing from our scenario, but thanks for the
hint. I do see that a couple of net drivers are more aggressive about
calling synchronize_net() directly in other parts of their logic; I
don't think that has a bearing on this issue, but I'll keep it in
mind.
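For reference, the shape of our remove path is roughly this
(placeholder mydrv_* names rather than the actual ionic code):

#include <linux/pci.h>
#include <linux/netdevice.h>

struct mydrv_priv {
	struct net_device *netdev;
};

/* stand-in for the driver-specific queue/ring teardown */
static void mydrv_free_queues(struct mydrv_priv *priv)
{
}

static void mydrv_remove(struct pci_dev *pdev)
{
	struct mydrv_priv *priv = pci_get_drvdata(pdev);

	/* unregister_netdev() goes through the core unregister path,
	 * which includes synchronize_net(), before returning to us
	 */
	unregister_netdev(priv->netdev);

	/* the tx/rx rings are only torn down after that */
	mydrv_free_queues(priv);
	free_netdev(priv->netdev);
}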
Another source of a bug like this could be that your driver does not in
fact call xdp_do_flush() before exiting its NAPI cycle, so that there
will be packets from the previous cycle in the bq queue, in which case
the assumption mentioned in the linked document obviously breaks down.
But that would also be a driver bug :)
We do call xdp_do_flush() at the end of the NAPI cycle, just before
calling napi_complete_done().
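i.e. the tail of our poll routine looks roughly like this
(simplified, placeholder mydrv_* names rather than the actual ionic
code):

#include <linux/netdevice.h>
#include <linux/filter.h>	/* xdp_do_flush() */

/* stand-in for the RX processing that may call xdp_do_redirect() */
static int mydrv_rx_clean(struct napi_struct *napi, int budget);

static int mydrv_napi_poll(struct napi_struct *napi, int budget)
{
	int work_done = mydrv_rx_clean(napi, budget);

	/* flush this CPU's redirect bulk queues before ending the cycle */
	xdp_do_flush();

	if (work_done < budget)
		napi_complete_done(napi, work_done);

	return work_done;
}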
-Toke
Thanks for the notes - I'll have our tester spend some more time with
this using other drivers/interfaces as the targets to see if we can
gather more information on the scenario.
sln