Florian Kauer <florian.kauer@xxxxxxxxxxxxx> writes: > On 9/11/24 11:25, Toke Høiland-Jørgensen wrote: >> Florian Kauer <florian.kauer@xxxxxxxxxxxxx> writes: >> >>> Hi Toke, >>> >>> On 9/10/24 10:34, Toke Høiland-Jørgensen wrote: >>>> Florian Kauer <florian.kauer@xxxxxxxxxxxxx> writes: >>>> >>>>> Currently, returning XDP_REDIRECT from an xdp/devmap program >>>>> is considered an invalid action and an exception is traced. >>>>> >>>>> While it might seem counterintuitive to redirect in an xdp/devmap >>>>> program (why not just redirect to the correct interface in the >>>>> first program?), we faced several use cases where supporting >>>>> this would be very useful. >>>>> >>>>> Most importantly, they occur when the first redirect is used >>>>> with the BPF_F_BROADCAST flag. Using this together with xdp/devmap >>>>> programs makes it possible to perform different actions on clones of >>>>> the same incoming frame. In that case, it is often useful >>>>> to redirect either to a different CPU or device AFTER the cloning. >>>>> >>>>> For example: >>>>> - Replicate the frame (for redundancy according to IEEE 802.1CB FRER) >>>>> and then use the second redirect with a cpumap to select >>>>> the path-specific egress queue. >>>>> >>>>> - Also, one of the paths might need an encapsulation that >>>>> exceeds the MTU. So a second redirect back to the ingress >>>>> interface can be used to send an ICMP FRAG_NEEDED packet. >>>>> >>>>> - For OAM purposes, you might want to send one frame with >>>>> OAM information back, while the original frame is passed forward. >>>>> >>>>> To enable these use cases, add the XDP_REDIRECT case to >>>>> dev_map_bpf_prog_run. Also, when this is called from inside >>>>> xdp_do_flush, the redirect might add further entries to the >>>>> flush lists that are currently being processed. Therefore, loop inside >>>>> xdp_do_flush until no further redirects are added. >>>>> >>>>> Signed-off-by: Florian Kauer <florian.kauer@xxxxxxxxxxxxx> >>>> >>>> This is an interesting use case! However, your implementation makes it >>>> way too easy to end up in a situation that loops forever, so I think we >>>> should add some protection against that.
Some details below: >>>> >>>>> --- >>>>> include/linux/bpf.h | 4 ++-- >>>>> include/trace/events/xdp.h | 10 ++++++---- >>>>> kernel/bpf/devmap.c | 37 +++++++++++++++++++++++++++-------- >>>>> net/core/filter.c | 48 +++++++++++++++++++++++++++------------------- >>>>> 4 files changed, 65 insertions(+), 34 deletions(-) >>>>> >>>>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h >>>>> index 3b94ec161e8c..1b57cbabf199 100644 >>>>> --- a/include/linux/bpf.h >>>>> +++ b/include/linux/bpf.h >>>>> @@ -2498,7 +2498,7 @@ struct sk_buff; >>>>> struct bpf_dtab_netdev; >>>>> struct bpf_cpu_map_entry; >>>>> >>>>> -void __dev_flush(struct list_head *flush_list); >>>>> +void __dev_flush(struct list_head *flush_list, int *redirects); >>>>> int dev_xdp_enqueue(struct net_device *dev, struct xdp_frame *xdpf, >>>>> struct net_device *dev_rx); >>>>> int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_frame *xdpf, >>>>> @@ -2740,7 +2740,7 @@ static inline struct bpf_token *bpf_token_get_from_fd(u32 ufd) >>>>> return ERR_PTR(-EOPNOTSUPP); >>>>> } >>>>> >>>>> -static inline void __dev_flush(struct list_head *flush_list) >>>>> +static inline void __dev_flush(struct list_head *flush_list, int *redirects) >>>>> { >>>>> } >>>>> >>>>> diff --git a/include/trace/events/xdp.h b/include/trace/events/xdp.h >>>>> index a7e5452b5d21..fba2c457e727 100644 >>>>> --- a/include/trace/events/xdp.h >>>>> +++ b/include/trace/events/xdp.h >>>>> @@ -269,9 +269,9 @@ TRACE_EVENT(xdp_devmap_xmit, >>>>> >>>>> TP_PROTO(const struct net_device *from_dev, >>>>> const struct net_device *to_dev, >>>>> - int sent, int drops, int err), >>>>> + int sent, int drops, int redirects, int err), >>>>> >>>>> - TP_ARGS(from_dev, to_dev, sent, drops, err), >>>>> + TP_ARGS(from_dev, to_dev, sent, drops, redirects, err), >>>>> >>>>> TP_STRUCT__entry( >>>>> __field(int, from_ifindex) >>>>> @@ -279,6 +279,7 @@ TRACE_EVENT(xdp_devmap_xmit, >>>>> __field(int, to_ifindex) >>>>> __field(int, drops) >>>>> __field(int, sent) >>>>> + __field(int, redirects) >>>>> __field(int, err) >>>>> ), >>>>> >>>>> @@ -288,16 +289,17 @@ TRACE_EVENT(xdp_devmap_xmit, >>>>> __entry->to_ifindex = to_dev->ifindex; >>>>> __entry->drops = drops; >>>>> __entry->sent = sent; >>>>> + __entry->redirects = redirects; >>>>> __entry->err = err; >>>>> ), >>>>> >>>>> TP_printk("ndo_xdp_xmit" >>>>> " from_ifindex=%d to_ifindex=%d action=%s" >>>>> - " sent=%d drops=%d" >>>>> + " sent=%d drops=%d redirects=%d" >>>>> " err=%d", >>>>> __entry->from_ifindex, __entry->to_ifindex, >>>>> __print_symbolic(__entry->act, __XDP_ACT_SYM_TAB), >>>>> - __entry->sent, __entry->drops, >>>>> + __entry->sent, __entry->drops, __entry->redirects, >>>>> __entry->err) >>>>> ); >>>>> >>>>> diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c >>>>> index 7878be18e9d2..89bdec49ea40 100644 >>>>> --- a/kernel/bpf/devmap.c >>>>> +++ b/kernel/bpf/devmap.c >>>>> @@ -334,7 +334,8 @@ static int dev_map_hash_get_next_key(struct bpf_map *map, void *key, >>>>> static int dev_map_bpf_prog_run(struct bpf_prog *xdp_prog, >>>>> struct xdp_frame **frames, int n, >>>>> struct net_device *tx_dev, >>>>> - struct net_device *rx_dev) >>>>> + struct net_device *rx_dev, >>>>> + int *redirects) >>>>> { >>>>> struct xdp_txq_info txq = { .dev = tx_dev }; >>>>> struct xdp_rxq_info rxq = { .dev = rx_dev }; >>>>> @@ -359,6 +360,13 @@ static int dev_map_bpf_prog_run(struct bpf_prog *xdp_prog, >>>>> else >>>>> frames[nframes++] = xdpf; >>>>> break; >>>>> + case XDP_REDIRECT: >>>>> + err = xdp_do_redirect(rx_dev, &xdp, xdp_prog); 
>>>>> + if (unlikely(err)) >>>>> + xdp_return_frame_rx_napi(xdpf); >>>>> + else >>>>> + *redirects += 1; >>>>> + break; >>>> It's a bit subtle, but dev_map_bpf_prog_run() also filters the list of >>>> frames in the queue in-place (the frames[nframes++] = xdpf; line above), >>>> which only works under the assumption that the array in bq->q is not >>>> modified while this loop is being run. But now you're adding a call in >>>> the middle that may result in the packet being put back on the same >>>> queue while it is being iterated, which means that this assumption no longer holds. >>>> >>>> So you need to clear the bq->q queue first for this to work. >>>> Specifically, at the start of bq_xmit_all(), you'll need to first copy >>>> all the packet pointers onto an on-stack array, then run the rest of the >>>> function on that array. There's already an initial loop that goes >>>> through all the frames, so you can just do it there. >>>> >>>> So the loop at the start of bq_xmit_all() goes from the current: >>>> >>>> for (i = 0; i < cnt; i++) { >>>> struct xdp_frame *xdpf = bq->q[i]; >>>> >>>> prefetch(xdpf); >>>> } >>>> >>>> >>>> to something like: >>>> >>>> struct xdp_frame *frames[DEV_MAP_BULK_SIZE]; >>>> >>>> for (i = 0; i < cnt; i++) { >>>> struct xdp_frame *xdpf = bq->q[i]; >>>> >>>> prefetch(xdpf); >>>> frames[i] = xdpf; >>>> } >>>> >>>> bq->count = 0; /* bq is now empty, use the 'frames' and 'cnt' >>>> stack variables for the rest of the function */ >>>> >>>> >>>> >>>>> default: >>>>> bpf_warn_invalid_xdp_action(NULL, xdp_prog, act); >>>>> fallthrough; >>>>> @@ -373,7 +381,7 @@ static int dev_map_bpf_prog_run(struct bpf_prog *xdp_prog, >>>>> return nframes; /* sent frames count */ >>>>> } >>>>> >>>>> -static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags) >>>>> +static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags, int *redirects) >>>>> { >>>>> struct net_device *dev = bq->dev; >>>>> unsigned int cnt = bq->count; >>>>> @@ -390,8 +398,10 @@ static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags) >>>>> prefetch(xdpf); >>>>> } >>>>> >>>>> + int new_redirects = 0; >>>>> if (bq->xdp_prog) { >>>>> - to_send = dev_map_bpf_prog_run(bq->xdp_prog, bq->q, cnt, dev, bq->dev_rx); >>>>> + to_send = dev_map_bpf_prog_run(bq->xdp_prog, bq->q, cnt, dev, bq->dev_rx, >>>>> + &new_redirects); >>>>> if (!to_send) >>>>> goto out; >>>>> } >>>>> @@ -413,19 +423,21 @@ static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags) >>>>> >>>>> out: >>>>> bq->count = 0; >>>>> - trace_xdp_devmap_xmit(bq->dev_rx, dev, sent, cnt - sent, err); >>>>> + *redirects += new_redirects; >>>>> + trace_xdp_devmap_xmit(bq->dev_rx, dev, sent, cnt - sent - new_redirects, >>>>> + new_redirects, err); >>>>> } >>>>> >>>>> /* __dev_flush is called from xdp_do_flush() which _must_ be signalled from the >>>>> * driver before returning from its napi->poll() routine. See the comment above >>>>> * xdp_do_flush() in filter.c.
>>>>> */ >>>>> -void __dev_flush(struct list_head *flush_list) >>>>> +void __dev_flush(struct list_head *flush_list, int *redirects) >>>>> { >>>>> struct xdp_dev_bulk_queue *bq, *tmp; >>>>> >>>>> list_for_each_entry_safe(bq, tmp, flush_list, flush_node) { >>>>> - bq_xmit_all(bq, XDP_XMIT_FLUSH); >>>>> + bq_xmit_all(bq, XDP_XMIT_FLUSH, redirects); >>>>> bq->dev_rx = NULL; >>>>> bq->xdp_prog = NULL; >>>>> __list_del_clearprev(&bq->flush_node); >>>>> @@ -458,8 +470,17 @@ static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf, >>>>> { >>>>> struct xdp_dev_bulk_queue *bq = this_cpu_ptr(dev->xdp_bulkq); >>>>> >>>>> - if (unlikely(bq->count == DEV_MAP_BULK_SIZE)) >>>>> - bq_xmit_all(bq, 0); >>>>> + if (unlikely(bq->count == DEV_MAP_BULK_SIZE)) { >>>>> + int redirects = 0; >>>>> + >>>>> + bq_xmit_all(bq, 0, &redirects); >>>>> + >>>>> + /* according to comment above xdp_do_flush() in >>>>> + * filter.c, xdp_do_flush is always called at >>>>> + * the end of the NAPI anyway, so no need to act >>>>> + * on the redirects here >>>>> + */ >>>> While it's true that it will be called again in NAPI, the purpose of >>>> calling bq_xmit_all() here is to make room for the packet on the >>>> bulk queue that we're about to enqueue, and if bq_xmit_all() can just >>>> put the packet back on the queue, there is no guarantee that this will >>>> succeed. So you will have to handle that case here. >>>> >>>> Since there's also a potential infinite recursion issue in the >>>> do_flush() functions below, I think it may be better to handle this >>>> looping issue inside bq_xmit_all(). >>>> >>>> I.e., structure the code so that bq_xmit_all() guarantees that when it >>>> returns it has actually done its job; that is, that bq->q is empty. >>>> >>>> Given the above "move all frames out of bq->q at the start" change, this >>>> is not all that hard. Simply add a check after the out: label (in >>>> bq_xmit_all()) for whether bq->count is actually 0, and if it isn't, >>>> start over from the beginning of that function. This also makes it >>>> straightforward to add a recursion limit; after looping a set number of >>>> times (say, XMIT_RECURSION_LIMIT), simply turn XDP_REDIRECT into drops. >>>> >>>> There will need to be some additional protection against looping forever >>>> in __dev_flush(), to handle the case where a packet is looped between >>>> two interfaces. This one is a bit trickier, but a similar recursion >>>> counter could be used, I think. >>> >>> >>> Thanks a lot for the extensive support! >>> Regarding __dev_flush(), could the following already be sufficient?
>>> >>> void __dev_flush(struct list_head *flush_list) >>> { >>> struct xdp_dev_bulk_queue *bq, *tmp; >>> int i, j; >>> >>> for (i = 0; !list_empty(flush_list) && i < XMIT_RECURSION_LIMIT; i++) { >>> /* go through list in reverse so that new items >>> * added to the flush_list will only be handled >>> * in the next iteration of the outer loop >>> */ >>> list_for_each_entry_safe_reverse(bq, tmp, flush_list, flush_node) { >>> bq_xmit_all(bq, XDP_XMIT_FLUSH); >>> bq->dev_rx = NULL; >>> bq->xdp_prog = NULL; >>> __list_del_clearprev(&bq->flush_node); >>> } >>> } >>> >>> if (i == XMIT_RECURSION_LIMIT) { >>> pr_warn("XDP recursion limit hit, expect packet loss!\n"); >>> >>> list_for_each_entry_safe(bq, tmp, flush_list, flush_node) { >>> for (j = 0; j < bq->count; j++) >>> xdp_return_frame_rx_napi(bq->q[j]); >>> >>> bq->count = 0; >>> bq->dev_rx = NULL; >>> bq->xdp_prog = NULL; >>> __list_del_clearprev(&bq->flush_node); >>> } >>> } >>> } >> >> Yeah, this would work, I think (neat trick with iterating the list in >> reverse!), but instead of the extra loop at the end, I would suggest >> passing in an extra 'allow_redirect' parameter to bq_xmit_all(). Since >> you'll already have to handle the recursion limit inside that function, >> you can just reuse the same functionality by passing in the parameter in >> the last iteration of the first loop. >> >> Also, definitely don't put an unconditional warn_on() in the fast path - >> that brings down the system really quickly if it's misconfigured :) >> >> A warn_on_once() could work, but really I would suggest just reusing the >> _trace_xdp_redirect_map_err() tracepoint with a new error code (just has >> to be one we're not currently using and that vaguely resembles what this >> is about; ELOOP, EOVERFLOW or EDEADLOCK, maybe?). > > > The 'allow_redirect' parameter is a very good idea! If I also forward that > to dev_map_bpf_prog_run and do something like the following, I can just > reuse the existing error handling. Or is that too implicit/too ugly? > > switch (act) { > case XDP_PASS: > err = xdp_update_frame_from_buff(&xdp, xdpf); > if (unlikely(err < 0)) > xdp_return_frame_rx_napi(xdpf); > else > frames[nframes++] = xdpf; > break; > case XDP_REDIRECT: > if (allow_redirect) { > err = xdp_do_redirect(rx_dev, &xdp, xdp_prog); > if (unlikely(err)) > xdp_return_frame_rx_napi(xdpf); > else > *redirects += 1; > break; > } else > fallthrough; > default: > bpf_warn_invalid_xdp_action(NULL, xdp_prog, act); > fallthrough; Yes, I was imagining something like this. Not sure if we want to turn it into an "invalid action" this way (it should probably be a separate tracepoint as I mentioned in the previous email). But otherwise, yeah! > case XDP_ABORTED: > trace_xdp_exception(tx_dev, xdp_prog, act); > fallthrough; > case XDP_DROP: > xdp_return_frame_rx_napi(xdpf); > break; > } > > >> >>>> In any case, this needs extensive selftests, including ones with devmap >>>> programs that loop packets (both redirect from a->a, and from >>>> a->b->a->b) to make sure the limits work correctly. >>> >>> >>> Good point! I am going to prepare some.
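For the a->a case, the BPF side of such a selftest could be as small as the sketch below (untested; the map name, section name and key are just placeholders, and it assumes user space stores the test interface's ifindex plus this program's fd in slot 0, so every frame is redirected straight back into the same bulk queue):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_DEVMAP);
	__uint(key_size, sizeof(__u32));
	__uint(value_size, sizeof(struct bpf_devmap_val));
	__uint(max_entries, 1);
} loop_map SEC(".maps");

/* devmap program that redirects every frame back into the devmap entry
 * it is attached to, so the frame keeps bouncing between the bulk queue
 * and the flush list until the recursion limit turns it into a drop
 */
SEC("xdp/devmap")
int devmap_redirect_self(struct xdp_md *ctx)
{
	return bpf_redirect_map(&loop_map, 0, 0);
}

char _license[] SEC("license") = "GPL";

The a->b->a->b case would then just be two such maps/programs on a veth pair pointing at each other.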
>>> >>> >>>> >>>>> + } >>>>> >>>>> /* Ingress dev_rx will be the same for all xdp_frame's in >>>>> * bulk_queue, because bq stored per-CPU and must be flushed >>>>> diff --git a/net/core/filter.c b/net/core/filter.c >>>>> index 8569cd2482ee..b33fc0b1444a 100644 >>>>> --- a/net/core/filter.c >>>>> +++ b/net/core/filter.c >>>>> @@ -4287,14 +4287,18 @@ static const struct bpf_func_proto bpf_xdp_adjust_meta_proto = { >>>>> void xdp_do_flush(void) >>>>> { >>>>> struct list_head *lh_map, *lh_dev, *lh_xsk; >>>>> + int redirect; >>>>> >>>>> - bpf_net_ctx_get_all_used_flush_lists(&lh_map, &lh_dev, &lh_xsk); >>>>> - if (lh_dev) >>>>> - __dev_flush(lh_dev); >>>>> - if (lh_map) >>>>> - __cpu_map_flush(lh_map); >>>>> - if (lh_xsk) >>>>> - __xsk_map_flush(lh_xsk); >>>>> + do { >>>>> + redirect = 0; >>>>> + bpf_net_ctx_get_all_used_flush_lists(&lh_map, &lh_dev, &lh_xsk); >>>>> + if (lh_dev) >>>>> + __dev_flush(lh_dev, &redirect); >>>>> + if (lh_map) >>>>> + __cpu_map_flush(lh_map); >>>>> + if (lh_xsk) >>>>> + __xsk_map_flush(lh_xsk); >>>>> + } while (redirect > 0); >>>>> } >>>>> EXPORT_SYMBOL_GPL(xdp_do_flush); >>>>> >>>>> @@ -4303,20 +4307,24 @@ void xdp_do_check_flushed(struct napi_struct *napi) >>>>> { >>>>> struct list_head *lh_map, *lh_dev, *lh_xsk; >>>>> bool missed = false; >>>>> + int redirect; >>>>> >>>>> - bpf_net_ctx_get_all_used_flush_lists(&lh_map, &lh_dev, &lh_xsk); >>>>> - if (lh_dev) { >>>>> - __dev_flush(lh_dev); >>>>> - missed = true; >>>>> - } >>>>> - if (lh_map) { >>>>> - __cpu_map_flush(lh_map); >>>>> - missed = true; >>>>> - } >>>>> - if (lh_xsk) { >>>>> - __xsk_map_flush(lh_xsk); >>>>> - missed = true; >>>>> - } >>>>> + do { >>>>> + redirect = 0; >>>>> + bpf_net_ctx_get_all_used_flush_lists(&lh_map, &lh_dev, &lh_xsk); >>>>> + if (lh_dev) { >>>>> + __dev_flush(lh_dev, &redirect); >>>>> + missed = true; >>>>> + } >>>>> + if (lh_map) { >>>>> + __cpu_map_flush(lh_map); >>>>> + missed = true; >>>>> + } >>>>> + if (lh_xsk) { >>>>> + __xsk_map_flush(lh_xsk); >>>>> + missed = true; >>>>> + } >>>>> + } while (redirect > 0); >>>> >>>> With the change suggested above (so that bq_xmit_all() guarantees the >>>> flush is completely done), this looping is not needed anymore. However, >>>> it becomes important in which *order* the flushing is done >>>> (__dev_flush() should always happen first), so adding a comment to note >>>> this would be good. >>> >>> >>> I guess, if we remove the loop here and we still want to cover the case of >>> redirecting from devmap program via cpumap, we need to fetch the lh_map again >>> after calling __dev_flush, right? >>> So I think we should no longer use bpf_net_ctx_get_all_used_flush_lists then: >>> >>> lh_dev = bpf_net_ctx_get_dev_flush_list(); >>> if (lh_dev) >>> __dev_flush(lh_dev); >>> lh_map = bpf_net_ctx_get_cpu_map_flush_list(); >>> if (lh_map) >>> __cpu_map_flush(lh_map); >>> lh_xsk = bpf_net_ctx_get_xskmap_flush_list(); >>> if (lh_xsk) >>> __xsk_map_flush(lh_xsk); >>> >>> But bpf_net_ctx_get_all_used_flush_lists also includes additional checks >>> for IS_ENABLED(CONFIG_BPF_SYSCALL) and IS_ENABLED(CONFIG_XDP_SOCKETS), >>> so I guess they should be directly included in the xdp_do_flush and >>> xdp_do_check_flushed? >>> Then we could remove the bpf_net_ctx_get_all_used_flush_lists. >>> Or do you have an idea for a more elegant solution? >> >> I think cpumap is fine because that doesn't immediately redirect back to >> the bulk queue; instead __cpu_map_flush() will put the frames on a >> per-CPU ring buffer and wake the kworker on that CPU. 
Which will in turn >> do another xdp_do_flush() if the cpumap program does a redirect. So in >> other words, we only need to handle recursion of devmap redirects :) > I am likely missing something here, but the scenario I am thinking about is the following: > > 1. cpu_map and xsk_map flush lists are empty, while the dev flush list has > a single frame pending, i.e. in the current implementation after > executing bpf_net_ctx_get_all_used_flush_lists: > lh_dev = something > lh_map = lh_xsk = NULL > > 2. __dev_flush gets called and executes a bpf program on the frame > that returns with "return bpf_redirect_map(&cpu_map, 0, 0);" > and that adds an item to the cpumap flush list. > > 3. Since __dev_flush is only able to handle devmap redirects itself, > the item is still on the cpumap flush list after __dev_flush > has returned. > > 4. lh_map, however, is still NULL, so __cpu_map_flush does not > get called and thus the kworker is never woken up. > > That is at least what I see on a running system: the kworker is never > woken up. But I might have done something else wrong... Ah, yes, I see what you mean. Yup, you're right, bpf_net_ctx_get_all_used_flush_lists() will no longer work, we'll have to get the others after __dev_flush().
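Something along these lines, I guess (rough, untested sketch; it assumes bq_xmit_all() ends up handling the redirect looping internally, so __dev_flush() and friends keep their current signatures, and that the per-list getters behave as in your snippet above):

void xdp_do_flush(void)
{
	struct list_head *lh;

	/* Flush devmaps first: a devmap program returning XDP_REDIRECT
	 * may add new entries to the cpumap/xskmap flush lists, so only
	 * pick those lists up afterwards.
	 */
	lh = bpf_net_ctx_get_dev_flush_list();
	if (lh)
		__dev_flush(lh);

	if (IS_ENABLED(CONFIG_BPF_SYSCALL)) {
		lh = bpf_net_ctx_get_cpu_map_flush_list();
		if (lh)
			__cpu_map_flush(lh);
	}

	if (IS_ENABLED(CONFIG_XDP_SOCKETS)) {
		lh = bpf_net_ctx_get_xskmap_flush_list();
		if (lh)
			__xsk_map_flush(lh);
	}
}

xdp_do_check_flushed() would need the same reordering, of course.

-Toke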