Florian Kauer <florian.kauer@xxxxxxxxxxxxx> writes:

> Hi Toke,
> thanks a lot for the prompt response!
>
> On 7/4/24 13:20, Toke Høiland-Jørgensen wrote:
>> Florian Kauer <florian.kauer@xxxxxxxxxxxxx> writes:
>>
>>> Hi,
>>> we are currently using bpf_redirect_map with BPF_F_BROADCAST to
>>> replicate frames for sending traffic over redundant paths.
>>>
>>> See for example https://www.rfc-editor.org/rfc/rfc8655.html#section-3.2.2.2
>>> for background and
>>> https://github.com/EricssonResearch/xdpfrer/blob/5f0845cb2e4c4da325f0e77df3020526ad992aff/src/xdpfrer.bpf.c#L393
>>> for the current implementation.
>>>
>>> However, we want to modify the frame after the replication. In the
>>> easiest case this means changing the VLAN tag to route the traffic
>>> over different VLANs. This is currently done by taking a different
>>> egress_ifindex into account after the replication, and that works
>>> well so far
>>> ( https://github.com/EricssonResearch/xdpfrer/blob/5f0845cb2e4c4da325f0e77df3020526ad992aff/src/xdpfrer.bpf.c#L399 ).
>>>
>>> BUT there are cases where the egress interface for both replicated
>>> packets shall be the same, and the different path of the replicated
>>> frames is only taken on a subsequent switch based on a different
>>> VLAN tag. So how can the XDP program differentiate between the
>>> replicated frames if the egress interface is the same?
>>>
>>> So my basic idea would be to add two (or more) DEVMAP entries with
>>> the same ifindex into the same map. And then either
>>>
>>> 1. Add several xdp/devmap progs to the same loaded bpf and reference
>>> them in the DEVMAP entries, like
>>>
>>> SEC("xdp/devmap")
>>> int replicate_postprocessing_first(struct xdp_md *pkt)
>>> {
>>>         int ret = change_vlan(pkt, 0, true);
>>>         ...
>>> }
>>>
>>> SEC("xdp/devmap")
>>> int replicate_postprocessing_second(struct xdp_md *pkt)
>>> {
>>>         int ret = change_vlan(pkt, 1, true);
>>>         ...
>>> }
>>>
>>> This, however, would be quite inflexible.
>>
>> Having multiple entries in the devmap corresponds roughly to how the
>> stack handles VLANs. I.e., when configuring a VLAN, you create a new
>> netdevice (which you would then put into the devmap). Unfortunately,
>> XDP doesn't really know how to deal with stacked devices like VLANs,
>> so you can't actually add a VLAN device into a devmap. But creating
>> an interface for this would be one way of dealing with a situation
>> like this, without having to hardcode things into a BPF program.
>
> I see. That would be very nice in general, but for our specific
> application it would likely still be too inflexible to refer to a
> different VLAN interface (e.g. in addition to changing the VLAN tag
> we also might want to add/remove/modify MPLS labels and other
> headers).

Yeah, part of the reason this stacked device support never
materialised is that the way the kernel handles it requires a lot of
indirection (MPLS+VLAN would be two stacked devices on top of each
other, say), and for XDP you probably want to short-circuit this
anyway. So that means rolling your own abstraction here; but you can
do that with the custom devmap approach :)

>>> 2. Load the same bpf several times without attaching it to an
>>> interface and set e.g. a const to a different value after loading.
>>
>> This would work, I think. Something like:
>>
>> static volatile const __u32 vlan_id = 1;
>>
>> SEC("xdp/devmap")
>> int replicate_postprocessing_second(struct xdp_md *pkt)
>> {
>>         int ret = change_vlan(pkt, vlan_id, true);
>>         ...
>> }
>>
>> and then the loader would replace the value of vlan_id before
>> loading; using skeletons this would look something like:
>>
>> skel = xdp_program_skeleton__open();
>> skel->rodata->vlan_id = 2;
>> xdp_program_skeleton__load(skel);
>>
>> /* attach to devmap */
>
> Yes, that is exactly what I was imagining, thanks!
>
>>
>>> But can I even reference a xdp/devmap prog from a different loaded
>>> bpf, especially when it is not attached?
>>
>> Why do you need to reference it from a different BPF program? The
>> userspace program just attaches it to the right devmap entry?
>
> What I wanted to imply with this is that the lifetime of the
> different BPF programs is unclear to me. So AFAIK (but I might be
> totally wrong) an XDP program has the lifetime of the process that
> loaded the program (i.e. called xdp_program_skeleton__load()), so it
> is destroyed/unloaded as soon as the process ends, UNLESS it is
> explicitly attached to an interface, by which it inherits the
> lifetime of the interface it was attached to (i.e. it might outlive
> the loading process).
>
> If I do what you sketched above, where we do not attach the program
> explicitly to an interface but only directly to the devmap, does it
> then inherit the lifetime of the interface that the BPF program
> owning the devmap is attached to, or is it destroyed as soon as the
> loading process ends?

Ah, right. Devmap attachment works similarly to netlink attachment to
an interface: you pass in an fd as part of the bpf_map_update() from
userspace, and the devmap takes a reference on the BPF program when
inserting it into the map. So a BPF program inserted this way stays
alive for as long as the map does, unless it is explicitly removed by
another bpf_map_update().

Normally, the map itself will be referenced by the XDP program running
on ingress, so as long as that stays attached, the map will stay
alive. And when the (ingress) XDP program is removed, it will release
its reference on the map, which in turn will cause the map to release
all references to the other BPF programs inserted into it.

So in other words, there is no direct reference between the two XDP
programs, but things should mostly keep working the way you expect
them to...
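Concretely, the userspace side of that could look roughly like the
sketch below (error handling omitted; devmap_fd, egress_ifindex and
the map key are placeholders standing in for whatever the actual
loader uses, and the skeleton/program names follow the example
above):

        /* Load one instance of the devmap program with its own
         * vlan_id and insert it into the broadcast devmap.
         */
        struct xdp_program_skeleton *skel = xdp_program_skeleton__open();
        struct bpf_devmap_val val = {};
        __u32 key = 0;   /* devmap index of this replica */

        skel->rodata->vlan_id = 2;      /* per-instance value, set before load */
        xdp_program_skeleton__load(skel);

        val.ifindex = egress_ifindex;   /* may be the same for several entries */
        val.bpf_prog.fd =
                bpf_program__fd(skel->progs.replicate_postprocessing_second);

        /* The devmap takes its own reference on the program here, so
         * it stays loaded for as long as the map entry exists, even
         * after the loading process exits.
         */
        bpf_map_update_elem(devmap_fd, &key, &val, BPF_ANY);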
>>> 3. Extend the kernel with a way to let the xdp/devmap prog know
>>> from which DEVMAP entry its execution originates (like an
>>> additional entry in the bpf_devmap_val that is then set in the
>>> xdp_md).
>>
>> This could be useful in any case, so I would personally be fine with
>> adding something like this (for both devmap and cpumap) :)
>
> Would you prefer a simple u32 (or similar) that could then be used as
> an index for an array, or a more complex data structure/void* to fill
> with arbitrary data?

Well, the API for map indexing specifies a u64 map index, so just
reusing that would be the simplest :)

-Toke
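For reference, a minimal sketch of the BPF side of the per-instance
approach discussed above, assuming the replicated frame already
carries a single 802.1Q tag whose VLAN ID just needs rewriting; this
is a simplified stand-in for the change_vlan() helper in the xdpfrer
code linked above, not the actual implementation:

        #include <linux/bpf.h>
        #include <linux/if_ether.h>
        #include <bpf/bpf_helpers.h>
        #include <bpf/bpf_endian.h>

        /* Local definitions; struct vlan_hdr and VLAN_VID_MASK are
         * not part of the UAPI headers.
         */
        struct vlan_hdr {
                __be16 h_vlan_TCI;
                __be16 h_vlan_encapsulated_proto;
        };
        #define VLAN_VID_MASK 0x0fff

        /* Overridden per instance via skel->rodata before loading. */
        static volatile const __u32 vlan_id = 1;

        SEC("xdp/devmap")
        int replicate_postprocessing(struct xdp_md *ctx)
        {
                void *data_end = (void *)(long)ctx->data_end;
                void *data = (void *)(long)ctx->data;
                struct ethhdr *eth = data;
                struct vlan_hdr *vlan = (void *)(eth + 1);

                if ((void *)(eth + 1) > data_end)
                        return XDP_DROP;

                if (eth->h_proto != bpf_htons(ETH_P_8021Q))
                        return XDP_PASS;        /* untagged: leave it alone */

                if ((void *)(vlan + 1) > data_end)
                        return XDP_DROP;

                /* Keep the PCP/DEI bits, replace only the 12-bit VLAN ID. */
                vlan->h_vlan_TCI = bpf_htons(
                        (bpf_ntohs(vlan->h_vlan_TCI) & ~VLAN_VID_MASK) |
                        (vlan_id & VLAN_VID_MASK));

                return XDP_PASS;
        }

        char _license[] SEC("license") = "GPL";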