On Fri, Mar 3, 2023 at 4:37 AM Toke Høiland-Jørgensen <toke@xxxxxxxxxx> wrote:
>
> Magnus Karlsson <magnus.karlsson@xxxxxxxxx> writes:
>
> > On Mon, 27 Feb 2023 at 21:16, Stanislav Fomichev <sdf@xxxxxxxxxx> wrote:
> >>
> >> On Mon, Feb 27, 2023 at 6:17 AM Toke Høiland-Jørgensen <toke@xxxxxxxxxx> wrote:
> >> >
> >> > Stanislav Fomichev <sdf@xxxxxxxxxx> writes:
> >> >
> >> > > On Thu, Feb 23, 2023 at 3:22 PM Toke Høiland-Jørgensen <toke@xxxxxxxxxx> wrote:
> >> > >>
> >> > >> Stanislav Fomichev <sdf@xxxxxxxxxx> writes:
> >> > >>
> >> > >> > I'd like to discuss a potential follow up for the previous "XDP RX
> >> > >> > metadata" series [0].
> >> > >> >
> >> > >> > Now that we can access (a subset of) packet metadata at RX, I'd like to
> >> > >> > explore the options where we can export some of that metadata on TX. And
> >> > >> > also whether it might be possible to access some of the TX completion
> >> > >> > metadata (things like TX timestamp).
> >> > >> >
> >> > >> > I'm currently trying to understand whether the same approach I've used
> >> > >> > on RX could work at TX. By May I plan to have a bunch of options laid
> >> > >> > out (currently considering XSK tx/compl programs and XDP tx/compl
> >> > >> > programs) so we have something to discuss.
> >> > >>
> >> > >> I've been looking at ways of getting a TX-completion hook for the XDP
> >> > >> queueing stuff as well. For that, I think it could work to just hook
> >> > >> into xdp_return_frame(), but if you want to access hardware metadata
> >> > >> it'll obviously have to be in the driver. A hook in the driver could
> >> > >> certainly be used for the queueing return as well, though, which may
> >> > >> help making it worth the trouble :)
> >> > >
> >> > > Yeah, I'd like to get to completion descriptors ideally; so nothing
> >> > > better than a driver hook comes to mind so far :-(
> >> > > (I'm eye-balling mlx5's mlx5e_free_xdpsq_desc AF_XDP path mostly so far).
> >> >
> >> > Is there any other use case for this than getting the TX timestamp? Not
> >> > really sure what else those descriptors contain...
> >>
> >> I don't think so; at least looking at mlx5 and bnxt (the latter
> >> doesn't have a timestamp in the completion ring).
> >> So yeah, not sure, maybe that should be on the side and be AF_XDP specific.
> >> And not even involve bpf, just put the tx tstamp somewhere in umem:
> >> setsockopt(xsk_fd, SOL_XDP, XSK_STAMP_TX_COMPLETION,
> >> &data_relative_offset, ..);
> >> OTOH, if it is only a timestamp now, it doesn't mean that's all we'd
> >> have for eternity? (plus, this needs a driver "hook" for af_xdp
> >> anyway, so why not make it generic?)
> >>
> >> > >> > I'd like some more input on whether applying the same idea on TX
> >> > >> > makes sense or not and whether there are any sensible alternatives.
> >> > >> > (IIRC, there was an attempt to do XDP on egress that went nowhere).
> >> > >>
> >> > >> I believe that stranded because it was deemed not feasible to cover the
> >> > >> SKB TX path as well, which means it can't be symmetrical to the RX hook.
> >> > >> So we ended up with the in-devmap hook instead. I'm not sure if that's
> >> > >> made easier by multi-buf XDP, so that may be worth revisiting.
> >> > >>
> >> > >> For the TX metadata you don't really have to care about the skb path, I
> >> > >> suppose, so that may not matter too much either. However, at least for
> >> > >> the in-kernel xdp_frame the TX path is pushed from the stack anyway, so
> >> > >> I'm not sure if it's worth having a separate hook in the driver (with
> >> > >> all the added complexity and overhead that entails) just to set
> >> > >> metadata? That could just as well be done on push from higher up the
> >> > >> stack; per-driver kfuncs could still be useful for this, though.
> >> > >>
> >> > >> And of course something would be needed so that BPF programs can
> >> > >> process AF_XDP frames in the kernel before they hit the driver, but
> >> > >> again I'm not sure that needs to be a hook in the driver.
> >> > >
> >> > > Care to elaborate more on "push from higher up the stack"?
> >> >
> >> > I'm referring to the XDP_REDIRECT path here: xdp_frames are transmitted
> >> > by the stack calling ndo_xdp_xmit() in the driver with an array of
> >> > frames that are immediately put on the wire (see bq_xmit_all() in
> >> > devmap.c). So any metadata writing could be done at that point, since
> >> > the target driver is already known; there's even already a program hook
> >> > in there (used for in-devmap programs).
> >> >
> >> > > I've been thinking about mostly two cases:
> >> > > - XDP_TX - I think this one technically doesn't need an extra hook;
> >> > > all metadata manipulations can be done at xdp_rx? (however, not sure
> >> > > how real that is, since the descriptors are probably not exposed over
> >> > > there?)
> >> >
> >> > Well, to me XDP_REDIRECT is the most interesting one (see above). I
> >> > think we could even drop the XDP_TX case and only do this for
> >> > XDP_REDIRECT, since XDP_TX is basically a special-case optimisation.
> >> > I.e., it's possible to XDP_REDIRECT back to the same device, the frames
> >> > will just take a slight detour up through the stack; but that could also
> >> > be a good thing if it means we'll have to do less surgery to the drivers
> >> > to implement this for two paths.
> >> >
> >> > It does have the same challenge as you outlined above, though: At that
> >> > point the TX descriptor probably doesn't exist, so the driver NDO will
> >> > have to do something else with the data; but maybe we can solve that
> >> > without moving the hook into the driver itself somehow?
> >>
> >> Ah, ok, yeah, I was putting XDP_TX / XDP_REDIRECT under the same
> >> "transmit something out of xdp_rx hook" umbrella.
> >> We can maybe come up
> >> with a skb-like-private metadata layout (as we've discussed previously
> >> for skb) here as well? But not sure it would solve all the problems?
> >> I'm thinking of an af_xdp case where it wants to program something
> >> similar to tso/encap/tunneling offload (assuming af_xdp will get 4k+
> >> support)
> >
> > We have a patch set of this in the works. Need to finish the last
> > couple of tests then optimize performance and it is good to go. We
> > should be able to post it during the next cycle that starts next week.
>
> Uh, exciting, will look forward to seeing that! :)

+1, looking forward!

> -Toke