Stanislav Fomichev <sdf@xxxxxxxxxx> writes: > On Mon, Feb 27, 2023 at 3:54 PM Toke Høiland-Jørgensen <toke@xxxxxxxxxx> wrote: >> >> Stanislav Fomichev <sdf@xxxxxxxxxx> writes: >> >> > On Mon, Feb 27, 2023 at 6:17 AM Toke Høiland-Jørgensen <toke@xxxxxxxxxx> wrote: >> >> >> >> Stanislav Fomichev <sdf@xxxxxxxxxx> writes: >> >> >> >> > On Thu, Feb 23, 2023 at 3:22 PM Toke Høiland-Jørgensen <toke@xxxxxxxxxx> wrote: >> >> >> >> >> >> Stanislav Fomichev <sdf@xxxxxxxxxx> writes: >> >> >> >> >> >> > I'd like to discuss a potential follow up for the previous "XDP RX >> >> >> > metadata" series [0]. >> >> >> > >> >> >> > Now that we can access (a subset of) packet metadata at RX, I'd like to >> >> >> > explore the options where we can export some of that metadata on TX. And >> >> >> > also whether it might be possible to access some of the TX completion >> >> >> > metadata (things like TX timestamp). >> >> >> > >> >> >> > I'm currently trying to understand whether the same approach I've used >> >> >> > on RX could work at TX. By May I plan to have a bunch of options laid >> >> >> > out (currently considering XSK tx/compl programs and XDP tx/compl >> >> >> > programs) so we have something to discuss. >> >> >> >> >> >> I've been looking at ways of getting a TX-completion hook for the XDP >> >> >> queueing stuff as well. For that, I think it could work to just hook >> >> >> into xdp_return_frame(), but if you want to access hardware metadata >> >> >> it'll obviously have to be in the driver. A hook in the driver could >> >> >> certainly be used for the queueing return as well, though, which may >> >> >> help making it worth the trouble :) >> >> > >> >> > Yeah, I'd like to get to completion descriptors ideally; so nothing >> >> > better than a driver hook comes to mind so far :-( >> >> > (I'm eye-balling mlx5's mlx5e_free_xdpsq_desc AF_XDP path mostly so far). >> >> >> >> Is there any other use case for this than getting the TX timestamp? Not >> >> really sure what else those descriptors contain... >> > >> > I don't think so; at least looking at mlx5 and bnxt (the latter >> > doesn't have a timestamp in the completion ring). >> > So yeah, not sure, maybe that should be on the side and be AF_XDP specific. >> > And not even involve bpf, just put the tx tstamp somewhere in umem: >> > setsockopt(xsk_fd, SOL_XDP, XSK_STAMP_TX_COMPLETION, >> > &data_relative_offset, ..); >> > OTOH, if it is only a timestamp now, it doesn't mean that's all we'd >> > have for eternity? (plus, this needs a driver "hook" for af_xdp >> > anyway, so why not make it generic?) >> >> So since this is read-only in any case, we could just make it a >> tracepoint instead of a whole new hook? That's what I was planning to do >> for xdp_return_frame(); we just need a way to refer back to the original >> frame, but we'd need to solve that anyway. Just letting XDP/AF_XDP >> specify their own packet ID in some form, and make that part of the >> tracepoint data, would probably be sufficient? > > That would probably mean a driver specific tracepoint (since we might > need to get to the completion queue descriptor)? > Idk, probably still makes sense to have something that works across > different drivers? > Or are you suggesting to just do fentry/mlx5e_free_xdpsq_desc and go from there? > Not sure I can get a umem frame, as you've mentioned; and also it > doesn't look like cqe is there... No, we'd define the tracepoint in one place (like along with the others in include/xdp/trace/event.h), and have it include the fields we need (ifindex, queue index, timestamp, some kind of packet ID). And then add a call in each driver that will support this, wherever it makes sense in that driver. This is what we do with the trace_xdp_exception() calls already scattered through drivers in the RX handling code. With this, a BPF listener can attach to just the one tracepoint, and get events from all drivers that support it (but it can filter on ifindex for the events it's interested in). This probably doesn't scale indefinitely: it's possible to add new fields to the tracepoint if we discover other uses than the timestamp, but this would probably get unwieldy at some point. However, it's a lightweight way to add a "hook" if we don't have any other use cases than getting the timestamp for now. > I guess the fact that it would arrive out-of-band (not in a umem > frame) is a minor inconvenience, the userspace should be able to join > the data together hopefully. Yeah, that's why I suggested the user-defined packet ID: if the event just includes that, the BPF or userspace program can do its own matching depending on its needs... >> >> >> > I'd like to some more input on whether applying the same idea on TX >> >> >> > makes sense or not and whether there are any sensible alternatives. >> >> >> > (IIRC, there was an attempt to do XDP on egress that went nowhere). >> >> >> >> >> >> I believe that stranded because it was deemed not feasible to cover the >> >> >> SKB TX path as well, which means it can't be symmetrical to the RX hook. >> >> >> So we ended up with the in-devmap hook instead. I'm not sure if that's >> >> >> made easier by multi-buf XDP, so that may be worth revisiting. >> >> >> >> >> >> For the TX metadata you don't really have to care about the skb path, I >> >> >> suppose, so that may not matter too much either. However, at least for >> >> >> the in-kernel xdp_frame the TX path is pushed from the stack anyway, so >> >> >> I'm not sure if it's worth having a separate hook in the driver (with >> >> >> all the added complexity and overhead that entails) just to set >> >> >> metadata? That could just as well be done on push from higher up the >> >> >> stack; per-driver kfuncs could still be useful for this, though. >> >> >> >> >> >> And of course something would be needed so that that BPF programs can >> >> >> process AF_XDP frames in the kernel before they hit the driver, but >> >> >> again I'm not sure that needs to be a hook in the driver. >> >> > >> >> > Care to elaborate more on "push from higher up the stack"? >> >> >> >> I'm referring to the XDP_REDIRECT path here: xdp_frames are transmitted >> >> by the stack calling ndo_xdp_xmit() in the driver with an array of >> >> frames that are immediately put on the wire (see bq_xmit_all() in >> >> devmap.c). So any metadata writing could be done at that point, since >> >> the target driver is already known; there's even already a program hook >> >> in there (used for in-devmap programs). >> >> >> >> > I've been thinking about mostly two cases: >> >> > - XDP_TX - I think this one technically doesn't need an extra hook; >> >> > all metadata manipulations can be done at xdp_rx? (however, not sure >> >> > how real that is, since the descriptors are probably not exposed over >> >> > there?) >> >> >> >> Well, to me XDP_REDIRECT is the most interesting one (see above). I >> >> think we could even drop the XDP_TX case and only do this for >> >> XDP_REDIRECT, since XDP_TX is basically a special-case optimisation. >> >> I.e., it's possible to XDP_REDIRECT back to the same device, the frames >> >> will just take a slight detour up through the stack; but that could also >> >> be a good thing if it means we'll have to do less surgery to the drivers >> >> to implement this for two paths. >> >> >> >> It does have the same challenge as you outlined above, though: At that >> >> point the TX descriptor probably doesn't exist, so the driver NDO will >> >> have to do something else with the data; but maybe we can solve that >> >> without moving the hook into the driver itself somehow? >> > >> > Ah, ok, yeah, I was putting XDP_TX / XDP_REDIRECT under the same >> > "transmit something out of xdp_rx hook" umbrella. We can maybe come up >> > with a skb-like-private metadata layout (as we've discussed previously >> > for skb) here as well? But not sure it would solve all the problems? >> > I'm thinking of an af_xdp case where it wants to program something >> > similar to tso/encap/tunneling offload (assuming af_xdp will get 4k+ >> > support) or a checksum offload. Exposing access to the driver tx hooks >> > seems like the easiest way to get there? >> >> Well, I was thinking something like exposing driver kfuncs like what you >> implemented on RX, but having the program that calls them be the one in >> the existing devmap hook (each map entry is tied to a particular netdev, >> so we could use the same type of dev-bound logic as we do on RX). The >> driver wouldn't have a TX descriptor at this point, but it could have a >> driver-private area somewhere in the xdp_frame (if we make space for it) >> which the kfuncs could just write to in whichever format it wants, so >> that copying it to the descriptor later is just a memcpy(). > > Yeah, that sounds similar to what we've discussed for the xdp->skb > path, right? Maybe it should be even some kind of extension to that? > On rx, we stash a bunch of metadata in that private area. If the frame > goes to the kernel stack -> put it into skb; if the frame goes back to > the wire -> the driver can parse it? Yeah, not a bad idea to combine those. >> There would still need to be a new hook for AF_XDP, but I think it could >> have the same semantics as a devmap prog (i.e., an XDP program type that >> runs right before TX, but after the destination device is selected); we >> could attach it to the socket, maybe? Doesn't have to be in the driver >> itself (just before the driver ndo is called for zc - I think?). > > Sounds sensible; will try to explore more, thx! Awesome, looking forward to seeing what you come up with :) -Toke