On Sun, Sep 22, 2019 at 10:16 PM Jason Wang <jasowang@xxxxxxxxxx> wrote: > > > On 2019/9/23 上午11:18, Matt Cover wrote: > > On Sun, Sep 22, 2019 at 7:34 PM Jason Wang <jasowang@xxxxxxxxxx> wrote: > >> > >> On 2019/9/23 上午9:15, Matt Cover wrote: > >>> On Sun, Sep 22, 2019 at 5:51 PM Jason Wang <jasowang@xxxxxxxxxx> wrote: > >>>> On 2019/9/23 上午6:30, Matt Cover wrote: > >>>>> On Sun, Sep 22, 2019 at 1:36 PM Michael S. Tsirkin <mst@xxxxxxxxxx> wrote: > >>>>>> On Sun, Sep 22, 2019 at 10:43:19AM -0700, Matt Cover wrote: > >>>>>>> On Sun, Sep 22, 2019 at 5:37 AM Michael S. Tsirkin <mst@xxxxxxxxxx> wrote: > >>>>>>>> On Fri, Sep 20, 2019 at 11:58:43AM -0700, Matthew Cover wrote: > >>>>>>>>> Treat a negative return from a TUNSETSTEERINGEBPF bpf prog as a signal > >>>>>>>>> to fallback to tun_automq_select_queue() for tx queue selection. > >>>>>>>>> > >>>>>>>>> Compilation of this exact patch was tested. > >>>>>>>>> > >>>>>>>>> For functional testing 3 additional printk()s were added. > >>>>>>>>> > >>>>>>>>> Functional testing results (on 2 txq tap device): > >>>>>>>>> > >>>>>>>>> [Fri Sep 20 18:33:27 2019] ========== tun no prog ========== > >>>>>>>>> [Fri Sep 20 18:33:27 2019] tuntap: tun_ebpf_select_queue() returned '-1' > >>>>>>>>> [Fri Sep 20 18:33:27 2019] tuntap: tun_automq_select_queue() ran > >>>>>>>>> [Fri Sep 20 18:33:27 2019] ========== tun prog -1 ========== > >>>>>>>>> [Fri Sep 20 18:33:27 2019] tuntap: bpf_prog_run_clear_cb() returned '-1' > >>>>>>>>> [Fri Sep 20 18:33:27 2019] tuntap: tun_ebpf_select_queue() returned '-1' > >>>>>>>>> [Fri Sep 20 18:33:27 2019] tuntap: tun_automq_select_queue() ran > >>>>>>>>> [Fri Sep 20 18:33:27 2019] ========== tun prog 0 ========== > >>>>>>>>> [Fri Sep 20 18:33:27 2019] tuntap: bpf_prog_run_clear_cb() returned '0' > >>>>>>>>> [Fri Sep 20 18:33:27 2019] tuntap: tun_ebpf_select_queue() returned '0' > >>>>>>>>> [Fri Sep 20 18:33:27 2019] ========== tun prog 1 ========== > >>>>>>>>> [Fri Sep 20 18:33:27 2019] tuntap: bpf_prog_run_clear_cb() returned '1' > >>>>>>>>> [Fri Sep 20 18:33:27 2019] tuntap: tun_ebpf_select_queue() returned '1' > >>>>>>>>> [Fri Sep 20 18:33:27 2019] ========== tun prog 2 ========== > >>>>>>>>> [Fri Sep 20 18:33:27 2019] tuntap: bpf_prog_run_clear_cb() returned '2' > >>>>>>>>> [Fri Sep 20 18:33:27 2019] tuntap: tun_ebpf_select_queue() returned '0' > >>>>>>>>> > >>>>>>>>> Signed-off-by: Matthew Cover <matthew.cover@xxxxxxxxxxxxx> > >>>>>>>> Could you add a bit more motivation data here? > >>>>>>> Thank you for these questions Michael. > >>>>>>> > >>>>>>> I'll plan on adding the below information to the > >>>>>>> commit message and submitting a v2 of this patch > >>>>>>> when net-next reopens. In the meantime, it would > >>>>>>> be very helpful to know if these answers address > >>>>>>> some of your concerns. > >>>>>>> > >>>>>>>> 1. why is this a good idea > >>>>>>> This change allows TUNSETSTEERINGEBPF progs to > >>>>>>> do any of the following. > >>>>>>> 1. implement queue selection for a subset of > >>>>>>> traffic (e.g. special queue selection logic > >>>>>>> for ipv4, but return negative and use the > >>>>>>> default automq logic for ipv6) > >>>>>>> 2. determine there isn't sufficient information > >>>>>>> to do proper queue selection; return > >>>>>>> negative and use the default automq logic > >>>>>>> for the unknown > >>>>>>> 3. implement a noop prog (e.g. do > >>>>>>> bpf_trace_printk() then return negative and > >>>>>>> use the default automq logic for everything) > >>>>>>> > >>>>>>>> 2. how do we know existing userspace does not rely on existing behaviour > >>>>>>> Prior to this change a negative return from a > >>>>>>> TUNSETSTEERINGEBPF prog would have been cast > >>>>>>> into a u16 and traversed netdev_cap_txqueue(). > >>>>>>> > >>>>>>> In most cases netdev_cap_txqueue() would have > >>>>>>> found this value to exceed real_num_tx_queues > >>>>>>> and queue_index would be updated to 0. > >>>>>>> > >>>>>>> It is possible that a TUNSETSTEERINGEBPF prog > >>>>>>> return a negative value which when cast into a > >>>>>>> u16 results in a positive queue_index less than > >>>>>>> real_num_tx_queues. For example, on x86_64, a > >>>>>>> return value of -65535 results in a queue_index > >>>>>>> of 1; which is a valid queue for any multiqueue > >>>>>>> device. > >>>>>>> > >>>>>>> It seems unlikely, however as stated above is > >>>>>>> unfortunately possible, that existing > >>>>>>> TUNSETSTEERINGEBPF programs would choose to > >>>>>>> return a negative value rather than return the > >>>>>>> positive value which holds the same meaning. > >>>>>>> > >>>>>>> It seems more likely that future > >>>>>>> TUNSETSTEERINGEBPF programs would leverage a > >>>>>>> negative return and potentially be loaded into > >>>>>>> a kernel with the old behavior. > >>>>>> OK if we are returning a special > >>>>>> value, shouldn't we limit it? How about a special > >>>>>> value with this meaning? > >>>>>> If we are changing an ABI let's at least make it > >>>>>> extensible. > >>>>>> > >>>>> A special value with this meaning sounds > >>>>> good to me. I'll plan on adding a define > >>>>> set to -1 to cause the fallback to automq. > >>>> Can it really return -1? > >>>> > >>>> I see: > >>>> > >>>> static inline u32 bpf_prog_run_clear_cb(const struct bpf_prog *prog, > >>>> struct sk_buff *skb) > >>>> ... > >>>> > >>>> > >>>>> The way I was initially viewing the old > >>>>> behavior was that returning negative was > >>>>> undefined; it happened to have the > >>>>> outcomes I walked through, but not > >>>>> necessarily by design. > >>>> Having such fallback may bring extra troubles, it requires the eBPF > >>>> program know the existence of the behavior which is not a part of kernel > >>>> ABI actually. And then some eBPF program may start to rely on that which > >>>> is pretty dangerous. Note, one important consideration is to have > >>>> macvtap support where does not have any stuffs like automq. > >>>> > >>>> Thanks > >>>> > >>> How about we call this TUN_SSE_ABORT > >>> instead of TUN_SSE_DO_AUTOMQ? > >>> > >>> TUN_SSE_ABORT could be documented as > >>> falling back to the default queue > >>> selection method in either space > >>> (presumably macvtap has some queue > >>> selection method when there is no prog). > >> > >> This looks like a more complex API, we don't want userspace to differ > >> macvtap from tap too much. > >> > >> Thanks > >> > > This is barely more complex and provides > > similar to what is done in many places. > > For xdp, an XDP_PASS enacts what the > > kernel would do if there was no bpf prog. > > For tc cls in da mode, TC_ACT_OK enacts > > what the kernel would do if there was > > no bpf prog. For xt_bpf, false enacts > > what the kernel would do if there was > > no bpf prog (as long as negation > > isn't in play in the rule, I believe). > > > I think this is simply because you can't implement e.g > XDP_PASS/TC_ACT_OK through eBPF itself which is not the case of steering > prog here. > > > > > > I know that this is somewhat of an > > oversimplification and that each of > > these also means something else in > > the respective hookpoint, but I standby > > seeing value in this change. > > > > macvtap must have some default (i.e the > > action which it takes when no prog is > > loaded), even if that is just use queue > > 0. We can provide the same TUN_SSE_ABORT > > in userspace which does the same thing; > > enacts the default when returned. Any > > differences left between tap and macvtap > > would be in what the default is, not in > > these changes. And that difference already > > exists today. > > > I think it's better to safe to just drop the packet instead of trying to > workaround it. > This patch aside, dropping the packet here seems like the wrong choice. Loading a prog at this hookpoint "configures" steering. The action of configuring steering should not result in dropped packets. Suboptimal delivery is generally preferable to no delivery. Leaving the behavior as-is (i.e. relying on netdev_cap_txqueue()) or making any return which doesn't fit in a u16 simply use queue 0 would be highly preferable to dropping the packet. > Thanks > > > > > >>>>> In order to keep the new behavior > >>>>> extensible, how should we state that a > >>>>> negative return other than -1 is > >>>>> undefined and therefore subject to > >>>>> change. Is something like this > >>>>> sufficient? > >>>>> > >>>>> Documentation/networking/tc-actions-env-rules.txt > >>>>> > >>>>> Additionally, what should the new > >>>>> behavior implement when a negative other > >>>>> than -1 is returned? I would like to have > >>>>> it do the same thing as -1 for now, but > >>>>> with the understanding that this behavior > >>>>> is undefined. Does this sound reasonable? > >>>>> > >>>>>>>> 3. why doesn't userspace need a way to figure out whether it runs on a kernel with and > >>>>>>>> without this patch > >>>>>>> There may be some value in exposing this fact > >>>>>>> to the ebpf prog loader. What is the standard > >>>>>>> practice here, a define? > >>>>>> We'll need something at runtime - people move binaries between kernels > >>>>>> without rebuilding then. An ioctl is one option. > >>>>>> A sysfs attribute is another, an ethtool flag yet another. > >>>>>> A combination of these is possible. > >>>>>> > >>>>>> And if we are doing this anyway, maybe let userspace select > >>>>>> the new behaviour? This way we can stay compatible with old > >>>>>> userspace... > >>>>>> > >>>>> Understood. I'll look into adding an > >>>>> ioctl to activate the new behavior. And > >>>>> perhaps a method of checking which is > >>>>> behavior is currently active (in case we > >>>>> ever want to change the default, say > >>>>> after some suitably long transition > >>>>> period). > >>>>> > >>>>>>>> thanks, > >>>>>>>> MST > >>>>>>>> > >>>>>>>>> --- > >>>>>>>>> drivers/net/tun.c | 20 +++++++++++--------- > >>>>>>>>> 1 file changed, 11 insertions(+), 9 deletions(-) > >>>>>>>>> > >>>>>>>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c > >>>>>>>>> index aab0be4..173d159 100644 > >>>>>>>>> --- a/drivers/net/tun.c > >>>>>>>>> +++ b/drivers/net/tun.c > >>>>>>>>> @@ -583,35 +583,37 @@ static u16 tun_automq_select_queue(struct tun_struct *tun, struct sk_buff *skb) > >>>>>>>>> return txq; > >>>>>>>>> } > >>>>>>>>> > >>>>>>>>> -static u16 tun_ebpf_select_queue(struct tun_struct *tun, struct sk_buff *skb) > >>>>>>>>> +static int tun_ebpf_select_queue(struct tun_struct *tun, struct sk_buff *skb) > >>>>>>>>> { > >>>>>>>>> struct tun_prog *prog; > >>>>>>>>> u32 numqueues; > >>>>>>>>> - u16 ret = 0; > >>>>>>>>> + int ret = -1; > >>>>>>>>> > >>>>>>>>> numqueues = READ_ONCE(tun->numqueues); > >>>>>>>>> if (!numqueues) > >>>>>>>>> return 0; > >>>>>>>>> > >>>>>>>>> + rcu_read_lock(); > >>>>>>>>> prog = rcu_dereference(tun->steering_prog); > >>>>>>>>> if (prog) > >>>>>>>>> ret = bpf_prog_run_clear_cb(prog->prog, skb); > >>>>>>>>> + rcu_read_unlock(); > >>>>>>>>> > >>>>>>>>> - return ret % numqueues; > >>>>>>>>> + if (ret >= 0) > >>>>>>>>> + ret %= numqueues; > >>>>>>>>> + > >>>>>>>>> + return ret; > >>>>>>>>> } > >>>>>>>>> > >>>>>>>>> static u16 tun_select_queue(struct net_device *dev, struct sk_buff *skb, > >>>>>>>>> struct net_device *sb_dev) > >>>>>>>>> { > >>>>>>>>> struct tun_struct *tun = netdev_priv(dev); > >>>>>>>>> - u16 ret; > >>>>>>>>> + int ret; > >>>>>>>>> > >>>>>>>>> - rcu_read_lock(); > >>>>>>>>> - if (rcu_dereference(tun->steering_prog)) > >>>>>>>>> - ret = tun_ebpf_select_queue(tun, skb); > >>>>>>>>> - else > >>>>>>>>> + ret = tun_ebpf_select_queue(tun, skb); > >>>>>>>>> + if (ret < 0) > >>>>>>>>> ret = tun_automq_select_queue(tun, skb); > >>>>>>>>> - rcu_read_unlock(); > >>>>>>>>> > >>>>>>>>> return ret; > >>>>>>>>> } > >>>>>>>>> -- > >>>>>>>>> 1.8.3.1