> This adds a small internal mapping table so that a new bpf (xdp) kfunc > can perform lookups in a flowtable. > > I have no intent to push this without nft integration of the xdp program, > this RFC is just to get comments on the general direction because there > is a chicken/egg issue: > > As-is, xdp program has access to the device pointer, but no way to do a > lookup in a flowtable -- there is no way to obtain the needed struct > without whacky stunts. > > This would allow to such lookup just from device address: the bpf > kfunc would call nf_flowtable_by_dev() internally. > > Limitation: > > A device cannot be added to multiple flowtables, the mapping needs > to be unique. > > As for integration with the kernel, there are several options: > > 1. Auto-add to the dev-xdp table whenever HW offload is requested. > > 2. Add to the dev-xdp table, but only if the HW offload request fails. > (softfallback). > > 3. add a dedicated 'xdp offload' flag to UAPI. > > 3) should not be needed, because userspace already controls this: > to make it work userspace needs to attach the xdp program to the > network device in the first place. > > My thinking is to add a xdp-offload flag to the nft grammer only. > Its not needed on nf uapi side and it would tell nft to attach the xdp > flowtable forward program to the devices listed in the flowtable. > > Also, packet flow is altered (qdiscs is bypassed), which is a strong > argument against default-usage. > > The xdp prog source would be included with nftables.git and nft > would either attach/detach them or ship an extra prog that does this (TBD). Hi Florian, thx for working on this, I tested this patch with the flowtable lookup kfunc I am working on (code is available here [0]) and it works properly. Regards, Lorenzo [0] https://github.com/LorenzoBianconi/bpf-next/tree/xdp-flowtable-kfunc > > Open questions: > > Do we need to support dev-in-multiple flowtables? I would like to > avoid this, this likely means the future "xdp" flag in nftables would > be restricted to "inet" family. Alternative would be to change the key to > 'device address plus protocol family', the xdp prog could derive that from the > packet data. > > Timeout handling. Should the XDP program even bother to refresh the > flowtable timeout? I was assuming the flowtable lookup kfunc can take care of it. > > It might make more sense to intentionally have packets > flow through the normal path periodically so neigh entries are up to date. > > Also note that flow_offload_xdp struct likely needs to have a refcount > or genmask so that it integrates with the two-phase commit protocol on > netfilter side > (i.e., we should allow 're-add' because its needed to make flush+reload > work). > > Not SoB, too raw for my taste. > > CC: Lorenzo Bianconi <lorenzo@xxxxxxxxxx> > --- > net/netfilter/nf_flow_table_offload.c | 131 +++++++++++++++++++++++++- > 1 file changed, 130 insertions(+), 1 deletion(-) > > diff --git a/net/netfilter/nf_flow_table_offload.c b/net/netfilter/nf_flow_table_offload.c > index a010b25076ca..10313d296a8a 100644 > --- a/net/netfilter/nf_flow_table_offload.c > +++ b/net/netfilter/nf_flow_table_offload.c > @@ -17,6 +17,92 @@ static struct workqueue_struct *nf_flow_offload_add_wq; > static struct workqueue_struct *nf_flow_offload_del_wq; > static struct workqueue_struct *nf_flow_offload_stats_wq; > > +struct flow_offload_xdp { > + struct hlist_node hnode; > + > + unsigned long net_device_addr; > + struct nf_flowtable *ft; > + > + struct rcu_head rcuhead; > +}; > + > +#define NF_XDP_HT_BITS 4 > +static DEFINE_HASHTABLE(nf_xdp_hashtable, NF_XDP_HT_BITS); > +static DEFINE_MUTEX(nf_xdp_hashtable_lock); > + > +/* caller must hold rcu read lock */ > +struct nf_flowtable *nf_flowtable_by_dev(const struct net_device *dev) > +{ I think this routine needs to be added to some include file (e.g. include/net/netfilter/nf_flow_table.h) > + unsigned long key = (unsigned long)dev; > + const struct flow_offload_xdp *cur; > + > + hash_for_each_possible_rcu(nf_xdp_hashtable, cur, hnode, key) { > + if (key == cur->net_device_addr) > + return cur->ft; > + } > + > + return NULL; > +} > + > +static int nf_flowtable_by_dev_insert(struct nf_flowtable *ft, > + const struct net_device *dev) > +{ > + unsigned long key = (unsigned long)dev; > + struct flow_offload_xdp *cur; > + int err = 0; > + > + mutex_lock(&nf_xdp_hashtable_lock); > + hash_for_each_possible(nf_xdp_hashtable, cur, hnode, key) { > + if (key != cur->net_device_addr) > + continue; > + err = -EEXIST; > + break; > + } > + > + if (err == 0) { > + struct flow_offload_xdp *new; > + > + new = kzalloc(sizeof(*new), GFP_KERNEL); > + if (new) { > + new->net_device_addr = key; > + new->ft = ft; > + > + hash_add_rcu(nf_xdp_hashtable, &new->hnode, key); > + } else { > + err = -ENOMEM; > + } > + } > + > + mutex_unlock(&nf_xdp_hashtable_lock); > + > + DEBUG_NET_WARN_ON_ONCE(err == 0 && nf_flowtable_by_dev(dev) != ft); > + > + return err; > +} > + > +static void nf_flowtable_by_dev_remove(const struct net_device *dev) > +{ > + unsigned long key = (unsigned long)dev; > + struct flow_offload_xdp *cur; > + bool found = false; > + > + mutex_lock(&nf_xdp_hashtable_lock); > + > + hash_for_each_possible(nf_xdp_hashtable, cur, hnode, key) { > + if (key != cur->net_device_addr) > + continue; > + > + hash_del_rcu(&cur->hnode); > + kfree_rcu(cur, rcuhead); > + found = true; > + break; > + } > + > + mutex_unlock(&nf_xdp_hashtable_lock); > + > + WARN_ON_ONCE(!found); > +} > + > struct flow_offload_work { > struct list_head list; > enum flow_cls_command cmd; > @@ -1183,6 +1269,38 @@ static int nf_flow_table_offload_cmd(struct flow_block_offload *bo, > return 0; > } > > +static int nf_flow_offload_xdp_setup(struct nf_flowtable *flowtable, > + struct net_device *dev, > + enum flow_block_command cmd) > +{ > + switch (cmd) { > + case FLOW_BLOCK_BIND: > + return nf_flowtable_by_dev_insert(flowtable, dev); > + case FLOW_BLOCK_UNBIND: > + nf_flowtable_by_dev_remove(dev); > + return 0; > + } > + > + WARN_ON_ONCE(1); > + return 0; > +} > + > +static void nf_flow_offload_xdp_cancel(struct nf_flowtable *flowtable, > + struct net_device *dev, > + enum flow_block_command cmd) > +{ > + switch (cmd) { > + case FLOW_BLOCK_BIND: > + nf_flowtable_by_dev_remove(dev); > + return; > + case FLOW_BLOCK_UNBIND: > + /* We do not re-bind in case hw offload would report error > + * on *unregister*. > + */ > + break; > + } > +} > + > int nf_flow_table_offload_setup(struct nf_flowtable *flowtable, > struct net_device *dev, > enum flow_block_command cmd) > @@ -1191,6 +1309,15 @@ int nf_flow_table_offload_setup(struct nf_flowtable *flowtable, > struct flow_block_offload bo; > int err; > > + /* XXX: > + * > + * XDP offload could be made 'never fails', as xdp > + * frames that don't match are simply passed up to > + * normal nf hooks (skb sw flowtable), or to stack. > + */ > + if (nf_flow_offload_xdp_setup(flowtable, dev, cmd)) > + return -EBUSY; > + > if (!nf_flowtable_hw_offload(flowtable)) > return 0; > > @@ -1200,8 +1327,10 @@ int nf_flow_table_offload_setup(struct nf_flowtable *flowtable, > else > err = nf_flow_table_indr_offload_cmd(&bo, flowtable, dev, cmd, > &extack); > - if (err < 0) > + if (err < 0) { > + nf_flow_offload_xdp_cancel(flowtable, dev, cmd); > return err; > + } > > return nf_flow_table_block_setup(flowtable, &bo, cmd); > } > -- > 2.41.0 >
Attachment:
signature.asc
Description: PGP signature