On Thu, 25 Jan 2018 12:28:58 +0100, Pablo Neira Ayuso wrote:
> On Wed, Jan 24, 2018 at 05:31:36PM -0800, Jakub Kicinski wrote:
> > On Thu, 25 Jan 2018 01:09:41 +0100, Pablo Neira Ayuso wrote:
> > > This patch adds the infrastructure to offload flows to hardware, in
> > > case the nic/switch comes with built-in flow table capabilities.
> > >
> > > If the hardware comes with no hardware flow tables, or they have
> > > limitations in terms of features, the existing infrastructure falls
> > > back to the software flow table implementation.
> > >
> > > The software flow table garbage collector skips entries that reside
> > > in the hardware, so the hardware will be responsible for releasing
> > > this flow table entry too via flow_offload_dead().
> > >
> > > Hardware configuration, either to add or to delete entries, is done
> > > from the hardware offload workqueue, to ensure this is done from
> > > user context, given that we may sleep when grabbing the mdio mutex.
> > >
> > > Signed-off-by: Pablo Neira Ayuso <pablo@xxxxxxxxxxxxx>
> >
> > I wonder how you deal with device/table removal? I know regrettably
> > little about the internals of nftables. I assume the table cannot be
> > removed, or the module unloaded, as long as there are flow entries?
> > And on device removal all flows pertaining to the removed ifindex
> > will be automatically flushed?
>
> Yes, this code is part of the generic software infrastructure, it's
> not specific to the hardware offload, it's already upstream, see
> net/netfilter/nft_flow_offload.c, see flow_offload_netdev_notifier.

Hm. At a glance flow_offload_iterate_cleanup() will just mark the flows
as dead, not request their removal from the HW. Doesn't that mean that
reloading the HW driver with flows installed will likely lead to HW/FW
resources being leaked (unless every driver duplicates a manual flush
on removal)?

> > Still there could be outstanding work items targeting the device, so
> > this WARN_ON:
> >
> > +	indev = dev_get_by_index(net, ifindex);
> > +	if (WARN_ON(!indev))
> > +		return 0;
> >
> > looks possible to trigger.
>
> It should not, that's why there's a WARN_ON there ;-).
>
> See nf_flow_table_hw_module_exit(): there's a call to
> cancel_work_sync() to stop the hw offload workqueue, then it flushes
> it. After this, there's a flow table cleanup. So no one should be
> calling that function by then.

Ah, I must be misunderstanding. I meant when a device is removed, not
when the flow_table_hw module is. Does nf_flow_table_hw_module_exit()
run when a device is removed? I was expecting that, for example,
something like nft_flow_offload_iterate_cleanup() would queue up all
the flow remove calls and then call flush_work() (not cancel_work);
see the sketch at the end of this mail.

> > On the general architecture - I think it's worth documenting somewhere
> > clearly that unlike TC offloads and most NDOs, add/del of NFT flows
> > are not protected by rtnl_lock.
>
> Someone could probably look at getting rid of the rtnl_lock() all over
> the place for hardware offloads; holding up the entire rtnetlink
> subsystem just because some piece of hardware is taking time to
> configure things is not good. Not explicitly related to this, but have
> a look at Florian Westphal's talk on rtnl_lock during NetDev.

Oh, 100% agreed. I was just pointing out that it could be useful to
mention the locking in kdoc or at least in the commit message.

> > > v4: More work in progress
> > > - Decouple nf_flow_table_hw from nft_flow_offload via rcu hooks
> > > - Consolidate ->ndo invocations, now they happen from the hw worker.
> > > - Fix bug in list handling, use list_replace_init()
> > > - Clean up entries on nf_flow_table_hw module removal
> > > - Add NFT_FLOWTABLE_F_HW flag to flowtables to explicitly signal
> > >   that the user wants to offload entries to hardware.
> > >
> > > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > > index ed0799a12bf2..be0c12acc3f0 100644
> > > --- a/include/linux/netdevice.h
> > > +++ b/include/linux/netdevice.h
> > > @@ -859,6 +859,13 @@ struct dev_ifalias {
> > >  	char ifalias[];
> > >  };
> > >
> > > +struct flow_offload;
> > > +
> > > +enum flow_offload_type {
> > > +	FLOW_OFFLOAD_ADD	= 0,
> > > +	FLOW_OFFLOAD_DEL,
> > > +};
> > > +
> > >  /*
> > >   * This structure defines the management hooks for network devices.
> > >   * The following hooks can be defined; unless noted otherwise, they are
> > > @@ -1316,6 +1323,8 @@ struct net_device_ops {
> > >  	int			(*ndo_bridge_dellink)(struct net_device *dev,
> > >  						      struct nlmsghdr *nlh,
> > >  						      u16 flags);
> > > +	int			(*ndo_flow_offload)(enum flow_offload_type type,
> > > +						    struct flow_offload *flow);
> >
> > nit: should there be kdoc for the new NDO? The ndo kdoc comment
> > doesn't look like it would be recognized by tools anyway, though...
>
> Yes, I can add this in the next iteration, no problem.
>
> > nit: using "flow" as the name rings slightly grandiose to me :)
> > I would appreciate a nf_ prefix for clarity. Drivers will have
> > to juggle a number of "flow" things; it would make the code easier
> > to follow if names were prefixed clearly, I feel.
>
> This infrastructure could be used from tc too. My take on this is that
> we should look at generalizing ndo's so they can be used from every
> subsystem, so you just pick your own poison when doing packet
> classification.
>
> With some intermediate representation that suits everyone well, we
> would save quite a bit of redundant code in the drivers, so all
> frontend interfaces that are basically part of the "same world" could
> call the same ndo. We just need some glue code/abstraction in between
> drivers and frontends [1].
>
> The other direction, which IMO I would prefer to skip, is to have one
> ndo for each frontend packet classification subsystem.

Unification, or agreement on a unified API which would cover all cases,
would be great. My vote doesn't carry much weight, but I thought I
would express my preference as far as the "interim" goes :)
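To make the device-removal point concrete, here is the rough shape of
the cleanup path I have in mind. This is an untested sketch against the
flow table API as I read it today: flow_offload_hw_queue_del() and
nf_flow_offload_hw_work are made-up names standing in for whatever the
hw offload module ends up calling its DEL request helper and its work
struct, and matching flows by iifidx is just one possible policy.

#include <linux/netdevice.h>
#include <linux/workqueue.h>
#include <net/netfilter/nf_flow_table.h>

static void flow_offload_hw_cleanup_one(struct flow_offload *flow,
					void *data)
{
	struct net_device *dev = data;

	/* Only touch flows which pass through the outgoing device. */
	if (flow->tuplehash[FLOW_OFFLOAD_DIR_ORIGINAL].tuple.iifidx !=
	    dev->ifindex &&
	    flow->tuplehash[FLOW_OFFLOAD_DIR_REPLY].tuple.iifidx !=
	    dev->ifindex)
		return;

	flow_offload_dead(flow);

	/* Made-up helper: queue a FLOW_OFFLOAD_DEL request for the hw
	 * worker instead of leaving the entry to the software GC.
	 */
	flow_offload_hw_queue_del(flow);
}

/* Would run from the netdevice notifier on NETDEV_UNREGISTER, once
 * for each flowtable that has NFT_FLOWTABLE_F_HW set.
 */
static void flow_offload_hw_netdev_cleanup(struct nf_flowtable *flowtable,
					   struct net_device *dev)
{
	nf_flow_table_iterate(flowtable, flow_offload_hw_cleanup_one, dev);

	/* flush_work(), not cancel_work_sync(): the DEL requests queued
	 * above must reach the driver's ->ndo_flow_offload() before the
	 * unregister completes, otherwise HW/FW resources are leaked.
	 */
	flush_work(&nf_flow_offload_hw_work);
}

With something along these lines the WARN_ON in the worker really would
be unreachable for the device-removal case, since all work targeting
the device gets drained while dev_get_by_index() can still find it.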