On Thu, 25 Jan 2018 12:28:58 +0100, Pablo Neira Ayuso wrote:
> On Wed, Jan 24, 2018 at 05:31:36PM -0800, Jakub Kicinski wrote:
> > On Thu, 25 Jan 2018 01:09:41 +0100, Pablo Neira Ayuso wrote:
> > > This patch adds the infrastructure to offload flows to hardware, in
> > > case the nic/switch comes with built-in flow table capabilities.
> > >
> > > If the hardware comes with no hardware flow tables, or they have
> > > limitations in terms of features, the existing infrastructure falls
> > > back to the software flow table implementation.
> > >
> > > The software flow table garbage collector skips entries that reside
> > > in the hardware, so the hardware will be responsible for releasing
> > > this flow table entry too via flow_offload_dead().
> > >
> > > Hardware configuration, either to add or to delete entries, is done
> > > from the hardware offload workqueue, to ensure this is done from
> > > user context, given that we may sleep when grabbing the mdio mutex.
> > >
> > > Signed-off-by: Pablo Neira Ayuso <pablo@xxxxxxxxxxxxx>
> >
> > I wonder how you deal with device/table removal? I know regrettably
> > little about the internals of nftables. I assume the table cannot be
> > removed, or the module unloaded, as long as there are flow entries?
> > And on device removal all flows pertaining to the removed ifindex
> > will be automatically flushed?
>
> Yes, this code is part of the generic software infrastructure, it's
> not specific to the hardware offload, it's already upstream, see
> net/netfilter/nft_flow_offload.c, see flow_offload_netdev_notifier.

Hm. At a glance flow_offload_iterate_cleanup() will just mark the flows
as dead, not request their removal from the HW. Doesn't that mean that
reloading the HW driver with flows installed will likely lead to HW/FW
resources being leaked (unless every driver duplicates a manual flush
on removal)?

> > Still there could be outstanding work items targeting the device, so
> > this WARN_ON:
> >
> > +	indev = dev_get_by_index(net, ifindex);
> > +	if (WARN_ON(!indev))
> > +		return 0;
> >
> > looks possible to trigger.
>
> It should not, that's why there's a WARN_ON there ;-).
>
> See nf_flow_table_hw_module_exit(): there's a call to
> cancel_work_sync() to stop the hw offload workqueue, then it flushes
> it. After this, there's a flow table cleanup. So no one should be
> calling that function by then.

Ah, I must be misunderstanding. I meant when a device is removed, not
when the flow_table_hw module is. Does nf_flow_table_hw_module_exit()
run when a device is removed? I was expecting that, for example,
something like nft_flow_offload_iterate_cleanup() would queue up all
the flow remove calls and then call flush_work() (not cancel_work);
see the sketch at the end of this mail.

> > On the general architecture - I think it's worth documenting somewhere
> > clearly that unlike TC offloads and most NDOs, add/del of NFT flows
> > are not protected by rtnl_lock.
>
> Someone could probably look at getting rid of the rtnl_lock() all over
> the place for hardware offloads; holding up the entire rtnetlink
> subsystem just because some piece of hardware is taking time to
> configure things is not good. Not explicitly related to this, but have
> a look at Florian Westphal's talk on rtnl_lock during NetDev.

Oh, 100% agreed. I was just pointing out that it could be useful to
mention the locking in kdoc or at least in the commit message.

> > > v4: More work in progress
> > > - Decouple nf_flow_table_hw from nft_flow_offload via rcu hooks
> > > - Consolidate ->ndo invocations, now they happen from the hw worker.
> > > - Fix bug in list handling, use list_replace_init()
> > > - Clean up entries on nf_flow_table_hw module removal
> > > - Add NFT_FLOWTABLE_F_HW flag to flowtables to explicitly signal
> > >   that the user wants to offload entries to hardware.
> > >
> > > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > > index ed0799a12bf2..be0c12acc3f0 100644
> > > --- a/include/linux/netdevice.h
> > > +++ b/include/linux/netdevice.h
> > > @@ -859,6 +859,13 @@ struct dev_ifalias {
> > >  	char ifalias[];
> > >  };
> > >
> > > +struct flow_offload;
> > > +
> > > +enum flow_offload_type {
> > > +	FLOW_OFFLOAD_ADD	= 0,
> > > +	FLOW_OFFLOAD_DEL,
> > > +};
> > > +
> > >  /*
> > >   * This structure defines the management hooks for network devices.
> > >   * The following hooks can be defined; unless noted otherwise, they are
> > > @@ -1316,6 +1323,8 @@ struct net_device_ops {
> > >  	int			(*ndo_bridge_dellink)(struct net_device *dev,
> > >  						      struct nlmsghdr *nlh,
> > >  						      u16 flags);
> > > +	int			(*ndo_flow_offload)(enum flow_offload_type type,
> > > +						    struct flow_offload *flow);
> >
> > nit: should there be kdoc for the new NDO? The ndo kdoc comment
> > doesn't look like it would be recognized by tools anyway, though...
>
> Yes, I can add this in the next iteration, no problem.
>
> > nit: using "flow" as the name rings slightly grandiose to me :)
> > I would appreciate a nf_ prefix for clarity. Drivers will have
> > to juggle a number of "flow" things; it would make the code easier
> > to follow if names were prefixed clearly, I feel.
>
> This infrastructure could be used from tc too. My take on this is that
> we should look at generalizing ndo's so they can be used from every
> subsystem, so you just pick your own poison when doing packet
> classification.
>
> With some intermediate representation that suits everyone well, we
> would save quite a bit of redundant code in the drivers, so all
> frontend interfaces that are basically part of the "same world" could
> call the same ndo. We just need some glue code/abstraction in between
> drivers and frontends [1].
>
> The other direction, which IMO I would prefer to skip, is to have one
> ndo for each frontend packet classification subsystem.

Unification, or agreement on a unified API which would cover all cases,
would be great. My vote doesn't carry much weight, but I thought I
would express my preference as far as the "interim" goes :)
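To make the device-removal point concrete, here is the rough shape of
the cleanup path I have in mind. This is an untested sketch against the
flow table API as I read it today: flow_offload_hw_queue_del() and
nf_flow_offload_hw_work are made-up names standing in for whatever the
hw offload module ends up calling its DEL request helper and its work
struct, and matching flows by iifidx is just one possible policy.

#include <linux/netdevice.h>
#include <linux/workqueue.h>
#include <net/netfilter/nf_flow_table.h>

static void flow_offload_hw_cleanup_one(struct flow_offload *flow,
					void *data)
{
	struct net_device *dev = data;

	/* Only touch flows which pass through the outgoing device. */
	if (flow->tuplehash[FLOW_OFFLOAD_DIR_ORIGINAL].tuple.iifidx !=
	    dev->ifindex &&
	    flow->tuplehash[FLOW_OFFLOAD_DIR_REPLY].tuple.iifidx !=
	    dev->ifindex)
		return;

	flow_offload_dead(flow);

	/* Made-up helper: queue a FLOW_OFFLOAD_DEL request for the hw
	 * worker instead of leaving the entry to the software GC.
	 */
	flow_offload_hw_queue_del(flow);
}

/* Would run from the netdevice notifier on NETDEV_UNREGISTER, once
 * for each flowtable that has NFT_FLOWTABLE_F_HW set.
 */
static void flow_offload_hw_netdev_cleanup(struct nf_flowtable *flowtable,
					   struct net_device *dev)
{
	nf_flow_table_iterate(flowtable, flow_offload_hw_cleanup_one, dev);

	/* flush_work(), not cancel_work_sync(): the DEL requests queued
	 * above must reach the driver's ->ndo_flow_offload() before the
	 * unregister completes, otherwise HW/FW resources are leaked.
	 */
	flush_work(&nf_flow_offload_hw_work);
}

With something along these lines the WARN_ON in the worker really would
be unreachable for the device-removal case, since all work targeting
the device gets drained while dev_get_by_index() can still find it.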