Re: [PATCH net-next 2/8] netfilter: introduce total count of hw offloaded flow table entries

On Mon 07 Mar 2022 at 22:47, Pablo Neira Ayuso <pablo@xxxxxxxxxxxxx> wrote:
> On Tue, Feb 22, 2022 at 05:09:57PM +0200, Vlad Buslov wrote:
>> To improve hardware offload debuggability, and to allow capping the
>> total number of offloaded entries in a following patch, extend struct
>> netns_nftables with a 'count_hw' counter and expose it to userspace as
>> the 'nf_flowtable_count_hw' sysctl entry. Increment the counter
>> together with setting the NF_FLOW_HW flag when scheduling the offload
>> add task on the workqueue, and decrement it after successfully
>> scheduling the offload del task.
>> 
>> Signed-off-by: Vlad Buslov <vladbu@xxxxxxxxxx>
>> Signed-off-by: Oz Shlomo <ozsh@xxxxxxxxxx>
>> Reviewed-by: Paul Blakey <paulb@xxxxxxxxxx>
>> ---
>>  include/net/netns/nftables.h            |  1 +
>>  net/netfilter/nf_conntrack_standalone.c | 12 ++++++++++++
>>  net/netfilter/nf_flow_table_core.c      | 12 ++++++++++--
>>  3 files changed, 23 insertions(+), 2 deletions(-)
>> 
>> diff --git a/include/net/netns/nftables.h b/include/net/netns/nftables.h
>> index 8c77832d0240..262b8b3213cb 100644
>> --- a/include/net/netns/nftables.h
>> +++ b/include/net/netns/nftables.h
>> @@ -6,6 +6,7 @@
>>  
>>  struct netns_nftables {
>>  	u8			gencursor;
>> +	atomic_t		count_hw;
>>  };
>>  
>>  #endif
>> diff --git a/net/netfilter/nf_conntrack_standalone.c b/net/netfilter/nf_conntrack_standalone.c
>> index 3e1afd10a9b6..8cd55d502ffd 100644
>> --- a/net/netfilter/nf_conntrack_standalone.c
>> +++ b/net/netfilter/nf_conntrack_standalone.c
>> @@ -618,6 +618,9 @@ enum nf_ct_sysctl_index {
>>  #ifdef CONFIG_LWTUNNEL
>>  	NF_SYSCTL_CT_LWTUNNEL,
>>  #endif
>> +#if IS_ENABLED(CONFIG_NF_FLOW_TABLE)
>> +	NF_SYSCTL_CT_COUNT_HW,
>> +#endif
>>  
>>  	__NF_SYSCTL_CT_LAST_SYSCTL,
>>  };
>> @@ -973,6 +976,14 @@ static struct ctl_table nf_ct_sysctl_table[] = {
>>  		.mode		= 0644,
>>  		.proc_handler	= nf_hooks_lwtunnel_sysctl_handler,
>>  	},
>> +#endif
>> +#if IS_ENABLED(CONFIG_NF_FLOW_TABLE)
>> +	[NF_SYSCTL_CT_COUNT_HW] = {
>> +		.procname	= "nf_flowtable_count_hw",
>> +		.maxlen		= sizeof(int),
>> +		.mode		= 0444,
>> +		.proc_handler	= proc_dointvec,
>> +	},
>>  #endif
>>  	{}
>>  };
>> @@ -1111,6 +1122,7 @@ static int nf_conntrack_standalone_init_sysctl(struct net *net)
>>  	table[NF_SYSCTL_CT_PROTO_TIMEOUT_UDP_STREAM].data = &un->timeouts[UDP_CT_REPLIED];
>>  #if IS_ENABLED(CONFIG_NF_FLOW_TABLE)
>>  	table[NF_SYSCTL_CT_PROTO_TIMEOUT_UDP_OFFLOAD].data = &un->offload_timeout;
>> +	table[NF_SYSCTL_CT_COUNT_HW].data = &net->nft.count_hw;
>>  #endif
>>  
>>  	nf_conntrack_standalone_init_tcp_sysctl(net, table);
>> diff --git a/net/netfilter/nf_flow_table_core.c b/net/netfilter/nf_flow_table_core.c
>> index b90eca7a2f22..9f2b68bc6f80 100644
>> --- a/net/netfilter/nf_flow_table_core.c
>> +++ b/net/netfilter/nf_flow_table_core.c
>> @@ -315,6 +315,9 @@ int flow_offload_add(struct nf_flowtable *flow_table, struct flow_offload *flow)
>>  	nf_ct_offload_timeout(flow->ct);
>>  
>>  	if (nf_flowtable_hw_offload(flow_table)) {
>> +		struct net *net = read_pnet(&flow_table->net);
>> +
>> +		atomic_inc(&net->nft.count_hw);
>>  		__set_bit(NF_FLOW_HW, &flow->flags);
>>  		nf_flow_offload_add(flow_table, flow);
>
> Better to increment this new counter from flow_offload_work_add()?
> There is offload->flowtable->net; you could use it from there to bump
> the counter once the flow is placed in hardware (by the time
> IPS_HW_OFFLOAD_BIT is set).

Hi Pablo,

Thanks for reviewing my code and sorry for the late reply.

We explored the approach you propose and found several issues with it.
First, the nice benefit of the implementation in this patch is that
incrementing the counter in flow_offload_add() (and testing it in the
following patch) completely avoids spamming the workqueue once the
limit is reached, which is an important concern for slower embedded DPU
cores. Second, the counter cannot be changed where IPS_HW_OFFLOAD_BIT
is set at the very end of flow_offload_work_add(), because the
following patch needs to verify that the counter is within the
user-specified limit before attempting the offload. Third, changing the
counter from the wq tasks makes it hard to balance correctly. Consider
the following cases:

- flow_offload_work_add() can be called an arbitrary number of times
  per flow due to the refresh logic. However, any such flow is still
  deleted only once.

- flow_offload_work_del() can be called for flows that were never
  actually offloaded (it is called for any flow with the NF_FLOW_HW
  bit, which is set unconditionally before the offload add task is
  scheduled on the wq).

The counter balancing issues could perhaps be solved by carefully
changing the counter conditionally, based on the current value of
IPS_HW_OFFLOAD_BIT, but spamming the workqueue can't be prevented with
such a design.
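
For illustration, such conditional accounting from the wq tasks would
have to look roughly like the sketch below (a sketch only, not proposed
code): it keys off IPS_HW_OFFLOAD_BIT so that repeated add work items
for the same flow don't over-count, and del work items for
never-offloaded flows don't under-count.

static void flow_offload_work_add(struct flow_offload_work *offload)
{
	struct net *net = read_pnet(&offload->flowtable->net);

	/* ... actual rule installation elided in this sketch ... */

	/* Only count the first successful offload of this flow. */
	if (!test_and_set_bit(IPS_HW_OFFLOAD_BIT, &offload->flow->ct->status))
		atomic_inc(&net->nft.count_hw);
}

static void flow_offload_work_del(struct flow_offload_work *offload)
{
	struct net *net = read_pnet(&offload->flowtable->net);

	/* ... actual rule removal elided in this sketch ... */

	/* Only uncount flows that were actually offloaded. */
	if (test_and_clear_bit(IPS_HW_OFFLOAD_BIT, &offload->flow->ct->status))
		atomic_dec(&net->nft.count_hw);
}

Even with that, nothing stops flow_offload_add() from queueing more
work items after the limit is reached.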

>
> That would also move the atomic away from the packet path.

I understand your concern. However, note that this atomic is normally
changed once when an offloaded flow is added and once when it is
removed. The code path is only executed per-packet in the error case
where a flow has failed to offload and refresh is called repeatedly for
the same flow.
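
To make the intent concrete, the follow-up check in flow_offload_add()
boils down to roughly the following (again a sketch only; 'max_hw' is a
made-up name here for the user-configurable cap added by the next
patch):

	if (nf_flowtable_hw_offload(flow_table)) {
		struct net *net = read_pnet(&flow_table->net);
		int max_hw = READ_ONCE(net->nft.max_hw);

		atomic_inc(&net->nft.count_hw);
		if (max_hw && atomic_read(&net->nft.count_hw) > max_hw) {
			/* Over the cap: give the slot back and keep the
			 * flow in software. This can transiently
			 * overshoot under contention, which is fine for
			 * a soft limit.
			 */
			atomic_dec(&net->nft.count_hw);
		} else {
			__set_bit(NF_FLOW_HW, &flow->flags);
			nf_flow_offload_add(flow_table, flow);
		}
	}

No offload work is queued at all once the cap is hit. With this patch
applied, the current total should be readable from
/proc/sys/net/netfilter/nf_flowtable_count_hw.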

>
>>  	}
>> @@ -462,10 +465,15 @@ static void nf_flow_offload_gc_step(struct flow_offload *flow, void *data)
>>  
>>  	if (test_bit(NF_FLOW_TEARDOWN, &flow->flags)) {
>>  		if (test_bit(NF_FLOW_HW, &flow->flags)) {
>> -			if (!test_bit(NF_FLOW_HW_DYING, &flow->flags))
>> +			if (!test_bit(NF_FLOW_HW_DYING, &flow->flags)) {
>> +				struct net *net = read_pnet(&flow_table->net);
>> +
>>  				nf_flow_offload_del(flow_table, flow);
>> -			else if (test_bit(NF_FLOW_HW_DEAD, &flow->flags))
>> +				if (test_bit(NF_FLOW_HW_DYING, &flow->flags))
>> +					atomic_dec(&net->nft.count_hw);
>
> Same with this, I'd suggest to decrement it from flow_offload_work_del().
>
>> +			} else if (test_bit(NF_FLOW_HW_DEAD, &flow->flags)) {
>>  				flow_offload_del(flow_table, flow);
>> +			}
>>  		} else {
>>  			flow_offload_del(flow_table, flow);
>>  		}
>> -- 
>> 2.31.1
>> 



