Re: [PATCH net-next 2/8] netfilter: introduce total count of hw offloaded flow table entries

On Tue 15 Mar 2022 at 11:23, Pablo Neira Ayuso <pablo@xxxxxxxxxxxxx> wrote:
> On Sat, Mar 12, 2022 at 08:56:49PM +0200, Vlad Buslov wrote:
> [...]
>> Hi Pablo,
>> 
>> Thanks for reviewing my code and sorry for the late reply.
>> 
>> We explored the approach you propose and found several issues with it.
>> First, the nice benefit of the implementation in this patch is that
>> incrementing the counter in flow_offload_add() (and testing it in the
>> following patch) completely avoids spamming the workqueue when the
>> limit is reached, which is an important concern for slower embedded
>> DPU cores (see the sketch further below). Second, the counter cannot
>> be changed at the point where IPS_HW_OFFLOAD_BIT is set at the very
>> end of flow_offload_work_add(), because in the following patch we need
>> to verify that the counter is within the user-specified limit before
>> attempting the offload. Third, changing the counter in wq tasks makes
>> it hard to balance correctly. Consider the following cases:
>> 
>> - flow_offload_work_add() can be called an arbitrary number of times
>>   per flow due to the refresh logic. However, any such flow is still
>>   deleted only once.
>> 
>> - flow_offload_work_del() can be called for flows that were never
>>   actually offloaded (it is called for any flow that has the
>>   NF_FLOW_HW bit, which is unconditionally set before the offload
>>   task is scheduled on the wq).
>>
>> Counter balancing issues could perhaps be solved by carefully changing
>> the counter conditionally based on the current value of
>> IPS_HW_OFFLOAD_BIT, but spamming the workqueue can't be prevented with
>> such a design.
>>
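
To make the first point concrete, here is a simplified sketch of what
this series does (illustrative names only, not the exact diff;
max_hw_count stands in for the user-configurable limit added by the
following patch):

/* Sketch: bump a global atomic in flow_offload_add() and test it
 * before setting NF_FLOW_HW and scheduling the offload work, so
 * nothing is queued on the wq once the limit is reached.
 */
static atomic_t nf_flow_count_hw = ATOMIC_INIT(0);

int flow_offload_add(struct nf_flowtable *flow_table,
		     struct flow_offload *flow)
{
	/* ... rhashtable insertion as today ... */

	if (nf_flowtable_hw_offload(flow_table)) {
		if (atomic_inc_return(&nf_flow_count_hw) > max_hw_count) {
			/* Over the limit: undo the increment and skip hw
			 * offload entirely, no work item is queued.
			 */
			atomic_dec(&nf_flow_count_hw);
			return 0;
		}
		__set_bit(NF_FLOW_HW, &flow->flags);
		nf_flow_offload_add(flow_table, flow);
	}

	return 0;
}
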
>> > That also moves the atomic away from the packet path.
>> 
>> I understand your concern. However, note that this atomic is normally
>> changed only once when an offloaded flow is added and once when it is
>> removed. The code path is only executed per packet in error cases
>> where the flow has failed to offload and refresh is called repeatedly
>> for the same flow.
>
> Thanks for explaining.
>
> There used to be a list of pending flows to be offloaded in this code.
>
> I think it would be possible to restore such a list and make it
> per-cpu. The idea is to add a new field to the flow_offload structure
> to annotate the cpu that needs to deal with this flow (the same cpu
> handles add/del/stats). The cpu field is set at flow creation time.

What would be the algorithm for assigning the cpu field? A simple
algorithm like round-robin will not take into account the CPU load of
unrelated tasks (for example, OvS, which is also CPU-intensive), so
offload tasks on contested cores will get less CPU time, resulting in
unbalanced occupancy where some cores are idle while others have long
lists of offload tasks. Any more advanced algorithm would be hard to
implement since we don't have access to scheduler-internal data. Also,
in my experience not all offload tasks take the same amount of CPU time
(for example, offloading complex flows with tunnels takes longer than
offloading simple flows, and deletes take less time than adds), so the
current list sizes alone don't directly translate into list processing
time.
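
For the sake of argument, the simplest assignment I can think of would
be a round-robin pick like the sketch below (flow_offload_pick_cpu()
and the flow->cpu field are hypothetical). It equalizes the number of
flows per core, but it is blind to both of the problems described
above:

static atomic_t flow_cpu_rr = ATOMIC_INIT(0);

/* Naive round-robin assignment for the proposed flow->cpu field. This
 * only balances the flow count per core: it knows nothing about
 * unrelated load (e.g. OvS) or the non-uniform cost of offload jobs.
 */
static unsigned int flow_offload_pick_cpu(void)
{
	unsigned int cpu;

	do {
		cpu = (unsigned int)atomic_inc_return(&flow_cpu_rr) %
		      nr_cpu_ids;
	} while (!cpu_online(cpu));

	return cpu;
}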

>
> Once there is at least one item in the list, add a work item to the
> workqueue for that cpu. While the workqueue has not had a chance to
> run yet, we keep adding more items to the list.
>
> The workqueue handler then zaps the list of pending flows to be
> offloaded; it might find more than one item in the list.

I understand the proposal, but I'm missing the benefit it provides over
the existing workqueue approach. Both a standard kernel linked list and
a workqueue are unbounded and don't count their elements, which means
we would still have to implement an approach similar to what is
proposed in the existing series: add an atomic to manually count the
size and reject new elements over some maximum (including, in the case
of a unified list, flow deletions that we don't really want to skip).
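
To illustrate: even with a per-cpu pending list, the rejection logic
would end up looking very much like what this series already adds. All
names below (nf_flow_pending, max_pending, flow->cpu,
flow->pending_list) are hypothetical:

struct nf_flow_pending {
	struct list_head	head;
	spinlock_t		lock;
	struct work_struct	work;
};

static DEFINE_PER_CPU(struct nf_flow_pending, nf_flow_pending_lists);
static atomic_t nf_flow_pending_count = ATOMIC_INIT(0);

static bool flow_offload_queue(struct flow_offload *flow)
{
	struct nf_flow_pending *p = per_cpu_ptr(&nf_flow_pending_lists,
						flow->cpu);

	/* list_head does not count its elements, so the same atomic
	 * counter as in this series is still needed to enforce a limit.
	 */
	if (atomic_inc_return(&nf_flow_pending_count) > max_pending) {
		atomic_dec(&nf_flow_pending_count);
		return false;
	}

	spin_lock_bh(&p->lock);
	list_add_tail(&flow->pending_list, &p->head);
	spin_unlock_bh(&p->lock);
	schedule_work_on(flow->cpu, &p->work);

	return true;
}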

>
> So instead of three workqueues, we only have one. Scalability is
> achieved by fanning out flows over CPUs.

But the existing nf_ft_offload_* workqueues are already parallel and
unbound, so they already fan tasks out over CPU cores, and they
probably do it better than any custom algorithm we could come up with,
since their worker threads are scheduled by the system scheduler, which
takes current CPU load into account.
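
For reference, from my reading of nf_flow_table_offload_init() in
net/netfilter/nf_flow_table_offload.c, all three workqueues are already
allocated unbound (flags quoted from memory, so take them with a grain
of salt):

nf_flow_offload_add_wq = alloc_workqueue("nf_ft_offload_add",
					 WQ_UNBOUND | WQ_SYSFS, 0);
nf_flow_offload_del_wq = alloc_workqueue("nf_ft_offload_del",
					 WQ_UNBOUND | WQ_SYSFS, 0);
nf_flow_offload_stats_wq = alloc_workqueue("nf_ft_offload_stats",
					   WQ_UNBOUND | WQ_SYSFS, 0);

With WQ_UNBOUND the work items are not pinned to the submitting CPU, so
the scheduler is free to place the worker threads on idle cores.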


