On Sat, Mar 12, 2022 at 08:56:49PM +0200, Vlad Buslov wrote: [...] > Hi Pablo, > > Thanks for reviewing my code and sorry for the late reply. > > We explored the approach you propose and found several issues with it. > First, the nice benefit of implementation in this patch is that having > counter increment in flow_offload_add() (and test in following patch) > completely avoids spamming the workqueue when the limit is reached which > is an important concern for slower embedded DPU cores. Second, it is not > possible to change it when IPS_HW_OFFLOAD_BIT is set at the very end of > flow_offload_work_add() function because in following patch we need to > verify that counter is in user-specified limit before attempting > offload. Third, changing the counter in wq tasks makes it hard to > balance correctly. Consider following cases: > > - flow_offload_work_add() can be called arbitrary amount of times per > flow due to refresh logic. However, any such flow is still deleted > only once. > > - flow_offload_work_del() can be called for flows that were never > actually offloaded (it is called for flows that have NF_FLOW_HW bit > that is unconditionally set before attempting to schedule offload task > on wq). > > Counter balancing issues could maybe be solved by carefully > conditionally changing it based on current value IPS_HW_OFFLOAD_BIT, but > spamming the workqueue can't be prevented for such design. > > > That also moves the atomic would be away from the packet path. > > I understand your concern. However, note that this atomic is normally > changed once for adding offloaded flow and once for removing it. The > code path is only executed per-packet in error cases where flow has > failed to offload and refresh is called repeatedly for same flow. Thanks for explaining. There used to be in the code a list of pending flows to be offloaded. I think it would be possible to restore such list and make it per-cpu, the idea is to add a new field to the flow_offload structure to annotate the cpu that needs to deal with this flow (same cpu deals with add/del/stats). The cpu field is set at flow creation time. Once there is one item, add work to the workqueue to that cpu. Meanwhile the workqueue does not have a chance, we keep adding more items to the workqueue. The workqueue handler then zaps the list of pending flows to be offloaded, it might have more than one single item in the list. So instead of three workqueues, we only have one. Scalability is achieved by fanning out flows over CPUs.