This is a follow-up posting for "[v1] i40e: limit the msix vectors based on
housekeeping CPUs" [1] (it took longer than expected for me to get back to
this).

Issue
=====
With the current implementation, device drivers only take num_online_cpus()
into consideration when creating their MSI-X vectors. This works quite well
in a non-RT environment, but in an RT environment with a large number of
isolated CPUs and very few housekeeping CPUs it can lead to a problem.

The problem is triggered when something like tuned tries to move all the IRQs
from the isolated CPUs to the limited number of housekeeping CPUs, to prevent
interruptions of the latency-sensitive workload running on the isolated CPUs.
The move fails because of the per-CPU vector limit.

Proposed Fix
============
In this patch-set, the following changes are proposed:

- A generic API, num_housekeeping_cpus(), which returns the number of
  available housekeeping CPUs in an environment with isolated CPUs, and the
  number of online CPUs otherwise.

- i40e: Specifically for the i40e driver, the num_online_cpus() used in
  i40e_init_msix() to calculate the number of MSI-X vectors is replaced with
  the above API. This restricts the number of MSI-X vectors requested by i40e
  in RT environments.

- pci_alloc_irq_vectors(): With the help of num_housekeeping_cpus(), the
  max_vecs passed to pci_alloc_irq_vectors() is restricted to the available
  housekeeping CPUs, but only in an environment that has isolated CPUs.
  However, if min_vecs exceeds num_housekeeping_cpus(), no change is made, so
  that device initialization is not prevented by a lack of housekeeping CPUs.
  (A rough sketch of these changes follows this list.)
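To make the intent easier to follow, here is a rough, untested sketch of the
proposed helper and of how pci_alloc_irq_vectors() could use it. The choice
of HK_FLAG_MANAGED_IRQ and the internal details shown here are illustrative
assumptions only; please refer to the actual patches for the real
implementation:

/* kernel/sched/isolation.c (sketch) */
unsigned int num_housekeeping_cpus(void)
{
        /*
         * When CPU isolation is in effect, count only the housekeeping
         * CPUs; otherwise fall back to all online CPUs.
         */
        if (static_branch_unlikely(&housekeeping_overridden))
                return cpumask_weight(housekeeping_cpumask(HK_FLAG_MANAGED_IRQ));

        return num_online_cpus();
}

/* include/linux/pci.h (sketch) */
static inline int
pci_alloc_irq_vectors(struct pci_dev *dev, unsigned int min_vecs,
                      unsigned int max_vecs, unsigned int flags)
{
        unsigned int hk_cpus = num_housekeeping_cpus();

        /*
         * Restrict max_vecs to the number of housekeeping CPUs, but only
         * if the device can still get its minimum; otherwise leave
         * max_vecs untouched so device initialization is not prevented.
         */
        if (hk_cpus < max_vecs && min_vecs <= hk_cpus)
                max_vecs = hk_cpus;

        return pci_alloc_irq_vectors_affinity(dev, min_vecs, max_vecs,
                                               flags, NULL);
}

The i40e change is the same idea applied directly in i40e_init_msix():
num_online_cpus() is swapped for num_housekeeping_cpus() when sizing the
LAN MSI-X vectors.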
Reproducing the Issue
=====================
I have triggered this issue on a setup that had a total of 72 cores, of which
68 were isolated and only 4 were left for housekeeping tasks. I was using
tuned's realtime-virtual-host profile to configure the system. In this
scenario, tuned reported the error "Failed to set SMP affinity of IRQ xxx to
'00000040,00000010,00000005': [Errno 28] No space left on device" for several
IRQs in tuned.log, due to the per-CPU vector limit.

Testing
=======
Functionality:
- To verify that the i40e change resolves the issue, I added a tracepoint in
  i40e_init_msix() to report the number of CPUs used for vector creation with
  and without tuned's realtime-virtual-host profile. As expected, with the
  profile applied only the number of housekeeping CPUs was used, and without
  it all available CPUs were used.

Performance:
- To analyze the performance impact, I targeted the change introduced in
  pci_alloc_irq_vectors() and compared the results against vanilla kernel
  (5.9.0-rc3) results.

  Setup Information:
  + I had a couple of 24-core machines connected back to back via a couple of
    mlx5 NICs, and I analyzed the average bitrate for server-client TCP and
    UDP transmission via iperf.
  + To minimize the bitrate variation of the iperf TCP and UDP stream tests,
    I applied tuned's network-throughput profile and disabled HT.

  Test Information:
  + For the environment that had no isolated CPUs:
    I tested with a single stream and 24 streams (the number of online CPUs).
  + For the environment that had 20 isolated CPUs:
    I tested with a single stream, 4 streams (the number of housekeeping
    CPUs) and 24 streams (the number of online CPUs).

  Results:
  # UDP Stream Test:
  + There was no degradation observed in the UDP stream tests in either
    environment (with isolated CPUs and without isolated CPUs, after the
    introduction of the patches).
  # TCP Stream Test - No isolated CPUs:
  + No noticeable degradation was observed.
  # TCP Stream Test - With isolated CPUs:
  + Multiple Streams (4)  - average degradation of around 5-6%
  + Multiple Streams (24) - average degradation of around 2-3%
  + Single Stream         - even on a vanilla kernel, the bitrate observed
    for a TCP single-stream test varies significantly across runs (e.g. the
    variation between the best and the worst case on a vanilla kernel was
    around 8-10%). A similar variation was observed with the kernel that
    included my patches; no additional degradation was observed.

Since the change to pci_alloc_irq_vectors() is going to impact several
drivers, I have posted this patch-set as an RFC. I would be happy to perform
more testing based on any suggestions, or to incorporate any comments, to
ensure that the change does not break anything.

[1] https://lore.kernel.org/patchwork/patch/1256308/

Nitesh Narayan Lal (3):
  sched/isolation: API to get num of housekeeping CPUs
  i40e: limit msix vectors based on housekeeping CPUs
  PCI: Limit pci_alloc_irq_vectors as per housekeeping CPUs

 drivers/net/ethernet/intel/i40e/i40e_main.c |  3 ++-
 include/linux/pci.h                         | 16 ++++++++++++++
 include/linux/sched/isolation.h             |  7 +++++++
 kernel/sched/isolation.c                    | 23 +++++++++++++++++++++
 4 files changed, 48 insertions(+), 1 deletion(-)

-- 
2.27.0