On Thu, Feb 04 2021 at 14:17, Nitesh Narayan Lal wrote: > On 2/4/21 2:06 PM, Marcelo Tosatti wrote: >>>> How about adding a new flag for isolcpus instead? >>>> >>> Do you mean a flag based on which we can switch the affinity mask to >>> housekeeping for all the devices at the time of IRQ distribution? >> Yes a new flag for isolcpus. HK_FLAG_IRQ_SPREAD or some better name. > > Does sounds like a nice idea to explore, lets see what Thomas thinks about it. I just read back up on that whole discussion and stared into the usage sites a bit. There are a couple of issues here in a larger picture. Looking at it from the device side first: The spreading is done for non-managed queues/interrupts which makes them movable by user space. So it could be argued from both sides that the damage done by allowing the full online mask or by allowing only the house keeping mask can be fixed up by user space. But that's the trivial part of the problem. The real problem is CPU hotplug and offline CPUs and the way how interrupts are set up for their initial affinity. As Robin noticed, the change in 1abdfe706a57 ("lib: Restrict cpumask_local_spread to houskeeping CPUs") is broken as it can return offline CPUs in both the NOHZ_FULL and the !NOHZ_FULL case. The original code is racy vs. hotplug unless the callers block hotplug. Let's look at all the callers and what they do with it. cptvf_set_irq_affinity() affinity hint safexcel_request_ring_irq() affinity hint mv_cesa_probe() affinity hint bnxt_request_irq() affinity hint nicvf_set_irq_affinity() affinity hint cxgb4_set_msix_aff() affinity hint enic_init_affinity_hint(() affinity hint iavf_request_traffic_irqs() affinity hint ionic_alloc_qcq_interrupt() affinity hint efx_set_interrupt_affinity() affinity hint i40e_vsi_request_irq_msix() affinity hint be_evt_queues_create() affinity hint, queue affinity hns3_nic_set_cpumask() affinity hint, queue affinity mlx4_en_init_affinity_hint() affinity hint, queue affinity mlx4_en_create_tx_ring() affinity hint, queue affinity set_comp_irq_affinity_hint() affinity hint, queue affinity i40e_config_xps_tx_ring() affinity hint, queue affinity hclge_configure affinity_hint, queue affinity, workqueue selection ixgbe_alloc_q_vector() node selection, affinity hint, queue affinity All of them do not care about disabling hotplug. Taking cpu_read_lock() inside of that spread function would not solve anything because once the lock is dropped the CPU can go away. There are 3 classes of this: 1) Does not matter: affinity hint 2) Might fail to set up the network queue when the selected CPU is offline. 3) Broken: The hclge driver which uses the cpu to schedule work on that cpu. That's broken, but unfortunately neither the workqueue code nor the timer code will ever notice. The work just wont be scheduled until the CPU comes online again which might be never. But looking at the above I really have to ask the question what the commit in question is actually trying to solve. AFAICT, nothing at all. Why? 1) The majority of the drivers sets the hint __after_ requesting the interrupt 2) Even if set _before_ requesting the interrupt it does not solve anything because it's a hint and the interrupt core code does not care about it at all. It provides the storage and the procfs interface nothing else. So how does that prevent the interrupt subsystem from assigning an interrupt to an isolated CPU? Not at all. Interrupts which are freshly allocated get the default interrupt affinity mask, which is either set on the command line or via /proc. The affinity of the interrupt can be changed after it has been populated in /proc. When the interrupt is requested then one of the online CPUs in it's affinity mask is chosen. X86 is special here because this also requires that there are free vectors on one of the online CPUs in the mask. If the CPUs in the affinity mask run out of vectors then it will grab a vector from some other CPU which might be an isolated CPU. When the affinity mask of the interrupt at the time when it is actually requested contains an isolated CPU then nothing prevents the kernel from steering it at an isolated CPU. But that has absolutely nothing to do with that spreading thingy. The only difference which this change makes is the fact that the affinity hint changes. Nothing else. This whole blurb about it might break isolation when an interrupt is requested is just nonsensical, really. If the default affinity mask is not correctly set up before devices are initialized then it's not going to be cured by changing that spread function. If the user space irq balancer ignores the isolation mask and blindly moves stuff to the affinity hint, then this monstrosity needs to be fixed. So I'm going to revert this commit because it _IS_ broken _AND_ useless and does not solve anything it claims to solve. Thanks, tglx