On 2/5/21 5:23 PM, Thomas Gleixner wrote:
> On Thu, Feb 04 2021 at 14:17, Nitesh Narayan Lal wrote:
>> On 2/4/21 2:06 PM, Marcelo Tosatti wrote:
>>>>> How about adding a new flag for isolcpus instead?
>>>>>
>>>> Do you mean a flag based on which we can switch the affinity mask to
>>>> housekeeping for all the devices at the time of IRQ distribution?
>>> Yes, a new flag for isolcpus. HK_FLAG_IRQ_SPREAD or some better name.
>> Does sound like a nice idea to explore; let's see what Thomas thinks about it.
> I just read back up on that whole discussion and stared into the usage
> sites a bit.
>
> There are a couple of issues here in a larger picture. Looking at it
> from the device side first:
>
> The spreading is done for non-managed queues/interrupts, which makes them
> movable by user space. So it could be argued from both sides that the
> damage done by allowing the full online mask, or by allowing only the
> housekeeping mask, can be fixed up by user space.
>
> But that's the trivial part of the problem. The real problem is CPU
> hotplug and offline CPUs, and the way interrupts are set up with their
> initial affinity.
>
> As Robin noticed, the change in 1abdfe706a57 ("lib: Restrict
> cpumask_local_spread to houskeeping CPUs") is broken, as it can return
> offline CPUs in both the NOHZ_FULL and the !NOHZ_FULL case.

A quick question here: is there any reason why we don't use cpu_online_mask
instead of cpu_possible_mask in housekeeping_cpumask()? (Not for this
particular patch, but in general.)

> The original code is racy vs. hotplug unless the callers block hotplug.
>
> Let's look at all the callers and what they do with it.
>
> cptvf_set_irq_affinity()        affinity hint
> safexcel_request_ring_irq()     affinity hint
> mv_cesa_probe()                 affinity hint
> bnxt_request_irq()              affinity hint
> nicvf_set_irq_affinity()        affinity hint
> cxgb4_set_msix_aff()            affinity hint
> enic_init_affinity_hint()       affinity hint
> iavf_request_traffic_irqs()     affinity hint
> ionic_alloc_qcq_interrupt()     affinity hint
> efx_set_interrupt_affinity()    affinity hint
> i40e_vsi_request_irq_msix()     affinity hint
>
> be_evt_queues_create()          affinity hint, queue affinity
> hns3_nic_set_cpumask()          affinity hint, queue affinity
> mlx4_en_init_affinity_hint()    affinity hint, queue affinity
> mlx4_en_create_tx_ring()        affinity hint, queue affinity
> set_comp_irq_affinity_hint()    affinity hint, queue affinity
> i40e_config_xps_tx_ring()       affinity hint, queue affinity
>
> hclge_configure()               affinity hint, queue affinity, workqueue selection
>
> ixgbe_alloc_q_vector()          node selection, affinity hint, queue affinity
>
> None of them cares about disabling hotplug. Taking cpus_read_lock()
> inside of that spread function would not solve anything, because once the
> lock is dropped the CPU can go away.
>
> There are 3 classes of this:
>
> 1) Does not matter: affinity hint.
>
> 2) Might fail to set up the network queue when the selected CPU
>    is offline.
>
> 3) Broken: the hclge driver, which uses the CPU to schedule work on
>    that CPU. That's broken, but unfortunately neither the workqueue
>    code nor the timer code will ever notice. The work just won't be
>    scheduled until the CPU comes online again, which might be never.

Agreed.

> But looking at the above I really have to ask the question what the
> commit in question is actually trying to solve.
>
> AFAICT, nothing at all. Why?
>
> 1) The majority of the drivers set the hint _after_ requesting the
>    interrupt.
>
> 2) Even if set _before_ requesting the interrupt, it does not solve
>    anything, because it's a hint and the interrupt core code does
>    not care about it at all.
>    It provides the storage and the procfs
>    interface, nothing else.
>
> So how does that prevent the interrupt subsystem from assigning an
> interrupt to an isolated CPU? Not at all.
>
> Interrupts which are freshly allocated get the default interrupt
> affinity mask, which is either set on the command line or via /proc. The
> affinity of the interrupt can be changed after it has been populated in
> /proc.
>
> When the interrupt is requested, one of the online CPUs in its
> affinity mask is chosen.
>
> X86 is special here, because this also requires that there are free
> vectors on one of the online CPUs in the mask. If the CPUs in the
> affinity mask run out of vectors, then it will grab a vector from some
> other CPU, which might be an isolated CPU.
>
> When the affinity mask of the interrupt, at the time when it is actually
> requested, contains an isolated CPU, then nothing prevents the kernel
> from steering it to an isolated CPU. But that has absolutely nothing to
> do with that spreading thingy.
>
> The only difference which this change makes is the fact that the
> affinity hint changes. Nothing else.

Thanks for the detailed explanation.

Before I posted this patch, I was debugging a setup where I was observing
latency issues due to iavf IRQs that were pinned to isolated CPUs. Based on
some initial traces, I was under the impression that the affinity hint or
cpumask_local_spread was somehow playing a role in deciding the affinity
mask of these IRQs, although that looks incorrect after going through your
explanation.

For some reason, with a kernel that had this patch, the iavf IRQs always
ended up on the HK CPUs when I created VFs. The reason for this behavior
is still not clear to me; I will investigate it further.

-- 
Nitesh
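P.S.: the offline-CPU problem described above can be modeled in a few lines
of userspace C. This is only an illustrative sketch: NR_CPUS, the mask
values, and the helpers pick_ith(), local_spread_broken() and
local_spread_fixed() are all made up for the demo. The point it mirrors is
that selecting the i-th CPU of the housekeeping mask without intersecting
it with the online mask, as cpumask_local_spread() effectively did after
1abdfe706a57, can hand back an offline CPU.

```c
#include <assert.h>

#define NR_CPUS 8

/* Simulated cpumasks: bit n set => CPU n is in the mask. */
static const unsigned int possible_mask  = 0xff; /* CPUs 0-7 exist            */
static const unsigned int online_mask    = 0x0f; /* only CPUs 0-3 are online  */
static const unsigned int housekeep_mask = 0x3c; /* CPUs 2-5 are housekeeping */

static int mask_weight(unsigned int m)
{
	int w = 0;

	while (m) {
		w += m & 1;
		m >>= 1;
	}
	return w;
}

/* Return the i-th set bit of @mask, wrapping around like the kernel helper. */
static int pick_ith(unsigned int mask, unsigned int i)
{
	int cpu;

	i %= mask_weight(mask);
	for (cpu = 0; cpu < NR_CPUS; cpu++)
		if ((mask & (1u << cpu)) && i-- == 0)
			return cpu;
	return -1;
}

/* Broken selection: spreads over housekeeping CPUs, online or not. */
static int local_spread_broken(unsigned int i)
{
	return pick_ith(housekeep_mask & possible_mask, i);
}

/* Fixed selection: additionally restricted to online CPUs. */
static int local_spread_fixed(unsigned int i)
{
	return pick_ith(housekeep_mask & online_mask, i);
}
```

With these masks, local_spread_broken(2) picks CPU 4, which is in the
housekeeping mask but offline, while the fixed variant can only ever
return CPU 2 or CPU 3. (In the real kernel the race remains even with the
online intersection, since a CPU can go offline right after being picked
unless the caller holds the hotplug lock, which is Thomas's point above.)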