>-----Original Message-----
>From: Michael Kelley <mhklinux@xxxxxxxxxxx>
>Sent: Thursday, November 16, 2023 11:47 AM
>To: Michael Kelley <mhklinux@xxxxxxxxxxx>; Souradeep Chakrabarti
><schakrabarti@xxxxxxxxxxxxxxxxxxx>; KY Srinivasan <kys@xxxxxxxxxxxxx>;
>Haiyang Zhang <haiyangz@xxxxxxxxxxxxx>; wei.liu@xxxxxxxxxx; Dexuan Cui
><decui@xxxxxxxxxxxxx>; davem@xxxxxxxxxxxxx; edumazet@xxxxxxxxxx;
>kuba@xxxxxxxxxx; pabeni@xxxxxxxxxx; Long Li <longli@xxxxxxxxxxxxx>;
>leon@xxxxxxxxxx; cai.huoqing@xxxxxxxxx; ssengar@xxxxxxxxxxxxxxxxxxx;
>vkuznets@xxxxxxxxxx; tglx@xxxxxxxxxxxxx; linux-hyperv@xxxxxxxxxxxxxxx;
>netdev@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx;
>linux-rdma@xxxxxxxxxxxxxxx
>Cc: Souradeep Chakrabarti <schakrabarti@xxxxxxxxxxxxx>; Paul Rosswurm
><paulros@xxxxxxxxxxxxx>
>Subject: [EXTERNAL] RE: [PATCH net-next] net: mana: Assigning IRQ affinity
>on HT cores
>
>From: Michael Kelley <mhklinux@xxxxxxxxxxx> Sent: Wednesday, November 15,
>2023 9:26 PM
>> >
>> > diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c
>> > b/drivers/net/ethernet/microsoft/mana/gdma_main.c
>> > index 6367de0c2c2e..839be819d46e 100644
>> > --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
>> > +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
>> > @@ -1243,13 +1243,115 @@ void mana_gd_free_res_map(struct
>> > gdma_resource *r)
>> >  	r->size = 0;
>> >  }
>> >
>> > +static void cpu_mask_set(cpumask_var_t *filter_mask, cpumask_var_t
>> > **filter_mask_list)
>> > +{
>> > +	unsigned int core_count = 0, cpu;
>> > +	cpumask_var_t *filter_mask_list_tmp;
>> > +
>> > +	BUG_ON(!filter_mask || !filter_mask_list);
>>
>> This check seems superfluous. The function is invoked from only one
>> call site in the code below, and that call site can easily ensure
>> that it doesn't pass a NULL value for either parameter.

Fixed it in V2 patch.

>>
>> > +	filter_mask_list_tmp = *filter_mask_list;
>> > +	cpumask_copy(*filter_mask, cpu_online_mask);
>> > +	/* for each core create a cpumask lookup table,
>> > +	 * which stores all the corresponding siblings
>> > +	 */
>> > +	for_each_cpu(cpu, *filter_mask) {
>> > +		BUG_ON(!alloc_cpumask_var(&filter_mask_list_tmp[core_count], GFP_KERNEL));
>>
>> Don't panic if a memory allocation fails. Must back out, clean up,
>> and return an error.
>>
>> > +		cpumask_or(filter_mask_list_tmp[core_count], filter_mask_list_tmp[core_count],
>> > +			   topology_sibling_cpumask(cpu));
>>
>> alloc_cpumask_var() does not zero out the returned cpumask. So the
>> cpumask_or() is operating on uninitialized garbage. Use
>> zalloc_cpumask_var() to get a cpumask initialized to zero.
>>
>> But why is the cpumask_or() being done at all? Couldn't this just be
>> a cpumask_copy()? In that case, alloc_cpumask_var() is OK.
>>
>> > +		cpumask_andnot(*filter_mask, *filter_mask, topology_sibling_cpumask(cpu));
>> > +		core_count++;
>> > +	}
>> > +}
>> > +
>> > +static int irq_setup(int *irqs, int nvec)
>> > +{
>> > +	cpumask_var_t filter_mask;
>> > +	cpumask_var_t *filter_mask_list;
>> > +	unsigned int cpu_first, cpu, irq_start, cores = 0;
>> > +	int i, core_count = 0, numa_node, cpu_count = 0, err = 0;
>> > +
>> > +	BUG_ON(!alloc_cpumask_var(&filter_mask, GFP_KERNEL));
>>
>> Again, don't panic if a memory allocation fails. Must back out and
>> return an error.
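
For reference, the back-out pattern being asked for looks roughly like
this (untested sketch; the label name and placement are illustrative, not
the actual V2 code):

	if (!zalloc_cpumask_var(&filter_mask, GFP_KERNEL))
		return -ENOMEM;		/* fail setup instead of BUG_ON() */

	filter_mask_list = kcalloc(cores, sizeof(cpumask_var_t), GFP_KERNEL);
	if (!filter_mask_list) {
		err = -ENOMEM;
		goto free_mask;		/* undo the earlier allocation */
	}
	/* ... */
free_mask:
	free_cpumask_var(filter_mask);	/* safe to call unconditionally */
	return err;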
>>
>> > +	cpus_read_lock();
>> > +	cpumask_copy(filter_mask, cpu_online_mask);
>>
>> For readability, I'd suggest adding a blank line before any of the
>> multi-line comments below.
>>
>> > +	/* count the number of cores
>> > +	 */
>> > +	for_each_cpu(cpu, filter_mask) {
>> > +		cpumask_andnot(filter_mask, filter_mask, topology_sibling_cpumask(cpu));
>> > +		cores++;
>> > +	}
>> > +	filter_mask_list = kcalloc(cores, sizeof(cpumask_var_t), GFP_KERNEL);
>> > +	if (!filter_mask_list) {
>> > +		err = -ENOMEM;
>> > +		goto free_irq;
>
>One additional macro-level comment. Creating and freeing the
>filter_mask_list is a real pain. But it is needed to identify which CPUs
>in a core are available to have an IRQ assigned to them. So let me suggest
>a modified approach to meeting that need.
>
>1) Count the number of cores just as before.
>
>2) Then instead of allocating filter_mask_list, allocate an array of
>integers, with one array entry per core. Call the array core_id_list.
>Somewhat like the code in cpu_mask_set(), populate the array with the CPU
>number of the first CPU in each core. The populating only needs to be done
>once, so the code can be inline in irq_setup(). It doesn't need to be in a
>helper function like cpu_mask_set().
>
>3) Allocate a single cpumask_var_t local variable that is initialized to a
>copy of cpu_online_mask. Call it avail_cpus. This local variable indicates
>which CPUs (across all cores) are available to have an IRQ assigned.
>
>4) At the beginning of the "for" loop over nvec, current code has:
>
>	cpu_first = cpumask_first(filter_mask_list[core_count]);
>
>Instead, do:
>
>	cpu_first = cpumask_first_and(avail_cpus,
>			topology_sibling_cpumask(core_id_list[core_count]));
>
>The above code effectively creates on-the-fly the cpumask previously
>stored in filter_mask_list, and finds the first CPU as before.
>
>5) When a CPU is assigned an IRQ, clear that CPU in avail_cpus, not in the
>filter_mask_list entry.
>
>6) When the NUMA node wraps around and current code calls cpu_mask_set(),
>instead reinitialize avail_cpus to cpu_online_mask like in my #3 above.
>
>In summary, with this approach filter_mask_list isn't needed at all. The
>messiness of allocating and freeing an array of cpumask's goes away. I
>haven't coded it, but I think the result will be simpler and less error
>prone.
>

Changed it in V2. (A sketch of how I read this suggestion is at the end of
this mail.)

>Michael
>
>> > +	}
>> > +	/* if number of cpus are equal to max_queues per port, then
>> > +	 * one extra interrupt for the hardware channel communication.
>> > +	 */
>>
>> Note that higher level code may set nvec based on the # of online
>> CPUs, but until the cpus_read_lock is held, the # of online CPUs can
>> change. So nvec might have been determined when the # of CPUs was 8,
>> but 2 CPUs could go offline before the cpus_read_lock is obtained. So
>> nvec could be bigger than just 1 more than num_online_cpus(). I'm not
>> sure how to handle the check below to special case the hardware
>> communication channel. But realize that the main "for" loop below
>> could end up assigning multiple IRQs to the same CPU if nvec >
>> num_online_cpus().
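
One possible way to tighten this, as an untested sketch only (whether
capping nvec like this is acceptable for the driver is an open question,
and this is not necessarily what V2 does):

	cpus_read_lock();

	/* CPUs may have gone offline since nvec was sized, so re-check
	 * under the lock; cap at num_online_cpus() + 1, the extra one
	 * being the hardware communication channel. Any remaining
	 * vectors would simply keep their default affinity.
	 */
	if (nvec > num_online_cpus() + 1)
		nvec = num_online_cpus() + 1;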
>>
>> > +	if (nvec - 1 == num_online_cpus()) {
>> > +		irq_start = 1;
>> > +		cpu_first = cpumask_first(cpu_online_mask);
>> > +		irq_set_affinity_and_hint(irqs[0], cpumask_of(cpu_first));
>> > +	} else {
>> > +		irq_start = 0;
>> > +	}
>> > +	/* reset the core_count and num_node to 0.
>> > +	 */
>> > +	core_count = 0;
>> > +	numa_node = 0;
>> > +	cpu_mask_set(&filter_mask, &filter_mask_list);
>>
>> FWIW, passing filter_mask as the first argument above was confusing to
>> me until I realized that the value of filter_mask doesn't matter --
>> it's just to use the memory allocated for filter_mask. Maybe a
>> comment would help. And ditto for the invocation of cpu_mask_set()
>> further down.

Fixed it in V2.

>>
>> > +	/* for each interrupt find the cpu of a particular
>> > +	 * sibling set and if it belongs to the specific numa
>> > +	 * then assign irq to it and clear the cpu bit from
>> > +	 * the corresponding sibling list from filter_mask_list.
>> > +	 * Increase the cpu_count for that node.
>> > +	 * Once all cpus for a numa node is assigned, then
>> > +	 * move to different numa node and continue the same.
>> > +	 */
>> > +	for (i = irq_start; i < nvec; ) {
>> > +		cpu_first = cpumask_first(filter_mask_list[core_count]);
>> > +		if (cpu_first < nr_cpu_ids && cpu_to_node(cpu_first) ==
>> > +		    numa_node) {
>>
>> Note that it's possible to have a NUMA node with zero online CPUs.
>> It could be a NUMA node that was originally a memory-only NUMA node
>> and never had any CPUs, or the NUMA node could have had CPUs, but they
>> were all later taken offline. Such a NUMA node is still considered to
>> be online because it has memory, but it has no online CPUs.
>>
>> If this code ever sets "numa_node" to such a NUMA node, the "for" loop
>> will become an infinite loop because the "if" statement above will
>> never find a CPU that is assigned to "numa_node".
>>
>> > +			irq_set_affinity_and_hint(irqs[i], cpumask_of(cpu_first));
>> > +			cpumask_clear_cpu(cpu_first, filter_mask_list[core_count]);
>> > +			cpu_count = cpu_count + 1;
>> > +			i = i + 1;
>> > +			/* checking if all the cpus are used from the
>> > +			 * particular node.
>> > +			 */
>> > +			if (cpu_count == nr_cpus_node(numa_node)) {
>> > +				numa_node = numa_node + 1;
>> > +				if (numa_node == num_online_nodes()) {
>> > +					cpu_mask_set(&filter_mask,
>> > +						     &filter_mask_list);
>>
>> The second and subsequent calls to cpu_mask_set() will leak any memory
>> previously allocated by alloc_cpumask_var() in cpu_mask_set().
>>
>> I agree with Haiyang's comment about starting with a NUMA node other
>> than NUMA node zero. But if you do so, note that testing for
>> wrap-around at num_online_nodes() won't be equivalent to testing for
>> getting back to the starting NUMA node. You really want to run
>> cpu_mask_set() again only when you get back to the starting NUMA node.

Fixed it in V2.

>>
>> > +					numa_node = 0;
>> > +				}
>> > +				cpu_count = 0;
>> > +				core_count = 0;
>> > +				continue;
>> > +			}
>> > +		}
>> > +		if ((core_count + 1) % cores == 0)
>> > +			core_count = 0;
>> > +		else
>> > +			core_count++;
>>
>> The if/else could be more compactly written as:
>>
>>	if (++core_count == cores)
>>		core_count = 0;
>>
>> > +	}
>> > +free_irq:
>> > +	cpus_read_unlock();
>> > +	if (filter_mask)
>>
>> This check for non-NULL filter_mask is incorrect and might not even
>> compile if CONFIG_CPUMASK_OFFSTACK=n. For testing purposes, you
>> should make sure to build and run with each option: with
>> CONFIG_CPUMASK_OFFSTACK=y and also CONFIG_CPUMASK_OFFSTACK=n.

Fixed it in V2.

>> It's safe to just call free_cpumask_var() without the check.
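
For reference, why the unconditional call is safe (as I understand the
cpumask API; comment-only sketch):

	/*
	 * CONFIG_CPUMASK_OFFSTACK=y:
	 *   typedef struct cpumask *cpumask_var_t;
	 *   alloc_cpumask_var() really allocates, and free_cpumask_var()
	 *   is kfree()-based, so passing NULL is safe.
	 *
	 * CONFIG_CPUMASK_OFFSTACK=n:
	 *   typedef struct cpumask cpumask_var_t[1];
	 *   the mask lives in the variable itself, "if (filter_mask)"
	 *   tests an array address (always true, and compilers warn),
	 *   and free_cpumask_var() is a no-op.
	 */
	free_cpumask_var(filter_mask);	/* correct in both configs */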
>>
>> > +		free_cpumask_var(filter_mask);
>> > +	if (filter_mask_list) {
>> > +		for (core_count = 0; core_count < cores; core_count++)
>> > +			free_cpumask_var(filter_mask_list[core_count]);
>> > +		kfree(filter_mask_list);
>> > +	}
>> > +	return err;
>> > +}
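
To make sure I read the suggested structure correctly, here is an untested
sketch of irq_setup() along those lines. It deliberately keeps the hwc
special case and the NUMA walk from the original patch, so the separate
comments above (memory-only NUMA nodes, nvec vs. the number of online
CPUs, wrap-around at the starting node) still apply to it:

	static int irq_setup(int *irqs, int nvec)
	{
		cpumask_var_t avail_cpus;
		unsigned int *core_id_list;
		unsigned int cpu, cpu_first, cores = 0;
		int i, irq_start, core_count = 0, cpu_count = 0;
		int numa_node = 0, err = 0;

		if (!zalloc_cpumask_var(&avail_cpus, GFP_KERNEL))
			return -ENOMEM;

		cpus_read_lock();

		/* 1) count the cores, using avail_cpus as scratch space */
		cpumask_copy(avail_cpus, cpu_online_mask);
		for_each_cpu(cpu, avail_cpus) {
			cpumask_andnot(avail_cpus, avail_cpus,
				       topology_sibling_cpumask(cpu));
			cores++;
		}

		/* 2) one array entry per core: the first CPU in that core */
		core_id_list = kcalloc(cores, sizeof(unsigned int), GFP_KERNEL);
		if (!core_id_list) {
			err = -ENOMEM;
			goto out;
		}
		core_count = 0;
		cpumask_copy(avail_cpus, cpu_online_mask);
		for_each_cpu(cpu, avail_cpus) {
			core_id_list[core_count++] = cpu;
			cpumask_andnot(avail_cpus, avail_cpus,
				       topology_sibling_cpumask(cpu));
		}

		/* 3) avail_cpus tracks the CPUs still free to take an IRQ */
		cpumask_copy(avail_cpus, cpu_online_mask);

		if (nvec - 1 == num_online_cpus()) {
			irq_start = 1;
			cpu_first = cpumask_first(cpu_online_mask);
			irq_set_affinity_and_hint(irqs[0], cpumask_of(cpu_first));
		} else {
			irq_start = 0;
		}

		core_count = 0;
		for (i = irq_start; i < nvec; ) {
			/* 4) recreate the per-core mask of free CPUs on the fly */
			cpu_first = cpumask_first_and(avail_cpus,
					topology_sibling_cpumask(core_id_list[core_count]));
			if (cpu_first < nr_cpu_ids &&
			    cpu_to_node(cpu_first) == numa_node) {
				irq_set_affinity_and_hint(irqs[i], cpumask_of(cpu_first));
				/* 5) consume the CPU in avail_cpus */
				cpumask_clear_cpu(cpu_first, avail_cpus);
				cpu_count++;
				i++;
				if (cpu_count == nr_cpus_node(numa_node)) {
					numa_node++;
					if (numa_node == num_online_nodes()) {
						/* 6) wrapped: all CPUs are free again */
						cpumask_copy(avail_cpus, cpu_online_mask);
						numa_node = 0;
					}
					cpu_count = 0;
					core_count = 0;
					continue;
				}
			}
			if (++core_count == cores)
				core_count = 0;
		}
		kfree(core_id_list);
	out:
		cpus_read_unlock();
		free_cpumask_var(avail_cpus);
		return err;
	}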