On 2/6/23 17:28, Luis Chamberlain wrote:
> On Tue, Jan 31, 2023 at 02:00:41PM +0100, Petr Pavlu wrote:
>> The acpi-cpufreq and pcc-cpufreq drivers are loaded through per-CPU
>> module aliases.
>
> It gets me to question, what makes this "per-CPU module aliases" and
> how do we find other similar uses as they are likely incorrect too?

The acpi-cpufreq module has the following aliases:
* 'cpu:type:x86,ven*fam*mod*:feature:*0016*' (X86_FEATURE_ACPI),
* 'cpu:type:x86,ven*fam*mod*:feature:*00E8*' (X86_FEATURE_HW_PSTATE),
* 'acpi*:LNXCPU:*' (ACPI_PROCESSOR_OBJECT_HID),
* 'acpi*:ACPI0007:*' (ACPI_PROCESSOR_DEVICE_HID).

The pcc-cpufreq module similarly has the aliases:
* 'acpi*:LNXCPU:*' (ACPI_PROCESSOR_OBJECT_HID),
* 'acpi*:ACPI0007:*' (ACPI_PROCESSOR_DEVICE_HID).

These per-CPU aliases can be found as follows:

$ grep -E '(cpu:type|ACPI0007|LNXCPU)' /lib/modules/$(uname -r)/modules.alias
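For completeness, these aliases come from the MODULE_DEVICE_TABLE()
entries in the two drivers. A rough sketch of what acpi-cpufreq declares
(abridged from drivers/cpufreq/acpi-cpufreq.c; the exact tables can
differ between kernel versions, and pcc-cpufreq carries only the acpi
table):

#include <linux/module.h>
#include <linux/acpi.h>            /* struct acpi_device_id */
#include <acpi/processor.h>        /* ACPI_PROCESSOR_*_HID */
#include <asm/cpu_device_id.h>     /* X86_MATCH_FEATURE() */

/* Matched against the "cpu:type:x86,..." modalias of each CPU device. */
static const struct x86_cpu_id acpi_cpufreq_ids[] = {
        X86_MATCH_FEATURE(X86_FEATURE_ACPI, NULL),
        X86_MATCH_FEATURE(X86_FEATURE_HW_PSTATE, NULL),
        {}
};
MODULE_DEVICE_TABLE(x86cpu, acpi_cpufreq_ids);

/* Matched against the ACPI processor objects/devices, i.e. "LNXCPU"
 * and "ACPI0007", again once per CPU. */
static const struct acpi_device_id processor_device_ids[] = {
        { ACPI_PROCESSOR_OBJECT_HID, },
        { ACPI_PROCESSOR_DEVICE_HID, },
        {}
};
MODULE_DEVICE_TABLE(acpi, processor_device_ids);

Modpost turns these tables into the modules.alias entries above, and the
uevent of each CPU/processor device then carries a matching MODALIAS,
which is why udevd ends up issuing one load request per CPU.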
I think it is generally ok for modules to use per-CPU aliases. The normal
case should be that the first load succeeds and any subsequent one is then
only a trivial check by libkmod that the given module has already been
inserted.

This breaks when a module repeatedly fails to load. In particular, the x86
cpufreq and edac drivers are arranged in a way that they cascade several
modules, and once the first one becomes the selected cpufreq/edac driver,
any further module load fails. If a subsystem decides to use this kind of
pattern then it should really limit the number of such failed loads. The
proposed patch improves the situation for cpufreq. I've not yet looked at
the edac stuff.

The second problem with per-CPU aliases is a burst of identical load
requests created by udevd (as also discussed on the linux-modules mailing
list). The daemon normally processes devices in parallel and as it goes
through the CPUs, it creates a number of requests to load the same per-CPU
module. This problem is present even if the needed module is loaded
successfully. A lot of unnecessary CPU work and memory allocation is
needed to handle each duplicate load.

My thinking is that this second problem is something that should be
addressed in udevd. The same situation can happen for other hardware
devices when they all have the same type and need the same modules. Udevd
could avoid attempting a second parallel load if it recognizes that one of
its workers is already doing the same insert. I hope to have a closer look
at this.

There is clearly some overlap in the sense that if udevd were improved in
this way, it would make this patch less useful. However, even if that gets
implemented at some point, I think it is still good if acpi-cpufreq and
pcc-cpufreq are explicitly attempted to be loaded only once on a given
system.

>> This can result in many unnecessary load requests during
>> boot if another frequency module, such as intel_pstate, is already
>> active.
>
> Perhaps you should mention that in the worst case, without the fix in
> commit 0254127ab977e ("module: Don't wait for GOING modules") some of
> these module load requests could fail and prevent boot, and that
> discussion over these duplicate module reqests ended up in us deciding that
> they are not needed, we just need one.

Ok, I'll add it in v2.

>> For instance, on a typical Intel system, one can observe that
>> udev makes 2x#CPUs attempts to insert acpi_cpufreq and 1x#CPUs attempts
>> for pcc_cpufreq. All these tries then fail if another frequency module
>> is already registered.
>>
>> Both acpi-cpufreq and pcc-cpufreq drivers have their platform firmware
>> interface defined by ACPI. Allowed performance states and parameters
>> must be same for each CPU. This makes it possible to model these
>> interfaces as platform devices.
>>
>> The patch extends the ACPI parsing logic to check the ACPI namespace if
>> the PPC or PCC interface is present and creates a virtual platform
>> device for each if it is available. The acpi-cpufreq and pcc-cpufreq
>> drivers are then updated to map to these devices.
>>
>> This allows to try loading acpi-cpufreq and pcc-cpufreq only once during
>> boot and only if a given interface is available in the firmware.
>
> That could cut boot time too? If so how much?

The following command should provide an approximate measurement of what
happens during the boot:

# perf stat --null --repeat 10 --post 'sleep 5' -- \
  udevadm trigger --action=add --subsystem-match=acpi --sysname-match=LNXCPU* --subsystem-match=cpu --sysname-match=cpu* --settle

My test was with 6.2-rc7, which already contains commit 0254127ab977e
("module: Don't wait for GOING modules"). Kernel modules were signed and
compressed with zstd. The test machines had intel_pstate loaded, so
acpi-cpufreq and pcc-cpufreq always failed to initialize.

Test machine A, with 16 CPUs + 62 GiB memory:
* vanilla: 0.20334 +- 0.00566 seconds time elapsed ( +- 2.79% )
* with the patch: 0.08939 +- 0.00308 seconds time elapsed ( +- 3.44% )

Test machine B, with 288 CPUs + 110 GiB memory:
* vanilla: 1.3439 +- 0.0232 seconds time elapsed ( +- 1.72% )
* with the patch: 0.74916 +- 0.00253 seconds time elapsed ( +- 0.34% )

Udevd processing was parallel, with the maximum number of its workers
being 2*cpu_count+16 in both cases. This means the reduced time roughly
indicates how much less the whole system (all CPUs) needed to be occupied.
The smaller system A sees a reduction of ~100 ms with the patch, while the
bigger system B shows an improvement of ~600 ms.

Thanks,
Petr