On 2/6/23 17:28, Luis Chamberlain wrote:
> On Tue, Jan 31, 2023 at 02:00:41PM +0100, Petr Pavlu wrote:
>> The acpi-cpufreq and pcc-cpufreq drivers are loaded through per-CPU
>> module aliases.
>
> It gets me to question, what makes this "per-CPU module aliases" and
> how do we find other similar uses as they are likely incorrect too?

The acpi-cpufreq module has the following aliases:
* 'cpu:type:x86,ven*fam*mod*:feature:*0016*' (X86_FEATURE_ACPI),
* 'cpu:type:x86,ven*fam*mod*:feature:*00E8*' (X86_FEATURE_HW_PSTATE),
* 'acpi*:LNXCPU:*' (ACPI_PROCESSOR_OBJECT_HID),
* 'acpi*:ACPI0007:*' (ACPI_PROCESSOR_DEVICE_HID).

The pcc-cpufreq module similarly has the aliases:
* 'acpi*:LNXCPU:*' (ACPI_PROCESSOR_OBJECT_HID),
* 'acpi*:ACPI0007:*' (ACPI_PROCESSOR_DEVICE_HID).

These per-CPU aliases can be found as follows:

$ grep -E '(cpu:type|ACPI0007|LNXCPU)' /lib/modules/$(uname -r)/modules.alias
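For completeness, these aliases come from the MODULE_DEVICE_TABLE()
entries in the two drivers. A rough sketch of what acpi-cpufreq declares
(abridged from drivers/cpufreq/acpi-cpufreq.c; the exact tables can
differ between kernel versions, and pcc-cpufreq carries only the acpi
table):

#include <linux/module.h>
#include <linux/acpi.h>            /* struct acpi_device_id */
#include <acpi/processor.h>        /* ACPI_PROCESSOR_*_HID */
#include <asm/cpu_device_id.h>     /* X86_MATCH_FEATURE() */

/* Matched against the "cpu:type:x86,..." modalias of each CPU device. */
static const struct x86_cpu_id acpi_cpufreq_ids[] = {
        X86_MATCH_FEATURE(X86_FEATURE_ACPI, NULL),
        X86_MATCH_FEATURE(X86_FEATURE_HW_PSTATE, NULL),
        {}
};
MODULE_DEVICE_TABLE(x86cpu, acpi_cpufreq_ids);

/* Matched against the ACPI processor objects/devices, i.e. "LNXCPU"
 * and "ACPI0007", again once per CPU. */
static const struct acpi_device_id processor_device_ids[] = {
        { ACPI_PROCESSOR_OBJECT_HID, },
        { ACPI_PROCESSOR_DEVICE_HID, },
        {}
};
MODULE_DEVICE_TABLE(acpi, processor_device_ids);

Modpost turns these tables into the modules.alias entries above, and the
uevent of each CPU/processor device then carries a matching MODALIAS,
which is why udevd ends up issuing one load request per CPU.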
I think it is generally ok for modules to use per-CPU aliases. The normal
case should be that the first load succeeds and any subsequent one is then
only a trivial check by libkmod that the given module has already been
inserted.

This breaks when a module repeatedly fails to load. In particular, the x86
cpufreq and edac drivers are arranged in a way that they cascade several
modules, and once the first one becomes the selected cpufreq/edac driver,
any further module load fails. If a subsystem decides to use this kind of
pattern then it should really limit the number of such failed loads. The
proposed patch improves the situation for cpufreq. I've not yet looked at
the edac stuff.

The second problem with per-CPU aliases is a burst of identical load
requests created by udevd (as also discussed on the linux-modules mailing
list). The daemon normally processes devices in parallel and as it goes
through the CPUs, it creates a number of requests to load the same per-CPU
module. This problem is present even if the needed module is loaded
successfully. A lot of unnecessary CPU work and memory allocation is
needed to handle each duplicate load.

My thinking is that this second problem is something that should be
addressed in udevd. The same situation can happen for other hardware
devices when they all have the same type and need the same modules. Udevd
could avoid attempting a second parallel load if it recognizes that one of
its workers is already doing the same insert. I hope to have a closer look
at this.

There is clearly some overlap in the sense that if udevd were improved in
this way, it would make this patch less useful. However, even if that gets
implemented at some point, I think it is still good if acpi-cpufreq and
pcc-cpufreq are explicitly attempted to be loaded only once on a given
system.

>> This can result in many unnecessary load requests during
>> boot if another frequency module, such as intel_pstate, is already
>> active.
>
> Perhaps you should mention that in the worst case, without the fix in
> commit 0254127ab977e ("module: Don't wait for GOING modules") some of
> these module load requests could fail and prevent boot, and that
> discussion over these duplicate module reqests ended up in us deciding that
> they are not needed, we just need one.

Ok, I'll add it in v2.

>> For instance, on a typical Intel system, one can observe that
>> udev makes 2x#CPUs attempts to insert acpi_cpufreq and 1x#CPUs attempts
>> for pcc_cpufreq. All these tries then fail if another frequency module
>> is already registered.
>>
>> Both acpi-cpufreq and pcc-cpufreq drivers have their platform firmware
>> interface defined by ACPI. Allowed performance states and parameters
>> must be same for each CPU. This makes it possible to model these
>> interfaces as platform devices.
>>
>> The patch extends the ACPI parsing logic to check the ACPI namespace if
>> the PPC or PCC interface is present and creates a virtual platform
>> device for each if it is available. The acpi-cpufreq and pcc-cpufreq
>> drivers are then updated to map to these devices.
>>
>> This allows to try loading acpi-cpufreq and pcc-cpufreq only once during
>> boot and only if a given interface is available in the firmware.
>
> That could cut boot time too? If so how much?

The following command should provide an approximate measurement of what
happens during the boot:

# perf stat --null --repeat 10 --post 'sleep 5' -- \
  udevadm trigger --action=add --subsystem-match=acpi --sysname-match=LNXCPU* --subsystem-match=cpu --sysname-match=cpu* --settle

My test was with 6.2-rc7, which already contains commit 0254127ab977e
("module: Don't wait for GOING modules"). Kernel modules were signed and
compressed with zstd. The test machines had intel_pstate loaded, so
acpi-cpufreq and pcc-cpufreq always failed to initialize.

Test machine A, with 16 CPUs + 62 GiB memory:
* vanilla: 0.20334 +- 0.00566 seconds time elapsed ( +- 2.79% )
* with the patch: 0.08939 +- 0.00308 seconds time elapsed ( +- 3.44% )

Test machine B, with 288 CPUs + 110 GiB memory:
* vanilla: 1.3439 +- 0.0232 seconds time elapsed ( +- 1.72% )
* with the patch: 0.74916 +- 0.00253 seconds time elapsed ( +- 0.34% )

Udevd processing was parallel, with the maximum number of its workers
being 2*cpu_count+16 in both cases. This means the reduced time roughly
indicates how much less the whole system (all CPUs) needed to be occupied.
The smaller system A sees a reduction of ~100 ms with the patch, while the
bigger system B shows an improvement of ~600 ms.

Thanks,
Petr