Re: [PATCH v7 02/16] i386/cpu: Use APIC ID offset to encode cache topo in CPUID[4]

Zhao Liu <zhao1.liu@xxxxxxxxxxxxxxx> · Mon, 15 Jan 2024 11:04:33 +0800

Hi Xiaoyao,

On Sun, Jan 14, 2024 at 10:11:59PM +0800, Xiaoyao Li wrote:
> Date: Sun, 14 Jan 2024 22:11:59 +0800
> From: Xiaoyao Li <xiaoyao.li@xxxxxxxxx>
> Subject: Re: [PATCH v7 02/16] i386/cpu: Use APIC ID offset to encode cache
>  topo in CPUID[4]
> 
> On 1/11/2024 4:43 PM, Zhao Liu wrote:
> > Hi Xiaoyao,
> > 
> > On Wed, Jan 10, 2024 at 05:31:28PM +0800, Xiaoyao Li wrote:
> > > Date: Wed, 10 Jan 2024 17:31:28 +0800
> > > From: Xiaoyao Li <xiaoyao.li@xxxxxxxxx>
> > > Subject: Re: [PATCH v7 02/16] i386/cpu: Use APIC ID offset to encode cache
> > >   topo in CPUID[4]
> > > 
> > > On 1/8/2024 4:27 PM, Zhao Liu wrote:
> > > > From: Zhao Liu <zhao1.liu@xxxxxxxxx>
> > > > 
> > > > Refer to the fixes of cache_info_passthrough ([1], [2]) and SDM, the
> > > > CPUID.04H:EAX[bits 25:14] and CPUID.04H:EAX[bits 31:26] should use the
> > > > nearest power-of-2 integer.
> > > > 
> > > > The nearest power-of-2 integer can be calculated by pow2ceil() or by
> > > > using APIC ID offset (like L3 topology using 1 << die_offset [3]).
> > > > 
> > > > But in fact, CPUID.04H:EAX[bits 25:14] and CPUID.04H:EAX[bits 31:26]
> > > > are associated with APIC ID. For example, in linux kernel, the field
> > > > "num_threads_sharing" (Bits 25 - 14) is parsed with APIC ID.
> > > 
> > > And for
> > > > another example, on Alder Lake P, the CPUID.04H:EAX[bits 31:26] is not
> > > > matched with actual core numbers and it's calculated by:
> > > > "(1 << (pkg_offset - core_offset)) - 1".
> > > 
> > > could you elaborate it more? what is the value of actual core numbers on
> > > Alder lake P? and what is the pkg_offset and core_offset?
> > 
> > For example, the following's the CPUID dump of an ADL-S machine:
> > 
> > CPUID.04H:
> > 
> > 0x00000004 0x00: eax=0xfc004121 ebx=0x01c0003f ecx=0x0000003f edx=0x00000000
> > 0x00000004 0x01: eax=0xfc004122 ebx=0x01c0003f ecx=0x0000007f edx=0x00000000
> > 0x00000004 0x02: eax=0xfc01c143 ebx=0x03c0003f ecx=0x000007ff edx=0x00000000
> > 0x00000004 0x03: eax=0xfc1fc163 ebx=0x0240003f ecx=0x00009fff edx=0x00000004
> > 0x00000004 0x04: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
> > 
> > 
> > CPUID.1FH:
> > 
> > 0x0000001f 0x00: eax=0x00000001 ebx=0x00000001 ecx=0x00000100 edx=0x0000004c
> > 0x0000001f 0x01: eax=0x00000007 ebx=0x00000014 ecx=0x00000201 edx=0x0000004c
> > 0x0000001f 0x02: eax=0x00000000 ebx=0x00000000 ecx=0x00000002 edx=0x0000004c
> > 
> > The CPUID.04H:EAX[bits 31:26] is 63.
> >  From CPUID.1FH.00H:EAX[bits 04:00], the core_offset is 1, and from
> > CPUID.1FH.01H:EAX[bits 04:00], the pkg_offset is 7.
> > 
> > Thus we can verify that the above equation as:
> > 
> > 1 << (0x7 - 0x1) - 1 = 63.
> > 
> > "Maximum number of addressable IDs" refers to the maximum number of IDs
> > that can be enumerated in the APIC ID's topology layout, which does not
> > necessarily correspond to the actual number of topology domains.
> > 
> > > 
> > > > Therefore the offset of APIC ID should be preferred to calculate nearest
> > > > power-of-2 integer for CPUID.04H:EAX[bits 25:14] and CPUID.04H:EAX[bits
> > > > 31:26]:
> > > > 1. d/i cache is shared in a core, 1 << core_offset should be used
> > > >      instand of "cs->nr_threads" in encode_cache_cpuid4() for
> > > 
> > > /s/instand/instead
> > 
> > Thanks!
> > 
> > > 
> > > >      CPUID.04H.00H:EAX[bits 25:14] and CPUID.04H.01H:EAX[bits 25:14].
> > > > 2. L2 cache is supposed to be shared in a core as for now, thereby
> > > >      1 << core_offset should also be used instand of "cs->nr_threads" in
> > > 
> > > ditto
> > 
> > Okay.
> > 
> > > 
> > > >      encode_cache_cpuid4() for CPUID.04H.02H:EAX[bits 25:14].
> > > > 3. Similarly, the value for CPUID.04H:EAX[bits 31:26] should also be
> > > >      calculated with the bit width between the Package and SMT levels in
> > > >      the APIC ID (1 << (pkg_offset - core_offset) - 1).
> > > > 
> > > > In addition, use APIC ID offset to replace "pow2ceil()" for
> > > > cache_info_passthrough case.
> > > > 
> > > > [1]: efb3934adf9e ("x86: cpu: make sure number of addressable IDs for processor cores meets the spec")
> > > > [2]: d7caf13b5fcf ("x86: cpu: fixup number of addressable IDs for logical processors sharing cache")
> > > > [3]: d65af288a84d ("i386: Update new x86_apicid parsing rules with die_offset support")
> > > > 
> > > > Fixes: 7e3482f82480 ("i386: Helpers to encode cache information consistently")
> > > > Suggested-by: Robert Hoo <robert.hu@xxxxxxxxxxxxxxx>
> > > > Signed-off-by: Zhao Liu <zhao1.liu@xxxxxxxxx>
> > > > Tested-by: Babu Moger <babu.moger@xxxxxxx>
> > > > Tested-by: Yongwei Ma <yongwei.ma@xxxxxxxxx>
> > > > Acked-by: Michael S. Tsirkin <mst@xxxxxxxxxx>
> > > > ---
> > > > Changes since v3:
> > > >    * Fix compile warnings. (Babu)
> > > >    * Fix spelling typo.
> > > > 
> > > > Changes since v1:
> > > >    * Use APIC ID offset to replace "pow2ceil()" for cache_info_passthrough
> > > >      case. (Yanan)
> > > >    * Split the L1 cache fix into a separate patch.
> > > >    * Rename the title of this patch (the original is "i386/cpu: Fix number
> > > >      of addressable IDs in CPUID.04H").
> > > > ---
> > > >    target/i386/cpu.c | 30 +++++++++++++++++++++++-------
> > > >    1 file changed, 23 insertions(+), 7 deletions(-)
> > > > 
> > > > diff --git a/target/i386/cpu.c b/target/i386/cpu.c
> > > > index 5a3678a789cf..c8d2a585723a 100644
> > > > --- a/target/i386/cpu.c
> > > > +++ b/target/i386/cpu.c
> > > > @@ -6014,7 +6014,6 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
> > > >    {
> > > >        X86CPU *cpu = env_archcpu(env);
> > > >        CPUState *cs = env_cpu(env);
> > > > -    uint32_t die_offset;
> > > >        uint32_t limit;
> > > >        uint32_t signature[3];
> > > >        X86CPUTopoInfo topo_info;
> > > > @@ -6098,39 +6097,56 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
> > > >                    int host_vcpus_per_cache = 1 + ((*eax & 0x3FFC000) >> 14);
> > > >                    int vcpus_per_socket = cs->nr_cores * cs->nr_threads;
> > > >                    if (cs->nr_cores > 1) {
> > > > +                    int addressable_cores_offset =
> > > > +                                                apicid_pkg_offset(&topo_info) -
> > > > +                                                apicid_core_offset(&topo_info);
> > > > +
> > > >                        *eax &= ~0xFC000000;
> > > > -                    *eax |= (pow2ceil(cs->nr_cores) - 1) << 26;
> > > > +                    *eax |= (1 << (addressable_cores_offset - 1)) << 26;
> > > 
> > > it should be ((1 << addressable_cores_offset) - 1) << 26
> > 
> > Good catch! The helper wrapped in a subsequent patch masks the error here.
> > 
> > > 
> > > I think naming it addressable_cores_width is better than
> > > addressable_cores_offset. It's not offset because offset means the bit
> > > position from bit 0.
> > 
> > I agree, "width" is better.
> > 
> > > 
> > > And we can get the width by another algorithm:
> > > 
> > > int addressable_cores_width = apicid_core_width(&topo_info) +
> > > apicid_die_width(&topo_info);
> > > *eax |= ((1 << addressable_cores_width) - 1)) << 26;
> > 
> > This algorithm lacks flexibility because there will be more topology
> > levels between package and core, such as the cluster being introduced...
> > 
> > Using "addressable_cores_width" is clear enough.
> > 
> > > 		
> > > >                    }
> > > >                    if (host_vcpus_per_cache > vcpus_per_socket) {
> > > > +                    int pkg_offset = apicid_pkg_offset(&topo_info);
> > > > +
> > > >                        *eax &= ~0x3FFC000;
> > > > -                    *eax |= (pow2ceil(vcpus_per_socket) - 1) << 14;
> > > > +                    *eax |= (1 << (pkg_offset - 1)) << 14;
> > > 
> > > Ditto, ((1 << pkg_offset) - 1) << 14
> > 
> > Thanks!
> > 
> > > 
> > > For this one, I think pow2ceil(vcpus_per_socket) is better. Because it's
> > > intuitive that when host_vcpus_per_cache > vcpus_per_socket, we expose
> > > vcpus_per_cache (configured by users) to VM.
> > 
> > I tend to use a uniform calculation that is less confusing and easier to
> > maintain.
> 
> less confusing?
> 
> the original code is
> 
> 	if (host_vcpus_per_cache > vcpus_per_socket) {
> 		*eax |= (pow2ceil(vcpus_per_socket) - 1) << 14;
> 	}
> 
> and this patch is going to change it to
> 
> 	if (host_vcpus_per_cache > vcpus_per_socket) {
> 		int pkg_offset = apicid_pkg_offset(&topo_info);
> 		*eax |= (1 << pkg_offset - 1)) << 14;
> 	}
> 
> Apparently, the former is clearer that everyone knows what is wants to do is
> "when guest's total vcpus_per_socket is even smaller than host's
> vcpu_per_cache, using guest's configuration". While the latter is more
> confusing.

IMO, the only differences are the variable naming and the way the
details are encoded, what is actually trying to be expressed is the
same - both set the cache topology at the package level.

There is no reason to use two encoding ways for the same field, and
it'll be a code maintenance disaster.

I can add comment here to allay your concern.

> 
> > Since this field encodes "Maximum number of addressable IDs",
> > OS can't get the exact number of CPUs/vCPUs sharing L3 from here, it can
> > only know that L3 is shared at the package level.
> 
> It doesn't matter with L3. What the code want to fulfill is that,

Yes, I misremembered here.

> 
> host_vcpus_per_cache is the actual number of LPs that share this level of
> cache. While vcpus_per_socket is the maximum numbere of LPs that can share a
> cache (at any level) in guest. When guest's maximum number is even smaller
> than host's, use guest's value.
> 

>From the Guest's view, the cache is shared at package level. In hardware,
this one field only reflects the topology level and does not accurately
reflect the number of sharing CPUs.

So, we just need to make it clear that in this case the Guest cache
topology level is package.

Thanks,
Zhao