Re: [PATCH v7 10/16] i386/cpu: Introduce cluster-id to X86CPU

Xiaoyao Li <xiaoyao.li@xxxxxxxxx> · Wed, 17 Jan 2024 00:40:12 +0800

On 1/15/2024 11:18 PM, Zhao Liu wrote:
Hi Xiaoyao,

On Mon, Jan 15, 2024 at 03:45:58PM +0800, Xiaoyao Li wrote:

On 1/15/2024 1:59 PM, Zhao Liu wrote:
(Also cc "machine core" maintainers.)
Hi Xiaoyao,

On Mon, Jan 15, 2024 at 12:18:17PM +0800, Xiaoyao Li wrote:

On 1/15/2024 11:27 AM, Zhao Liu wrote:
On Sun, Jan 14, 2024 at 09:49:18PM +0800, Xiaoyao Li wrote:

On 1/8/2024 4:27 PM, Zhao Liu wrote:
From: Zhuocheng Ding <zhuocheng.ding@xxxxxxxxx>

Introduce cluster-id rather than module-id to be consistent with
CpuInstanceProperties.cluster-id; this avoids confusion over
parameter names when hotplugging.

I don't think reusing 'cluster' from arm for x86's 'module' is a good idea.
It introduces confusion around the code.

There is a precedent: generic "socket" v.s. i386 "package".

It's not the same thing. "socket" vs "package" is just software people and
hardware people choosing different names. It's purely a naming difference.

No, it's a similar issue. Same physical device, different name only.

Furthermore, the topology was introduced for resource layout and silicon
fabrication, and similar design ideas and fabrication processes are fairly
consistent across common current arches. Therefore, it is possible to
abstract similar topological hierarchies for different arches.

However, the issue here is reusing a name: 'cluster' has already been
defined for x86. That does introduce confusion.

There's nothing fundamentally different between the x86 module and the
generic cluster, is there? This is the reason that I don't agree with
introducing "modules" in -smp.

Generic cluster just means a cluster of processors, i.e., a group of
CPUs/LPs. It is just a middle level between die and core.

Not sure if you mean the "cluster" device for TCG GDB? The "cluster" device
is different from the "cluster" option in -smp.

No, I just mean the word 'cluster'. And I thought what you called
"generic cluster" meant "a cluster of logical processors".

Below I quote the description of Yanan's commit 864c3b5c32f0:

    A cluster generally means a group of CPU cores which share L2 cache
    or other mid-level resources, and it is the shared resources that
    is used to improve scheduler's behavior. From the point of view of
    the size range, it's between CPU die and CPU core. For example, on
    some ARM64 Kunpeng servers, we have 6 clusters in each NUMA node,
    and 4 CPU cores in each cluster. The 4 CPU cores share a separate
    L2 cache and a L3 cache tag, which brings cache affinity advantage.

What I get from it is that a cluster is just a middle level between CPU
die and CPU core. The CPU cores inside one cluster share some mid-level
resource; L2 cache is just one example of such a resource. So on x86 it
could be the module level, the tile level, or even the diegrp level you
mentioned below.

When Yanan introduced the "cluster" option in -smp, he mentioned that it
is for sharing L2 and L3 tags, which roughly corresponds to our module.

It can be the module level on Intel, or the tile level. Further, if the
per-die LP count increases in the future, Intel might add more middle
levels between die and core. At that point, how do we decide which level
cluster should be mapped to?

Currently, there are 3 levels defined in the SDM between die and core:
diegrp, tile and module. In our products, L2 is shared only at the module
level, so Intel's module and the generic cluster are the best match.

You said 'generic cluster' many times. But from my point of view, you are
referring to the current ARM cluster rather than a *generic* cluster.

Anyway, cluster is just a mid-level between die and core. We should not
associate it with any specific resource. The level at which a resource is
shared can change: e.g., initially L3 cache was shared within a physical
package. When multi-die was supported, L3 cache became shared within one
die. Now, on AMD products, L3 cache is shared within one complex, and one
die can have 2 complexes and thus 2 separate L3 caches.

It doesn't matter whether we call it cluster, or module, or xyz. It is
just a name for a CPU topology level between die and core. What matters
is that, once it gets accepted, it becomes formal ABI that 'cluster' means
'module' for x86 users. This is definitely a big source of confusion.
People may try to figure out why, and find that the reason is that
'cluster' means the level at which L2 cache is shared, and on x86 that
happens to be the module level. Maybe in the future "L2 is shared in
module" will change, just like the L3 example I gave above. Then it really
becomes a big confusion, and all of this becomes the "historic reason"
that cluster was chosen to represent module on x86.

There are no commercially available machines for the other levels yet,
so there's no way to know exactly what the future holds, but we should
try to avoid fragmenting the topology hierarchy and keep uniform, common
topology hierarchies in QEMU.

That is, unless some otherwise unsolvable problem comes up in the future
and requires introducing a new level for -smp.

The direct definition of cluster is the level that is above the "core"
and shares the hardware resources including L2. In this sense, arm's
cluster is the same as x86's module.

Then what if Intel implements the tile level in the future? Why is ARM's
'cluster' mapped to 'module', but not 'tile'?

This depends on the actual need.

Module (for x86) and cluster (in general) are similar, and tile (for x86)
is used for L3 in practice, so I use module rather than tile to map
generic cluster.

And, it should be noted that x86 module is mapped to the generic cluster,
not to ARM's. It's just that currently only ARM is using the clusters
option in -smp.

I believe QEMU provides the abstract and unified topology hierarchies in
-smp, not the arch-specific hierarchies.

reusing 'cluster' for 'module' is just a bad idea.

Though different arches have different naming styles, QEMU's generic
code still needs a uniform topology hierarchy.

Generic code can provide as many topology levels as it wants; each arch can
choose to use the ones it supports.

e.g.,

in qapi/machine.json, it says,

# The ordering from highest/coarsest to lowest/finest is:
# @drawers, @books, @sockets, @dies, @clusters, @cores, @threads.

This ordering is well-defined...

#
# Different architectures support different subsets of topology
# containers.
#
# For example, s390x does not have clusters and dies, and the socket
# is the parent container of cores.

we can update it to

# The ordering from highest/coarsest to lowest/finest is:
# @drawers, @books, @sockets, @dies, @clusters, @module, @cores,
# @threads.

...but here it's impossible to figure out why cluster is above module,
and I can't even come up with a difference between cluster and module.

#
# Different architectures support different subsets of topology
# containers.
#
# For example, s390x does not have clusters and dies, and the socket
# is the parent container of cores.
#
# For example, x86 does not have drawers and books, and does not support
# cluster.

Even if x86's cluster is supported someday in the future, we can remove the
ordering requirement from the description above.

x86's cluster is above the package.

If we reserve this name for x86, we can no longer have a well-defined
topology ordering.

But topology ordering is necessary in generic code, and many
calculations depend on the topology ordering.

Could you point me to the code?

Yes, e.g., there are 2 helpers: machine_topo_get_cores_per_socket() and
machine_topo_get_threads_per_socket().

I see. These two helpers are fragile: they need to be updated every time
a new level between core and socket is introduced.
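
To illustrate the fragility, here is a minimal sketch of how helpers of
this kind typically compute their result. It is not the actual QEMU source;
struct topo_sketch and the sketch_* function names are hypothetical
stand-ins, and the assumption is only that each helper multiplies the
counts of the levels below socket.

```c
#include <assert.h>

/* Hypothetical, simplified stand-in for the -smp topology counts. */
struct topo_sketch {
    unsigned int sockets;
    unsigned int dies;
    unsigned int clusters;
    unsigned int cores;
    unsigned int threads;
};

/*
 * Cores per socket: the product of every level from core up to (but not
 * including) socket. Each such level is listed by hand, so adding a new
 * field (e.g. "modules") means this function has to be edited as well.
 */
static unsigned int sketch_cores_per_socket(const struct topo_sketch *t)
{
    assert(t->dies && t->clusters && t->cores);
    return t->cores * t->clusters * t->dies;
}

/* Threads per socket: threads per core times cores per socket. */
static unsigned int sketch_threads_per_socket(const struct topo_sketch *t)
{
    assert(t->threads);
    return t->threads * sketch_cores_per_socket(t);
}
```

With dies=2, clusters=3, cores=4, threads=2, the first helper yields 24 and
the second 48. A level missing from the product would silently make both
results too small.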

Anyway, we can ensure the order per arch, i.e., that the valid levels for
any arch are ordered. E.g., suppose we have

@drawers, @books, @sockets, @dies, @clusters, @module, @cores, @threads

defined,

for s390, the valid levels are

 @drawers, @books, @sockets, @cores, @threads

for arm, the valid levels are

 @sockets, @dies, @clusters, @cores, @threads
 (I'm not sure if die is supported for ARM?)

for x86, the valid levels are

 @sockets, @dies, @module, @cores, @threads

All of them are ordered. Unsupported levels in each arch just get the
value 1. That causes no issue when calculating default values, though the
two functions you mentioned may not be so lucky. Anyway, they can be fixed
when we actually take this approach.
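
The per-arch scheme described above could be sketched roughly as follows.
This is only an illustration under stated assumptions: the enum, the
bitmask encoding, the *_LEVELS macros and topo_total_cpus() are all
hypothetical names, not existing QEMU code, and the ARM mask simply follows
the tentative list above (including the open question about die).

```c
#include <assert.h>
#include <stddef.h>

/* All levels, ordered from coarsest to finest as in the proposed list. */
enum topo_level {
    TOPO_DRAWER, TOPO_BOOK, TOPO_SOCKET, TOPO_DIE,
    TOPO_CLUSTER, TOPO_MODULE, TOPO_CORE, TOPO_THREAD,
    TOPO_LEVEL_COUNT
};

#define LVL(l) (1u << (l))

/* Hypothetical per-arch masks of valid levels, taken from the lists above. */
#define S390_LEVELS (LVL(TOPO_DRAWER) | LVL(TOPO_BOOK) | LVL(TOPO_SOCKET) | \
                     LVL(TOPO_CORE) | LVL(TOPO_THREAD))
#define ARM_LEVELS  (LVL(TOPO_SOCKET) | LVL(TOPO_DIE) | LVL(TOPO_CLUSTER) | \
                     LVL(TOPO_CORE) | LVL(TOPO_THREAD)) /* die: unconfirmed */
#define X86_LEVELS  (LVL(TOPO_SOCKET) | LVL(TOPO_DIE) | LVL(TOPO_MODULE) | \
                     LVL(TOPO_CORE) | LVL(TOPO_THREAD))

/*
 * Total logical CPUs = product of all level counts. A count of 0 means
 * "not given" and is treated as 1. An unsupported level may only be 0 or 1;
 * anything larger is rejected by returning 0.
 */
static unsigned int topo_total_cpus(unsigned int valid_mask,
                                    const unsigned int counts[TOPO_LEVEL_COUNT])
{
    unsigned int total = 1;

    assert(counts != NULL);
    for (int l = 0; l < TOPO_LEVEL_COUNT; l++) {
        unsigned int n = counts[l] ? counts[l] : 1;

        if (!(valid_mask & LVL(l)) && n > 1) {
            return 0; /* level not valid for this arch */
        }
        total *= n;
    }
    return total;
}
```

Because every unsupported level contributes a factor of 1, the same
product works for all arches while each arch still sees its own valid
levels in a fixed order.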

s390 just added 'drawer' and 'book' to its CPU topology [1]. I think we can
also add a module level for x86 instead of reusing cluster.

(This is also what I want to reply to the cover letter.)

[1] https://lore.kernel.org/qemu-devel/20231016183925.2384704-1-nsg@xxxxxxxxxxxxx/

These two new levels have a clear topological hierarchy relationship
and don't duplicate existing ones.

"book" or "drawer" may correspond to Intel's "cluster".

Maybe, in the future, we could support arch-specific alias topologies
in -smp.

I don't think we need aliases. Reusing 'cluster' for 'module' doesn't gain
any benefit except avoiding a new field in SMPConfiguration. All the other
cluster code is ARM-specific and x86 cannot share it.

The point is that there is no difference between Intel's module and the
generic cluster... Considering only the naming issue, even AMD has the
"complex", which corresponds to Intel's "module".

Does AMD's complex really match Intel's module? L3 cache is shared within
one complex, while L2 cache is shared within one module for now.

If so, then it could correspond to Intel's tile, which is after all a level
below die.

So if AMD wants to add complex to the -smp topology, where should the
complex level be put? Between die and cluster?

Thanks,
Zhao