On 10/6/21 3:32 PM, Igor Mammedov wrote:
> On Thu, 30 Sep 2021 14:08:34 +0200
> Peter Krempa <pkrempa@xxxxxxxxxx> wrote:
>
>> On Tue, Sep 21, 2021 at 16:50:31 +0200, Michal Privoznik wrote:
>>> QEMU is trying to obsolete -numa node,cpus= because that uses an
>>> ambiguous vCPU id to [socket, die, core, thread] mapping. The new
>>> form is:
>>>
>>>   -numa cpu,node-id=N,socket-id=S,die-id=D,core-id=C,thread-id=T
>>>
>>> which is repeated for every vCPU and places it at [S, D, C, T]
>>> in guest NUMA node N.
>>>
>>> While in general this is a magic mapping, we can deal with it.
>>> Firstly, with QEMU 2.7 or newer, libvirt ensures that if a topology
>>> is given then maxvcpus must be sockets * dies * cores * threads
>>> (i.e. there are no 'holes').
>>> Secondly, if no topology is given then libvirt itself places each
>>> vCPU into a different socket (basically, it fakes a topology of
>>> [maxvcpus, 1, 1, 1]).
>>> Thirdly, we can copy whatever QEMU is doing when mapping vCPUs
>>> onto the topology, to make sure vCPUs don't start to move around.
>>
>> There's a problem with this premise though, and unfortunately we
>> don't seem to have a qemuxml2argvtest case for it.
>>
>> On PPC64, in certain situations the CPU can be configured such that
>> threads are visible only to VMs. This has a substantial impact on how
>> CPUs are configured using the modern parameters (until now used only
>> for CPU hotplug purposes, and that's the reason vCPU hotplug has such
>> complicated incantations when starting the VM).
>>
>> In the above situation a CPU with a topology of
>>
>>   sockets=1, cores=4, threads=8 (thus 32 cpus)
>>
>> will only expose 4 CPU "devices":
>>
>>   core-id: 0, core-id: 8, core-id: 16 and core-id: 24
>>
>> yet the guest will correctly see 32 cpus when used as such.
>>
>> You can see this in:
>>
>>   tests/qemuhotplugtestcpus/ppc64-modern-individual-monitor.json
>>
>> Also note that the 'props' object does _not_ have any socket-id, and
>> management apps are supposed to pass 'props' in as is. (There's a
>> bunch of code to do that on hotplug.)
>>
>> The problem is that you need to query the topology first (unless we
>> want to duplicate all of the qemu code that deals with topology state
>> and keep up with changes to it) to know how it behaves on the current
>> machine. This historically was not possible. The supposed solution
>> for this was the pre-config state, where we'd be able to query and
>> set it up via QMP, but I was not keeping up sufficiently with that
>> work, so I don't know if it's possible.
>>
>> If preconfig is a viable option we IMO should start using it sooner
>> rather than later and avoid duplicating qemu's logic here.
>
> Using preconfig is the preferable variant; otherwise libvirt would
> end up duplicating topology logic which differs not only between
> targets but also between machine/CPU types.
>
> The closest example of how to use preconfig is the
> pc_dynamic_cpu_cfg() test case. It uses query-hotpluggable-cpus only
> for verification, but one can use the command at the preconfig stage
> to get the topology for a given -smp/-machine type combination.

Alright, -preconfig should be pretty easy. However, I do have some
points to raise/ask:

1) Currently, exit-preconfig is marked as experimental (hence its "x-"
prefix). Before libvirt consumes it, QEMU should make it stable. Is
there anything that stops QEMU from doing so, or is it just a matter of
sending patches (I volunteer to do that)?
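For the record, the overall flow I'd expect libvirt to implement would
be something like the following (a rough sketch only; the IDs in the
set-numa-node call are placeholders, and I'm using the current
x-exit-preconfig spelling, which would change once the command is
stabilized):

  {"execute": "qmp_capabilities"}
  {"execute": "query-hotpluggable-cpus"}
    ... compute the vCPU -> [socket, die, core, thread] assignment
        from the returned 'props' ...
  {"execute": "set-numa-node", "arguments": {"type": "cpu", "node-id": 0, "socket-id": 0, "die-id": 0, "core-id": 0, "thread-id": 0}}
    ... one set-numa-node call per vCPU ...
  {"execute": "x-exit-preconfig"}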
2) In my experiments I'm trying to mimic what libvirt does. Here's my
cmd line:

  qemu-system-x86_64 \
    -S \
    -preconfig \
    -cpu host \
    -smp 120,sockets=2,dies=3,cores=4,threads=5 \
    -object '{"qom-type":"memory-backend-memfd","id":"ram-node0","size":4294967296,"host-nodes":[0],"policy":"bind"}' \
    -numa node,nodeid=0,memdev=ram-node0 \
    -no-user-config \
    -nodefaults \
    -no-shutdown \
    -qmp stdio

and here is my QMP log:

  {"QMP": {"version": {"qemu": {"micro": 50, "minor": 1, "major": 6}, "package": "v6.1.0-1552-g362534a643"}, "capabilities": ["oob"]}}
  {"execute":"qmp_capabilities"}
  {"return": {}}
  {"execute":"query-hotpluggable-cpus"}
  {"return": [
    {"props": {"core-id": 3, "thread-id": 4, "die-id": 2, "socket-id": 1}, "vcpus-count": 1, "type": "host-x86_64-cpu"},
    {"props": {"core-id": 3, "thread-id": 3, "die-id": 2, "socket-id": 1}, "vcpus-count": 1, "type": "host-x86_64-cpu"},
    {"props": {"core-id": 3, "thread-id": 2, "die-id": 2, "socket-id": 1}, "vcpus-count": 1, "type": "host-x86_64-cpu"},
    {"props": {"core-id": 3, "thread-id": 1, "die-id": 2, "socket-id": 1}, "vcpus-count": 1, "type": "host-x86_64-cpu"},
    {"props": {"core-id": 3, "thread-id": 0, "die-id": 2, "socket-id": 1}, "vcpus-count": 1, "type": "host-x86_64-cpu"},
    {"props": {"core-id": 2, "thread-id": 4, "die-id": 2, "socket-id": 1}, "vcpus-count": 1, "type": "host-x86_64-cpu"},
    <snip/>
    {"props": {"core-id": 0, "thread-id": 0, "die-id": 0, "socket-id": 0}, "vcpus-count": 1, "type": "host-x86_64-cpu"}]}

I can see that query-hotpluggable-cpus returns an array. Can I safely
assume that vCPU ID == index in the array? I mean, if I had
-numa node,cpus=X, can I do array[X] to obtain the mapping onto
Core/Thread/Die/Socket, which would then be fed to the 'set-numa-node'
command? If not, what is the proper way to do it?

And one more thing: if QEMU has to keep the vCPU ID mapping code,
what's the point in obsoleting -numa node,cpus=? In the end it is
still QEMU who does the ID -> [Core,Thread,Die,Socket] translation,
just with extra steps for mgmt applications.

Michal
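P.S. For the ppc64 case Peter mentions, the hotpluggable-cpus entries
look roughly like this (quoting from memory; the exact data is in the
test file he cites, and the type name may differ):

  {"props": {"core-id": 0}, "vcpus-count": 8, "type": "host-spapr-cpu-core"}
  {"props": {"core-id": 8}, "vcpus-count": 8, "type": "host-spapr-cpu-core"}
  {"props": {"core-id": 16}, "vcpus-count": 8, "type": "host-spapr-cpu-core"}
  {"props": {"core-id": 24}, "vcpus-count": 8, "type": "host-spapr-cpu-core"}

i.e. one entry per core and no socket-id/die-id/thread-id at all, which
is why any mapping scheme has to take 'props' as the source of truth
rather than computing the IDs itself.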