On Wed, 20 Oct 2021 13:07:59 +0200
Michal Prívozník <mprivozn@xxxxxxxxxx> wrote:

> On 10/6/21 3:32 PM, Igor Mammedov wrote:
> > On Thu, 30 Sep 2021 14:08:34 +0200
> > Peter Krempa <pkrempa@xxxxxxxxxx> wrote:
> >
> >> On Tue, Sep 21, 2021 at 16:50:31 +0200, Michal Privoznik wrote:
> >>> QEMU is trying to obsolete -numa node,cpus= because that uses
> >>> ambiguous vCPU id to [socket, die, core, thread] mapping. The new
> >>> form is:
> >>>
> >>> -numa cpu,node-id=N,socket-id=S,die-id=D,core-id=C,thread-id=T
> >>>
> >>> which is repeated for every vCPU and places it at [S, D, C, T]
> >>> into guest NUMA node N.
> >>>
> >>> While in general this is magic mapping, we can deal with it.
> >>> Firstly, with QEMU 2.7 or newer, libvirt ensures that if topology
> >>> is given then maxvcpus must be sockets * dies * cores * threads
> >>> (i.e. there are no 'holes').
> >>> Secondly, if no topology is given then libvirt itself places each
> >>> vCPU into a different socket (basically, it fakes topology of:
> >>> [maxvcpus, 1, 1, 1])
> >>> Thirdly, we can copy whatever QEMU is doing when mapping vCPUs
> >>> onto topology, to make sure vCPUs don't start to move around.
> >>
> >> There's a problem with this premise though and unfortunately we don't
> >> seem to have qemuxml2argvtest for it.
> >>
> >> On PPC64, in certain situations the CPU can be configured such that
> >> threads are visible only to VMs. This has substantial impact on how CPUs
> >> are configured using the modern parameters (until now used only for
> >> cpu hotplug purposes, and that's the reason vCPU hotplug has such
> >> complicated incantations when starting the VM).
> >>
> >> In the above situation a CPU with topology of:
> >> sockets=1, cores=4, threads=8 (thus 32 cpus)
> >>
> >> will only expose 4 CPU "devices".
> >>
> >> core-id: 0, core-id: 8, core-id: 16 and core-id: 24
> >>
> >> yet the guest will correctly see 32 cpus when used as such.
> >>
> >> You can see this in:
> >>
> >> tests/qemuhotplugtestcpus/ppc64-modern-individual-monitor.json
> >>
> >> Also note that the 'props' object does _not_ have any socket-id, and
> >> management apps are supposed to pass in 'props' as is. (There's a bunch
> >> of code to do that on hotplug).
> >>
> >> The problem is that you need to query the topology first (unless we want
> >> to duplicate all of qemu code that has to do with topology state and
> >> keep up with changes to it) to know how it's behaving on current
> >> machine. This historically was not possible. The supposed solution for
> >> this was the pre-config state where we'd be able to query and set it up
> >> via QMP, but I was not keeping up sufficiently with that work, so I
> >> don't know if it's possible.
> >>
> >> If preconfig is a viable option we IMO should start using it sooner
> >> rather than later and avoid duplicating qemu's logic here.
> >
> > using preconfig is preferable variant otherwise libvirt
> > would end up duplicating topology logic which differs not only
> > between targets but also between machine/cpu types.
> >
> > Closest example how to use preconfig is in pc_dynamic_cpu_cfg()
> > test case. Though it uses query-hotpluggable-cpus only for
> > verification, but one can use the command at the preconfig
> > stage to get topology for given -smp/-machine type combination.
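
To illustrate the flow, a rough sketch (the -smp values and the node
mapping here are made up, the output is abridged, and the exact
set-numa-node arguments should be double checked against the QAPI
schema; the idea is the same as what pc_dynamic_cpu_cfg() does):

qemu-system-x86_64 \
    -preconfig \
    -smp 8,sockets=2,dies=1,cores=2,threads=2 \
    -qmp stdio

(QMP greeting elided)
{"execute": "qmp_capabilities"}
{"return": {}}

(ask QEMU what topology it computed for this -smp/-machine combination)
{"execute": "query-hotpluggable-cpus"}
{"return": [{"props": {"core-id": 1, "thread-id": 1, "die-id": 0, "socket-id": 1}, "vcpus-count": 1, "type": "qemu64-x86_64-cpu"}, ... ]}

(declare the guest NUMA nodes and assign vCPUs to them, passing the
returned 'props' back verbatim plus the target node-id; one
set-numa-node call per vCPU)
{"execute": "set-numa-node", "arguments": {"type": "node", "nodeid": 0}}
{"return": {}}
{"execute": "set-numa-node", "arguments": {"type": "cpu", "node-id": 0, "socket-id": 1, "die-id": 0, "core-id": 1, "thread-id": 1}}
{"return": {}}
...

(only after everything is configured, let the machine initialize)
{"execute": "x-exit-preconfig"}
{"return": {}}
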
> Alright, -preconfig should be pretty easy. However, I do have some
> points to raise/ask:
>
> 1) currently, exit-preconfig is marked as experimental (hence its "x-"
> prefix). Before libvirt consumes it, QEMU should make it stable. Is
> there anything that stops QEMU from doing so or is it just a matter of
> sending patches (I volunteer to do that)?

If I recall correctly, it was made experimental due to a lack of actual
users (the assumption was that libvirt would consume it once available,
but that did not happen for quite a long time). So patches to make it a
stable interface should be fine.

>
> 2) In my experiments I try to mimic what libvirt does. Here's my cmd
> line:
>
> qemu-system-x86_64 \
> -S \
> -preconfig \
> -cpu host \
> -smp 120,sockets=2,dies=3,cores=4,threads=5 \
> -object '{"qom-type":"memory-backend-memfd","id":"ram-node0","size":4294967296,"host-nodes":[0],"policy":"bind"}' \
> -numa node,nodeid=0,memdev=ram-node0 \
> -no-user-config \
> -nodefaults \
> -no-shutdown \
> -qmp stdio
>
> and here is my QMP log:
>
> {"QMP": {"version": {"qemu": {"micro": 50, "minor": 1, "major": 6}, "package": "v6.1.0-1552-g362534a643"}, "capabilities": ["oob"]}}
>
> {"execute":"qmp_capabilities"}
> {"return": {}}
>
> {"execute":"query-hotpluggable-cpus"}
> {"return": [{"props": {"core-id": 3, "thread-id": 4, "die-id": 2, "socket-id": 1}, "vcpus-count": 1, "type": "host-x86_64-cpu"}, {"props": {"core-id": 3, "thread-id": 3, "die-id": 2, "socket-id": 1}, "vcpus-count": 1, "type": "host-x86_64-cpu"}, {"props": {"core-id": 3, "thread-id": 2, "die-id": 2, "socket-id": 1}, "vcpus-count": 1, "type": "host-x86_64-cpu"}, {"props": {"core-id": 3, "thread-id": 1, "die-id": 2, "socket-id": 1}, "vcpus-count": 1, "type": "host-x86_64-cpu"}, {"props": {"core-id": 3, "thread-id": 0, "die-id": 2, "socket-id": 1}, "vcpus-count": 1, "type": "host-x86_64-cpu"}, {"props": {"core-id": 2, "thread-id": 4, "die-id": 2, "socket-id": 1}, "vcpus-count": 1, "type": "host-x86_64-cpu"},
> <snip/>
> {"props": {"core-id": 0, "thread-id": 0, "die-id": 0, "socket-id": 0}, "vcpus-count": 1, "type": "host-x86_64-cpu"}]}
>
>
> I can see that query-hotpluggable-cpus returns an array. Can I safely
> assume that vCPU ID == index in the array? I mean, if I did have -numa
> node,cpus=X can I do array[X] to obtain mapping onto Core/Thread/
> Die/Socket which would then be fed to 'set-numa-node' command. If not,
> what is the proper way to do it?

From QEMU's point of view, you should not assume anything about vCPU
ordering within the returned array. It is an internal implementation
detail and subject to change without notice. What you can assume is
that the CPU descriptions in the array will be stable for a given
combination of [machine version, smp option, CPU type].

> And one more thing - if QEMU has to keep vCPU ID mapping code, what's
> the point in obsoleting -numa node,cpus=? In the end it is still QEMU
> who does the ID -> [Core,Thread,Die,Socket] translation but with extra
> steps for mgmt applications.

The point is that cpu_index is ambiguous and it is practically
impossible for a user to tell which vCPU exactly it is dealing with,
unless the user re-implements, and keeps in sync, the topology code for
f(board, machine version, smp option, CPU type).

So even if cpu_index is still used inside of QEMU for other purposes,
the external interfaces and API will use only the consistent topology
tuple [Core,Thread,Die,Socket] to describe and address vCPUs, the same
as device_add does.

> Michal
>
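
PS: to illustrate the device_add parallel above, a made-up example (in
practice the 'type' and the complete 'props' returned by
query-hotpluggable-cpus have to be passed back as is):

{"execute": "query-hotpluggable-cpus"}
{"return": [ ... {"props": {"core-id": 0, "thread-id": 0, "die-id": 0, "socket-id": 1}, "vcpus-count": 1, "type": "host-x86_64-cpu"} ... ]}

{"execute": "device_add", "arguments": {"driver": "host-x86_64-cpu", "id": "vcpu-s1c0t0", "socket-id": 1, "die-id": 0, "core-id": 0, "thread-id": 0}}
{"return": {}}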