Re: [PATCH 4/5] qemu: Prefer -numa cpu over -numa node,cpus=

Igor Mammedov <imammedo@xxxxxxxxxx> · Tue, 26 May 2020 16:51:16 +0200

On Mon, 25 May 2020 10:05:08 +0200
Michal Privoznik <mprivozn@xxxxxxxxxx> wrote:

> On 5/22/20 7:18 PM, Igor Mammedov wrote:
> > On Fri, 22 May 2020 18:28:31 +0200
> > Michal Privoznik <mprivozn@xxxxxxxxxx> wrote:
> >   
> >> On 5/22/20 6:07 PM, Igor Mammedov wrote:  
> >>> On Fri, 22 May 2020 16:14:14 +0200
> >>> Michal Privoznik <mprivozn@xxxxxxxxxx> wrote:
> >>>      
> >>>> QEMU is trying to obsolete -numa node,cpus= because that uses
> >>>> ambiguous vCPU id to [socket, die, core, thread] mapping. The new
> >>>> form is:
> >>>>
> >>>>     -numa cpu,node-id=N,socket-id=S,die-id=D,core-id=C,thread-id=T
> >>>>
> >>>> which is repeated for every vCPU and places it at [S, D, C, T]
> >>>> into guest NUMA node N.
> >>>>
> >>>> While in general this is magic mapping, we can deal with it.
> >>>> Firstly, with QEMU 2.7 or newer, libvirt ensures that if topology
> >>>> is given then maxvcpus must be sockets * dies * cores * threads
> >>>> (i.e. there are no 'holes').
> >>>> Secondly, if no topology is given then libvirt itself places each
> >>>> vCPU into a different socket (basically, it fakes topology of:
> >>>> [maxvcpus, 1, 1, 1])
> >>>> Thirdly, we can copy whatever QEMU is doing when mapping vCPUs
> >>>> onto topology, to make sure vCPUs don't start to move around.
> >>>>
> >>>> Note, migration from old to new cmd line works and therefore
> >>>> doesn't need any special handling.
> >>>>
> >>>> Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1678085
> >>>>
> >>>> Signed-off-by: Michal Privoznik <mprivozn@xxxxxxxxxx>
> >>>> ---
> >>>>    src/qemu/qemu_command.c                       | 108 +++++++++++++++++-
> >>>>    .../hugepages-nvdimm.x86_64-latest.args       |   4 +-
> >>>>    ...memory-default-hugepage.x86_64-latest.args |  10 +-
> >>>>    .../memfd-memory-numa.x86_64-latest.args      |  10 +-
> >>>>    ...y-hotplug-nvdimm-access.x86_64-latest.args |   4 +-
> >>>>    ...ry-hotplug-nvdimm-align.x86_64-latest.args |   4 +-
> >>>>    ...ry-hotplug-nvdimm-label.x86_64-latest.args |   4 +-
> >>>>    ...ory-hotplug-nvdimm-pmem.x86_64-latest.args |   4 +-
> >>>>    ...ory-hotplug-nvdimm-ppc64.ppc64-latest.args |   4 +-
> >>>>    ...hotplug-nvdimm-readonly.x86_64-latest.args |   4 +-
> >>>>    .../memory-hotplug-nvdimm.x86_64-latest.args  |   4 +-
> >>>>    ...vhost-user-fs-fd-memory.x86_64-latest.args |   4 +-
> >>>>    ...vhost-user-fs-hugepages.x86_64-latest.args |   4 +-
> >>>>    ...host-user-gpu-secondary.x86_64-latest.args |   3 +-
> >>>>    .../vhost-user-vga.x86_64-latest.args         |   3 +-
> >>>>    15 files changed, 158 insertions(+), 16 deletions(-)
> >>>>
> >>>> diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c
> >>>> index 7d84fd8b5e..0de4fe4905 100644
> >>>> --- a/src/qemu/qemu_command.c
> >>>> +++ b/src/qemu/qemu_command.c
> >>>> @@ -7079,6 +7079,91 @@ qemuBuildNumaOldCPUs(virBufferPtr buf,
> >>>>    }
> >>>>    
> >>>>    
> >>>> +/**
> >>>> + * qemuTranlsatevCPUID:
> >>>> + *
> >>>> + * For given vCPU @id and vCPU topology (@cpu) compute corresponding
> >>>> + * @socket, @die, @core and @thread). This assumes linear topology,
> >>>> + * that is every [socket, die, core, thread] combination is valid vCPU
> >>>> + * ID and there are no 'holes'. This is ensured by
> >>>> + * qemuValidateDomainDef() if QEMU_CAPS_QUERY_HOTPLUGGABLE_CPUS is
> >>>> + * set.  
> >>> I wouldn't make this assumption, each machine can have (and has) it's own layout,
> >>> and now it's not hard to change that per machine version if necessary.
> >>>
> >>> I'd suppose one could pull the list of possible CPUs from QEMU started
> >>> in preconfig mode with desired -smp x,y,z using QUERY_HOTPLUGGABLE_CPUS
> >>> and then continue to configure numa with QMP commands using provided
> >>> CPUs layout.  
> >>
> >> Continue where? At the 'preconfig mode' the guest is already started,
> >> isn't it? Are you suggesting that libvirt starts a dummy QEMU process,
> >> fetches the CPU topology from it an then starts if for real? Libvirt  
> > QEMU started but it's very far from starting guest, at that time it's possible
> > configure numa mapping at runtime and continue to -S or running state
> > without restarting QEMU. For the follow up starts, used topology and numa options
> > can be cached and reused at CLI time as long as machine/-smp combination stays
> > the same.  
> 
> This is a problem. The domain XML that is provided can't be changed, 
> mostly because mgmt apps construct it on the fly and then just pass it 
> as a RO string to libvirt. While libvirt could create a separate cache, 
> there has to be a better way.
> 
> I mean, I can add some more code that once the guest is running 
> preserves the mapping during migration. But that assumes a running QEMU. 
> When starting a domain from scratch, is it acceptable it vCPU topology 
> changes? I suspect it is not.
I'm not sure I got you but
vCPU topology isn't changnig but when starting QEMU, user has to map
'concrete vCPUs' to spencific numa nodes. The issue here is that
to specify concrete vCPUs user needs to get layout from QEMU first
as it's a function of target/machine/-smp and possibly cpu type.

that applies not only '-numa cpu' but also to -device cpufoo,
that's why query-hotpluggable-cpus was introduced to let
user get the list of possible CPUs (including topo properties needed to
create them) for a given set of CLI options.

If I recall right libvirt uses topo properies during cpu hotplug but
treats it mainly as opaqueue info so it could feed it back to QEMU.

> >> tries to avoid that as much as it can.
> >>  
> >>>
> >>> How to present it to libvirt user I'm not sure (give them that list perhaps
> >>> and let select from it???)  
> >>
> >> This is what I am trying to figure out in the cover letter. Maybe we
> >> need to let users configure the topology (well, vCPU id to [socket, die,
> >> core, thread] mapping), but then again, in my testing the guest ignored
> >> that and displayed different topology (true, I was testing with -cpu
> >> host, so maybe that's why).  
> > there is ongiong issue with EPYC VCPUs topology, but I otherwise it should work.
> > Just report bug to qemu-devel, if it's broken.
> >     
> >>  
> >>> But it's irrelevant, to the patch, magical IDs for socket/core/...whatever
> >>> should not be generated by libvirt anymore, but rather taken from QEMU for given
> >>> machine + -smp combination.  
> >>
> >> Taken when? We can do this for running machines, but not for freshly
> >> started ones, can we?  
> > 
> > it can be used for freshly started as well,
> > QEMU -S -preconfig -M pc -smp ...
> > (QMP) query-hotpluggable-cpus
> > (QMP) set-numa-node ...
> > ...
> > (QMP) exit-preconfig
> > (QMP) other stuff libvirt does (like hot-plugging CPUs , ...)
> > (QMP) cont  
> 
> I'm not sure this works. query-hotpluggable-cpus does not map vCPU ID 
> <-> socket/core/thread, For '-smp 2,sockets=2,cores=1,threads=1' the 
> 'query-hotpluggable-cpus' returns:
> 
> {"return": [{"props": {"core-id": 0, "thread-id": 0, "socket-id": 1}, 
> "vcpus-count": 1, "type": "qemu64-x86_64-cpu"}, {"props": {"core-id": 0, 
> "thread-id": 0, "socket-id": 0}, "vcpus-count": 1, "type": 
> "qemu64-x86_64-cpu"}]}

that's the list I was taling about, which is implicitly ordered by cpu_index

> And 'query-cpus' or 'query-cpus-fast' which map vCPU ID onto 
> socket/core/thread are not allowed in preconfig state.
these 2 commands apply to present cpu only, if I'm not mistaken.
query-hotpluggable-cpus shows not only present but also CPUs that
could be hotplugged with device_add or used with -device.

> But if I take a step back, the whole point of deprecating -numa 
> node,cpus= is that QEMU no longer wants to do vCPU ID <-> 
> socket/core/thread mapping because it's ambiguous. So it feels a bit 
> weird to design a solution where libvirt would ask QEMU to provide the 
> mapping only so that it can be configured back. Not only because of the 
> extra step, but also because QEMU can't then remove the mapping anyway. 
> I might be misunderstanding the issue though.
if '-numa node,cpus' is removed, we no longer will be using cpu_index as
configuration interface with user, that would allow QEMU start pruning
it from HMP/QMP interfaces and then probably remove it internally.
(I haven't explored yet if we could get rid of it completely but
I'd expect migration stream would be the only reason to keep it intrenally).

I'm quite reluctant to add cpu_index to modern query-hotpluggable-cpus output,
since the whole goal is to get rid of the index, which don't actually work
with SPAPR where CPU entity is a core and threads are internal impl. detail
(while cpu_index has 1:1 mapping with threads).

However if it will let QEMU to drop '-numa node,cpus=', we can discuss
adding optional 'x-cpu-index' to query-hotpluggable-cpus, that will be available
for old machine types for the sole purpose to help libvirt map old CLI to new one.
New machines shouldn't care about index though, since they should be using
'-numa cpu'.

PS:
Since libvirt would like to avoid restarting QEMU, after discussing between
libvirt and QEMU floks, how to approach chicken/egg problem (started qemu vs
cpu descriptors being available after start), -preconfig mode (however ugly) was
introduced as a compromise.

It was discussed on qemu-devel under subject
"enable numa configuration before  machine_init() from HMP/QMP"
which started with RFC and eventually was merged as v7.

I hope above clarifies questions you (would) have.
> 
> Michal
>