Re: [PATCH 00/32] Add support for versioned CPU models

Jiri Denemark <jdenemar@xxxxxxxxxx> · Thu, 21 Nov 2024 16:14:58 +0100

On Wed, Nov 20, 2024 at 14:41:29 +0000, Daniel P. Berrangé wrote:
> On Wed, Nov 20, 2024 at 12:39:17PM +0000, Daniel P. Berrangé wrote:
> > On Tue, Nov 19, 2024 at 07:49:36PM +0100, Jiri Denemark wrote:
> > > When parsing a domain XML which uses a non-versioned CPU model we want
> > > to replace it with the appropriate version variant similarly to what we
> > > do with machine types. Theoretically QEMU supports per machine type
> > > specification of a version with which a non-versioned CPU model is
> > > replaced, but this is always 1 for all machine types and the
> > > query-machines QMP command does not even report the value.
> > > 
> > > Luckily after talking to Igor, having a single number per machine type
> > > does not really allow for setting it to anything but 1 as CPU models
> > > have different number of versions. Each machine type would need to
> > > define a specific version for each CPU model, which would be a
> > > maintenance nightmare. For this reason there's no desire to ever resolve
> > > non-versioned CPU models to anything but v1 in QEMU and the per machine
> > > type setting will most likely even be removed completely. Thus it is
> > > safe for us to always use v1 as the canonical CPU model.
> > > 
> > > Some non-versioned CPU models, however, are actually aliases to specific
> > > versions of a base model rather than being base models themselves. These
> > > are the old CPU model variants before model versions were introduced,
> > > e.g., -noTSX, -IBRS, etc. The mapping of these names to versions is
> > > hardcoded and will never change. We do not translate such CPU models to
> > > the corresponding versioned names. This allows us to introduce the
> > > corresponding -v* variants that match the QEMU models rather than the
> > > existing definitions in our CPU map. The guest CPU will be the same
> > > either way, but the way libvirt checks the CPU model compatibility with
> > > the host will be different. The old "partial" check done by libvirt
> > > using the definition from CPU map will still be used for the old names
> > > (we can't change this for compatibility reasons), but the corresponding
> > > versioned variants (as well as all other versions that do not have a
> > > non-versioned alias) will benefit from the recently introduced new
> > > "partial" check which uses only the information we get from QEMU to
> > > check whether a specific CPU definition is usable on the host.
> > > 
> > > Other I considered were:
> > > - replace -noTSX, -IBRS, ... models with their versioned variants
> > >     - we'd need to translate them back for migration (just what we do
> > >       for -v1) for backward compatibility
> > >     - I found the benefit of new partial checking when explicitly using
> > >       the versioned variants quite appealing and dropped the relevant
> > >       changes in progress
> > > 
> > > - do not translate anything, i.e., not even base models to -v1
> > >     - the idea behind translating was to make sure QEMU suddenly doesn't
> > >       start translating the base CPU model to a different version (this
> > >       does not happen with -noTSX etc. as they are hardcoded aliases);
> > >       Igor said they will never do that so is this still valid?
> > >     - not translating would bring the same benefit of explicitly using
> > >       -v1 vs non-versioned name
> > > 
> > > I guess the current mix does not look very consistent (i.e., it's not
> > > either all or nothing), but it makes sense to me. The question is
> > > whether it also makes sense to others :-)
> > 
> > Yeah, the inconsistency pokes at my brain.
> > 
> > As a slight diversion first, let me point to domcapabilities output
> > 
> > $ virsh domcapabilities --xpath '//model' | grep Skylake-Client
> > <model usable="no" vendor="Intel">Skylake-Client</model>
> > <model usable="no" vendor="Intel">Skylake-Client-IBRS</model>
> > <model usable="no" vendor="Intel">Skylake-Client-noTSX-IBRS</model>
> > <model usable="no" vendor="Intel">Skylake-Client-v1</model>
> > <model usable="no" vendor="Intel">Skylake-Client-v2</model>
> > <model usable="no" vendor="Intel">Skylake-Client-v3</model>
> > <model usable="no" vendor="Intel">Skylake-Client-v4</model>
> > 
> > I'm not a fan of duplicating the the CPU models here.
> > 
> > By comparison for machine types we avoid the duplication thus, by
> > explicitly telling the mgmt app what the aliases are:
> > 
> >   <machine canonical="pc-q35-8.2" maxCpus="1024">q35</machine>
> >   <machine maxCpus="1024">pc-q35-8.1</machine>
> >   <machine maxCpus="1024">pc-q35-8.2</machine>
> >   <machine maxCpus="255">pc-q35-2.4</machine>
> >   <machine maxCpus="255">pc-q35-2.5</machine>
> > 
> > I think we should be exposing to mgmt apps that some CPU model names are
> > merely an alias of another model.
> > 
> > This brings up the question of what we call the "canonical" name. Is
> > "Skylake-Client" canonical, or is "Skylake-Client-v1" canonical ?
> > 
> > ie do we report
> > 
> > $ virsh domcapabilities --xpath '//model' | grep Skylake-Client
> > <model usable="no" vendor="Intel" canonical="Skylake-Client-v1">Skylake-Client</model>
> > <model usable="no" vendor="Intel" canonical="Skylake-Client-v2">Skylake-Client-IBRS</model>
> > <model usable="no" vendor="Intel" canonical="Skylake-Client-v3">Skylake-Client-noTSX-IBRS</model>
> > <model usable="no" vendor="Intel">Skylake-Client-v4</model>
> > 
> > or
> > 
> > $ virsh domcapabilities --xpath '//model' | grep Skylake-Client
> > <model usable="no" vendor="Intel" canonical="Skylake-Client">Skylake-Client-v1</model>
> > <model usable="no" vendor="Intel" canonical="Skylake-Client-IBRS">Skylake-Client-v2</model>
> > <model usable="no" vendor="Intel" canonical="Skylake-Client-noTSX-IBRS">Skylake-Client-v3</model>
> > <model usable="no" vendor="Intel">Skylake-Client-v4</model>
> 
> Looking at query-cpu-definitions it reports:
> 
>         {
>             "alias-of": "Broadwell-v1",
>             "deprecated": false,
>             "migration-safe": true,
>             "name": "Broadwell",
>             "static": false,
>             "typename": "Broadwell-x86_64-cpu",
>             "unavailable-features": [
>                 "pcid",
>                 "tsc-deadline",
>                 "hle",
>                 "invpcid",
>                 "rtm"
>             ]
>         },
> 
> That suggests the "-v1" name is the canonical name, which in turn
> points to using
> 
>  $ virsh domcapabilities --xpath '//model' | grep Skylake-Client
>  <model usable="no" vendor="Intel" canonical="Skylake-Client-v1">Skylake-Client</model>
>  <model usable="no" vendor="Intel" canonical="Skylake-Client-v2">Skylake-Client-IBRS</model>
>  <model usable="no" vendor="Intel" canonical="Skylake-Client-v3">Skylake-Client-noTSX-IBRS</model>
>  <model usable="no" vendor="Intel">Skylake-Client-v4</model>
> 
> which is possibly a good thing from our POV. Existing libvirt mgmt
> applications parsing domcapabilities likely expect our traditional
> CPU names to appear as //model/text(). So by adding the versioned
> name as //model/@canonical apps only have to know about the new
> canonical versioned names if they choose to. Any name matching
> logic they is unlikely to be broken.
> 
> 
> On the other hand, if we match our machine type behaviour wrt domain XML,
> we would translate all names into their canonical (versioned) format. In
> doing so there is a back compatibility risk though, as appps may well not
> be expecting libvirt to canonicalize the name behind their back. We got
> away with it with versioned machine types as that was very early in life.
> Can we get away with canonicalizing CPU names today ?
> 
> If we don't canonicalize domain XML, the canonical names have practically
> zero value to mgmt apps, other than perhaps as a way to show a linear
> progression of changes over time. The latter would imply, however, that
> apps interpret the 'vNNN' part of the name instead of treating the name
> as an opaque string.

The original idea (ages ago) was to canonicalize CPU models. But at that
point we were thinking QEMU would change what a non-version CPU model is
translated to in time. But they don't really do it and have no intent
doing it in the future either. Other -v* that have an alias were never
supposed to change.

So I would say we don't really need to do any canonicalization at all as
the only benefit is not really needed. We'd need to undo it anyway for
migration compatibility, which could actually look strange if done even
on domains that were defined using the canonical name.

Jirka