RFC: New CPU hot(un)plug API and XML

Peter Krempa <pkrempa@xxxxxxxxxx> · Mon, 13 Jun 2016 14:48:51 +0200

Hi list,

I'm planning to add an API that will be used instead of virDomainSetVcpus and
will allow more granular control of which virtual CPUs are enabled for a
guest.

The new approach will make it possible to use cpu hotplug properly with NUMA
guests, as the old APIs do not allow adding CPUs to specific cgroups.

The old APIs should still work fine with the current approach, although the
final implementation should also allow unplugging vcpus from the guest by
using new qemu features.

I'm still not sure whether it will be possible to use this in a backward
compatible fashion, though; that depends on how exactly this will need to be
set up in qemu.

# API #

As for the new API I'm thinking of the following design:

int
virDomainVcpu(virDomainPtr domain,
              unsigned int id,
              unsigned int flags);

 The flags for this API would be the following:
  - usual domain modification impact:
    * VIR_DOMAIN_SET_VCPU_CURRENT
    * VIR_DOMAIN_SET_VCPU_LIVE
    * VIR_DOMAIN_SET_VCPU_CONFIG
  - operation selection; with neither of these set the API would only query
    the cpu state:
    * VIR_DOMAIN_SET_VCPU_ENABLE
    * VIR_DOMAIN_SET_VCPU_DISABLE
  - misc:
    * VIR_DOMAIN_SET_VCPU_GUEST - use the guest agent instead of ACPI hotplug
    * VIR_DOMAIN_SET_VCPU_NUMA_NODE - 'id' is the ID of a numa node where the
      cpu should be enabled/disabled rather than a CPU id. This is a
      convenience flag that allows adding a cpu to a given numa node without
      having to find the correct ID.
    * VIR_DOMAIN_SET_VCPU_CORE - use thread level hotplug (see [1]). This
                                 makes sure that the CPU will be plugged in
                                 on platforms that require plugging in
                                 multiple threads at once.

VIR_DOMAIN_SET_VCPU_NUMA_NODE and VIR_DOMAIN_SET_VCPU_GUEST are mutually
exclusive as the guest agent doesn't report the guest numa node the CPU
belongs to.
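
The flag rules above could be checked roughly as follows. This is only a
sketch: the bit values are hypothetical (the RFC names the flags but assigns
no values, apart from following the usual convention that CURRENT is 0), the
helper names are made up for illustration, and treating ENABLE together with
DISABLE as an error is an assumption rather than something stated above.

```c
#include <stdbool.h>

/* Hypothetical values; only the names come from the RFC. */
enum {
    VIR_DOMAIN_SET_VCPU_CURRENT   = 0,
    VIR_DOMAIN_SET_VCPU_LIVE      = 1 << 0,
    VIR_DOMAIN_SET_VCPU_CONFIG    = 1 << 1,
    VIR_DOMAIN_SET_VCPU_ENABLE    = 1 << 2,
    VIR_DOMAIN_SET_VCPU_DISABLE   = 1 << 3,
    VIR_DOMAIN_SET_VCPU_GUEST     = 1 << 4,
    VIR_DOMAIN_SET_VCPU_NUMA_NODE = 1 << 5,
    VIR_DOMAIN_SET_VCPU_CORE      = 1 << 6,
};

/* Hypothetical helper: true if the flag combination is acceptable. */
static bool
set_vcpu_flags_valid(unsigned int flags)
{
    /* Assumption: asking for both operations at once makes no sense. */
    if ((flags & VIR_DOMAIN_SET_VCPU_ENABLE) &&
        (flags & VIR_DOMAIN_SET_VCPU_DISABLE))
        return false;

    /* From the RFC: the guest agent does not report the numa node of a
     * CPU, so NUMA_NODE and GUEST are mutually exclusive. */
    if ((flags & VIR_DOMAIN_SET_VCPU_NUMA_NODE) &&
        (flags & VIR_DOMAIN_SET_VCPU_GUEST))
        return false;

    return true;
}

/* Hypothetical helper: with neither ENABLE nor DISABLE the call only
 * queries the cpu state. */
static bool
set_vcpu_is_query(unsigned int flags)
{
    return !(flags & (VIR_DOMAIN_SET_VCPU_ENABLE |
                      VIR_DOMAIN_SET_VCPU_DISABLE));
}
```

With these values, LIVE | ENABLE passes the check while
GUEST | NUMA_NODE | ENABLE is rejected.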

If the idea of one API that will both query and set is too nonconformist for
our existing API design, I have no problem adding Get/Set versions and/or
splitting the ENABLE/DISABLE flags out into a separate parameter.

# XML #

The new API will require us to add new XML that will allow tracking the state
of VCPUs individually. Internally we now have a data structure that keeps the
relevant data in one place.

Currently the data relevant to VCPUs is set in many places:

<domain>
  [...]
  <vcpu current='1'>3</vcpu>
  [...]
  <cputune>
    <cpupin ... />
  </cputune>
  [...]
  <cpu>
    <numa>
      <cell id='0' cpus='0' memory='102400' unit='KiB'/>
      <cell id='1' cpus='1-2' memory='102400' unit='KiB'/>
    </numa>

As we'll be required to keep the state for every single cpu I'm thinking of
adding a new subelement called '<vcpus>' to <domain>. This will have a
'<vcpu>' subelement for every configured cpu.
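
As a rough illustration of the per-cpu state this implies, the sketch below
keeps one record per vcpu and formats the minimal variant of the element
(id and state only). The struct layout and the function name are
hypothetical; nothing here is meant to mirror the actual internal data
structure or libvirt's buffer helpers.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical per-vcpu record. */
struct vcpu_state {
    unsigned int id;
    bool enabled;
};

/* Formats a <vcpus> element with one <vcpu> child per record.
 * Returns the number of bytes written (excluding the terminating NUL)
 * or -1 if 'buf' is too small. */
static int
format_vcpus(const struct vcpu_state *vcpus, size_t n,
             char *buf, size_t buflen)
{
    size_t used = 0;
    int rc;

    rc = snprintf(buf, buflen, "<vcpus>\n");
    if (rc < 0 || (size_t)rc >= buflen)
        return -1;
    used = (size_t)rc;

    for (size_t i = 0; i < n; i++) {
        rc = snprintf(buf + used, buflen - used,
                      "  <vcpu id='%u' state='%s'/>\n",
                      vcpus[i].id,
                      vcpus[i].enabled ? "enabled" : "disabled");
        if (rc < 0 || (size_t)rc >= buflen - used)
            return -1;
        used += (size_t)rc;
    }

    rc = snprintf(buf + used, buflen - used, "</vcpus>\n");
    if (rc < 0 || (size_t)rc >= buflen - used)
        return -1;
    used += (size_t)rc;

    return (int)used;
}
```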

I'm specifically not going to add any of the cpupin or numa node ids to
/domain/vcpus/vcpu as input parameters, to avoid introducing the very
complicated checking code that would be required to keep the data in sync.

I'm thinking of adding the numa node id as an output-only attribute since it's
relevant to the hotplug case and it's misplaced otherwise. I can certainly
also add the duplicated data as output-only attributes.
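
One way such an output-only cell value could be derived is by looking the
vcpu id up in the 'cpus' lists of the existing <numa> cells. The sketch
below does that for simple lists only (comma-separated ids and ranges such
as '0' or '1-2'); the struct and function names are hypothetical and it is
not meant to cover the full syntax libvirt accepts.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Returns true if 'cpu' is covered by a list such as "0", "1-2" or
 * "0,3-5". Malformed input is treated as "not contained". */
static bool
cpulist_contains(const char *list, unsigned long cpu)
{
    const char *p = list;

    while (*p) {
        char *end;
        unsigned long first = strtoul(p, &end, 10);
        unsigned long last = first;

        if (end == p)
            return false;           /* no digits where an id was expected */

        if (*end == '-') {          /* range: parse the upper bound */
            p = end + 1;
            last = strtoul(p, &end, 10);
            if (end == p)
                return false;
        }

        if (cpu >= first && cpu <= last)
            return true;

        if (*end == ',')
            end++;                  /* move on to the next list item */
        else if (*end != '\0')
            return false;           /* unexpected character */
        p = end;
    }

    return false;
}

/* Hypothetical view of a <cell> element. */
struct numa_cell {
    unsigned int id;
    const char *cpus;
};

/* Returns the id of the cell containing 'vcpu', or -1 if none does. */
static int
vcpu_cell(const struct numa_cell *cells, size_t ncells, unsigned long vcpu)
{
    for (size_t i = 0; i < ncells; i++) {
        if (cpulist_contains(cells[i].cpus, vcpu))
            return (int)cells[i].id;
    }
    return -1;
}
```

For the two cells used in the examples in this mail (cpus='0' and
cpus='1-2'), vcpu 0 maps to cell 0 and vcpus 1 and 2 map to cell 1.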

The XML with the new elements should look like:

<domain>
  [...]
  <vcpu current='1'>3</vcpu>
  <vcpus>
    <vcpu id='0' state='enabled'/> <!-- option 1, no extra data -->
    <vcpu id='1' state='disabled' cell='1'/> <!-- option 2, just numa node,
                                                  since it's non-obvious -->
    <vcpu id='2' state='disabled' cell='1' pin='1-2' scheduler='...'/>
     <!-- option 3, all the data duplicated -->
  </vcpus>
  [...]
  <cputune>
    <cpupin ... />
  </cputune>
  [...]
  <cpu>
    <numa>
      <cell id='0' cpus='0' memory='102400' unit='KiB'/>
      <cell id='1' cpus='1-2' memory='102400' unit='KiB'/>
    </numa>

# migration #

To ensure migration compatibility a new libvirt will set a new migration
feature flag in cases where a sparse topology was created by any means. Older
versions of libvirt will reject it.
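
A possible check for when that flag would be needed is sketched below. It
assumes that "sparse" means an enabled vcpu follows a disabled one, i.e. the
enabled vcpus no longer form a contiguous block starting at id 0; that
reading and the function name are assumptions, not something settled here.

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true if any enabled vcpu comes after a disabled one.
 * 'enabled[i]' is the state of the vcpu with id 'i'. */
static bool
vcpus_sparse(const bool *enabled, size_t n)
{
    bool seen_disabled = false;

    for (size_t i = 0; i < n; i++) {
        if (!enabled[i])
            seen_disabled = true;
        else if (seen_disabled)
            return true;    /* gap in the enabled range */
    }

    return false;
}
```

A layout like <vcpu current='1'>3</vcpu> with only vcpu 0 enabled is not
sparse under this definition, so it could still be migrated to an older
libvirt.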

As the new cpu data will be ignored by the parser of older libvirt we don't
need to stop formatting them on migration. (fortunately schemas are not
validated during migration)

# qemu/platform implementation caveats #

When starting the VM for the first time it might be necessary to start a
throw-away qemu process to query some details that we'll need to pass in on
the command line. I'm not sure if this is still necessary and I'll try to
avoid it at all costs.

[1] Some architectures (ppc64) don't directly support thread-level hotplug
and thus require us to plug in a whole core, which translates into multiple
threads (8 in the case of POWER8).
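
To illustrate what core granularity means for vcpu ids, the sketch below
expands a single id into the range of ids that would have to be plugged
together. It assumes that the threads of one core have consecutive ids and
that every core has the same number of threads; both are assumptions made
for the example, and the names are hypothetical.

```c
/* Range of vcpu ids that make up one core. */
struct vcpu_range {
    unsigned int first;
    unsigned int count;
};

/* Returns the ids of all threads sharing a core with 'vcpu'.
 * 'threads_per_core' must be non-zero. */
static struct vcpu_range
core_range(unsigned int vcpu, unsigned int threads_per_core)
{
    struct vcpu_range r;

    /* Round down to the first thread of the core. */
    r.first = vcpu - (vcpu % threads_per_core);
    r.count = threads_per_core;
    return r;
}
```

With 8 threads per core, asking for vcpu 10 would expand to ids 8 through
15.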

There may be other, as yet unknown, problems.

Thanks for your feedback.

Peter
