Re: RFC: New CPU hot(un)plug API and XML

Martin Kletzander <mkletzan@xxxxxxxxxx> · Mon, 13 Jun 2016 15:54:39 +0200

On Mon, Jun 13, 2016 at 02:48:51PM +0200, Peter Krempa wrote:
Hi list,

I'm planning to add an API that will be used instead of virDomainSetVcpus and
will allow a more granular control of which virtual CPUs are enabled for a
guest.

The new approach will make it possible to use cpu hotplug properly with NUMA
guests, as the old APIs would not allow adding CPUs to very specific cgroups.

Great!  We need that... Er, mgmt apps need that =)

The old APIs should still work fine with the current approach although the
final implementation should also allow to unplug vcpus from the guest by using
new qemu features.

I'm still not sure whether it will be possible to use this in a backward
compatible fashion, though, depending on how exactly this stuff will need
to be set up in qemu.

If the worst comes to the worst, we can say the old API is deprecated
and it'll just do basic things (as it does now).  I haven't studied the
code like you probably did for some time before sending this, but I don't
see that it should cause some major problems.

# API #

As for the new API I'm thinking of the following design:

int
virDomainVcpu(virDomainPtr domain,
             unsigned int id,
             unsigned int flags);

The flags for this API would be the following:
 - usual domain modification impact:
   * VIR_DOMAIN_SET_VCPU_CURRENT
   * VIR_DOMAIN_SET_VCPU_LIVE
   * VIR_DOMAIN_SET_VCPU_CONFIG
 - for specifying the operation as the default operation would query the cpu
   state:
   * VIR_DOMAIN_SET_VCPU_ENABLE
   * VIR_DOMAIN_SET_VCPU_DISABLE
 - misc:
   * VIR_DOMAIN_SET_VCPU_GUEST - use the guest agent instead of ACPI hotplug
   * VIR_DOMAIN_SET_VCPU_NUMA_NODE - 'id' is the ID of a numa node where the
     cpu should be enabled/disabled rather than CPU id. This is a convenience
     flag that will allow to add cpu to a given numa node rather than having
     to find the correct ID.
   * VIR_DOMAIN_SET_VCPU_CORE - use thread level hotplug (see [1]). This
                                makes sure that the CPU will be plugged in
                                on platforms that require to plug in multiple
                                threads at once.

VIR_DOMAIN_SET_VCPU_NUMA_NODE and VIR_DOMAIN_SET_VCPU_GUEST are mutually
exclusive as the guest agent doesn't report the guest numa node the CPU
belongs to.
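
To make the flag semantics above concrete, here is a minimal C sketch of how
the combinations could be checked. The numeric flag values and the helper
names vcpu_flags_valid() and vcpu_flags_operation() are hypothetical: the
proposal names the flags but assigns no values, and none of this is actual
libvirt code. Rejecting ENABLE together with DISABLE is also an assumption;
the text only states that NUMA_NODE and GUEST exclude each other.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values: the proposal names the flags but assigns no numbers. */
enum {
    VIR_DOMAIN_SET_VCPU_CURRENT   = 0,
    VIR_DOMAIN_SET_VCPU_LIVE      = 1 << 0,
    VIR_DOMAIN_SET_VCPU_CONFIG    = 1 << 1,
    VIR_DOMAIN_SET_VCPU_ENABLE    = 1 << 2,
    VIR_DOMAIN_SET_VCPU_DISABLE   = 1 << 3,
    VIR_DOMAIN_SET_VCPU_GUEST     = 1 << 4,
    VIR_DOMAIN_SET_VCPU_NUMA_NODE = 1 << 5,
    VIR_DOMAIN_SET_VCPU_CORE      = 1 << 6
};

/* Hypothetical decoding of what the caller asked for. */
enum vcpu_op { VCPU_OP_QUERY, VCPU_OP_ENABLE, VCPU_OP_DISABLE };

/* Returns true if the flag combination is coherent. */
static bool vcpu_flags_valid(unsigned int flags)
{
    const unsigned int op = VIR_DOMAIN_SET_VCPU_ENABLE |
                            VIR_DOMAIN_SET_VCPU_DISABLE;
    const unsigned int excl = VIR_DOMAIN_SET_VCPU_GUEST |
                              VIR_DOMAIN_SET_VCPU_NUMA_NODE;

    /* Assumption: ENABLE and DISABLE request opposite operations. */
    if ((flags & op) == op)
        return false;

    /* From the proposal: the guest agent does not report NUMA nodes. */
    if ((flags & excl) == excl)
        return false;

    return true;
}

/* With neither ENABLE nor DISABLE set, the call queries the cpu state. */
static enum vcpu_op vcpu_flags_operation(unsigned int flags)
{
    assert(vcpu_flags_valid(flags));

    if (flags & VIR_DOMAIN_SET_VCPU_ENABLE)
        return VCPU_OP_ENABLE;
    if (flags & VIR_DOMAIN_SET_VCPU_DISABLE)
        return VCPU_OP_DISABLE;
    return VCPU_OP_QUERY;
}
```

CURRENT = 0, LIVE = 1 and CONFIG = 2 mirror the existing
VIR_DOMAIN_AFFECT_* values; everything past that is a guess.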

So since the agent can only receive number of vcpus then no new feature
will be usable with this flag until that command is added to the ga,
right?  Does it make sense to have this flag for the new API then?

If the idea of one API that will both query and set is too nonconformist to
our existing API design I have no problem adding Get/Set versions and/or
exploding the ADD/REMOVE flags into a separate parameter.

I thought there already was a consensus reached about what should be the
default choice for new APIs.  I don't remember it, though, as I don't
feel strongly for any of those.

# XML #

The new API will require us to add new XML that will allow to track the state
of VCPUs individually. Internally we now have a data structure allowing to
keep the relevant data in one place.

Currently we are setting data relevant to VCPUs in many places.

<domain>
 [...]
 <vcpu current='1'>3</vcpu>
 [...]
 <cputune>
   <cpupin ... />
 </cputune>
 [...]
 <cpu>
   <numa>
     <cell id='0' cpus='0' memory='102400' unit='KiB'/>
     <cell id='1' cpus='1-2' memory='102400' unit='KiB'/>
   </numa>

As we'll be required to keep the state for every single cpu I'm thinking of
adding a new subelement called '<vcpus>' to <domain>. This will have a
'<vcpu>' subelement for every configured cpu.

I'm specifically not going to add any of the cpupin or numa node ids to the
/domain/vcpus/vcpu as input parameters to avoid introducing very complicated
checking code that would be required to keep the data in sync.

I'm thinking of adding the numa node id as an output only attribute since it's
relevant to the hotplug case and it's misplaced otherwise. I certainly can add
the duplicated data as output-only attributes.

The XML with the new elements should look like:

<domain>
 [...]
 <vcpu current='1'>3</vcpu>
 <vcpus>
   <vcpu id='0' state='enabled'/> <!-- option 1, no extra data -->
   <vcpu id='1' state='disabled' cell='1'/> <!-- option 2, just numa node,
                                                 since it's non-obvious -->
   <vcpu id='2' state='disabled' cell='1' pin='1-2' scheduler='...'/>
    <!-- option 3 all the data duplicated -->

It is nice to have all the info in there, but won't it confuse users if
it is output-only?  Wait, let me rephrase that question.  Won't it
confuse users?  Wait, most of our XML does already, so scratch that =)

Anyway, how much duplicated info do we already have?  I can now only
think of the memory device which we had to have anyways.  Would it be
too confusing to just add <cpu/> device with all the info?  That would
require all the checks and a lot of unnecessary code.  But it would be
consistent with the memory.  And it actually is a device.  Most probably
not worth the pain.  But OTOH if all the data are output-only...

Sorry for the ramble, just my 2 cents.

 </vcpus>
 [...]
 <cputune>
   <cpupin ... />
 </cputune>
 [...]
 <cpu>
   <numa>
     <cell id='0' cpus='0' memory='102400' unit='KiB'/>
     <cell id='1' cpus='1-2' memory='102400' unit='KiB'/>
   </numa>
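
As an illustration of the per-vCPU state tracking discussed above, here is a
minimal C sketch that formats the "option 1" layout (no extra data) from an
array of per-vCPU records. The struct vcpu_state type and the functions
vcpus_count_enabled() and vcpus_format() are hypothetical names made up for
this sketch; the proposal only says that an internal data structure exists
and does not describe it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical per-vCPU record; not libvirt's internal structure. */
struct vcpu_state {
    unsigned int id;
    bool enabled;
};

/* Number of enabled vCPUs, i.e. the 'current' attribute of <vcpu>. */
static unsigned int vcpus_count_enabled(const struct vcpu_state *v, size_t n)
{
    unsigned int count = 0;

    for (size_t i = 0; i < n; i++) {
        if (v[i].enabled)
            count++;
    }
    return count;
}

/* Formats <vcpu current=...> plus the <vcpus> element into buf.
 * Returns 0 on success, -1 if buf is too small. */
static int vcpus_format(const struct vcpu_state *v, size_t n,
                        char *buf, size_t len)
{
    size_t used = 0;
    int r;

    assert(buf != NULL);

    r = snprintf(buf, len, "<vcpu current='%u'>%zu</vcpu>\n<vcpus>\n",
                 vcpus_count_enabled(v, n), n);
    if (r < 0 || (size_t)r >= len)
        return -1;
    used = (size_t)r;

    for (size_t i = 0; i < n; i++) {
        r = snprintf(buf + used, len - used,
                     "  <vcpu id='%u' state='%s'/>\n",
                     v[i].id, v[i].enabled ? "enabled" : "disabled");
        if (r < 0 || (size_t)r >= len - used)
            return -1;
        used += (size_t)r;
    }

    r = snprintf(buf + used, len - used, "</vcpus>\n");
    if (r < 0 || (size_t)r >= len - used)
        return -1;
    return 0;
}
```

Deriving 'current' from the per-vCPU records rather than storing it
separately is one way to avoid keeping two copies of the same data in sync.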

# migration #

To ensure migration compatibility a new libvirt will set a new migration
feature flag in cases where a sparse topology was created by any means. Older
versions of libvirt will reject it.

As the new cpu data will be ignored by the parser of older libvirt we don't
need to stop formatting them on migration. (fortunately schemas are not
validated during migration)
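
For illustration, a check for a "sparse topology" could look like the C
sketch below. It rests on an assumption the text does not spell out: that
"sparse" means some enabled vCPU has a higher id than a disabled one, so the
enabled set is no longer the contiguous range starting at 0 that the old
<vcpu current='N'> count can express. The function name
vcpus_topology_is_sparse() is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* enabled[i] is the state of the vCPU with id i.
 * Returns true if an enabled vCPU follows a disabled one. */
static bool vcpus_topology_is_sparse(const bool *enabled, size_t n)
{
    bool seen_disabled = false;

    assert(enabled != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        if (!enabled[i])
            seen_disabled = true;
        else if (seen_disabled)
            return true; /* hole before an enabled vCPU */
    }
    return false;
}
```

Under that assumption the migration feature flag would only be needed when
this returns true; configurations reachable through the old API stay
migratable to older versions.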

Unless there are some of those loops through all child
elements/attributes, but either you'll come across that or it will bite
you in the ass during the first migration trial ;)

# qemu/platform implementation caveats #

When starting the VM for the first time it might be necessary to start a
throw-away qemu process to query some details that we'll need to pass in on a
command line. I'm not sure if this is still necessary and I'll try to avoid it
at all cost.

I hope capabilities will tell us what we need.  If not, I hope it can be added.

[1] Some architectures (ppc64) don't directly support thread-level hotplug
and thus require us to plug in a core which translates into multiple threads
(8 in case of power 8).
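
A small C sketch of what core-granularity plugging means for vCPU ids,
assuming ids are assigned to cores contiguously (ids 0-7 on the first core,
8-15 on the second, and so on with 8 threads per core). That numbering and
the helper names vcpu_core_first() and vcpu_core_last() are assumptions made
for this example, not something the proposal specifies.

```c
#include <assert.h>

/* First vCPU id of the core containing 'id'. */
static unsigned int vcpu_core_first(unsigned int id,
                                    unsigned int threads_per_core)
{
    assert(threads_per_core > 0);
    return id - (id % threads_per_core);
}

/* Last vCPU id of the core containing 'id'. */
static unsigned int vcpu_core_last(unsigned int id,
                                   unsigned int threads_per_core)
{
    return vcpu_core_first(id, threads_per_core) + threads_per_core - 1;
}
```

With 8 threads per core, a request for vCPU 11 would cover ids 8 through 15;
with 1 thread per core the range collapses to the requested id alone.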

Possibly other yet unknown problems.

Fingers crossed for the least amount of those.

Thanks for your feedback.

Peter

--
libvir-list mailing list
libvir-list@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/libvir-list