On Fri, Oct 20, 2017 at 10:07:27AM +0100, Daniel P. Berrange wrote:
> On Thu, Oct 19, 2017 at 05:56:49PM -0200, Eduardo Habkost wrote:
> > On Thu, Oct 19, 2017 at 04:28:59PM +0100, Daniel P. Berrange wrote:
> > > On Thu, Oct 19, 2017 at 11:21:22AM -0400, Igor Mammedov wrote:
> > > > ----- Original Message -----
> > > > > From: "Daniel P. Berrange" <berrange@xxxxxxxxxx>
> > > > > To: "Igor Mammedov" <imammedo@xxxxxxxxxx>
> > > > > Cc: "peter maydell" <peter.maydell@xxxxxxxxxx>, pkrempa@xxxxxxxxxx, ehabkost@xxxxxxxxxx, cohuck@xxxxxxxxxx,
> > > > > qemu-devel@xxxxxxxxxx, armbru@xxxxxxxxxx, pbonzini@xxxxxxxxxx, david@xxxxxxxxxxxxxxxxxxxxx
> > > > > Sent: Wednesday, October 18, 2017 5:30:10 PM
> > > > > Subject: Re: [Qemu-devel] [RFC 0/6] enable numa configuration before machine_init() from HMP/QMP
> > > > >
> > > > > On Tue, Oct 17, 2017 at 06:06:35PM +0200, Igor Mammedov wrote:
> > > > > > On Tue, 17 Oct 2017 16:07:59 +0100
> > > > > > "Daniel P. Berrange" <berrange@xxxxxxxxxx> wrote:
> > > > > >
> > > > > > > On Tue, Oct 17, 2017 at 09:27:02AM +0200, Igor Mammedov wrote:
> > > > > > > > On Mon, 16 Oct 2017 17:36:36 +0100
> > > > > > > > "Daniel P. Berrange" <berrange@xxxxxxxxxx> wrote:
> > > > > > > >
> > > > > > > > > On Mon, Oct 16, 2017 at 06:22:50PM +0200, Igor Mammedov wrote:
> > > > > > > > > > The series allows configuring the NUMA mapping at runtime using the
> > > > > > > > > > QMP/HMP interface. For that to happen it introduces a new '-paused'
> > > > > > > > > > CLI option which allows pausing QEMU before machine_init() is run,
> > > > > > > > > > and adds new set-numa-node HMP/QMP commands which, in conjunction
> > > > > > > > > > with info hotpluggable-cpus/query-hotpluggable-cpus, allow
> > > > > > > > > > configuring the NUMA mapping for cpus.
> > > > > > > > >
> > > > > > > > > What's the problem we're seeking to solve here compared to what we
> > > > > > > > > currently do for NUMA configuration?
> > > > > > > > From RHBZ1382425
> > > > > > > > "
> > > > > > > > The current -numa CLI interface is quite limited in how it maps
> > > > > > > > CPUs to NUMA nodes, as it requires providing cpu_index values which
> > > > > > > > are non-obvious and depend on machine/arch. As a result libvirt has to
> > > > > > > > assume/re-implement the cpu_index allocation logic to provide valid
> > > > > > > > values for the -numa cpus=... QEMU CLI option.
> > > > > > >
> > > > > > > In broad terms, this problem applies to every device / object libvirt
> > > > > > > asks QEMU to create. For everything else libvirt is able to assign an
> > > > > > > "id" string, which it can then use to identify the thing later. The
> > > > > > > CPU stuff is different because libvirt isn't able to provide 'id'
> > > > > > > strings for each CPU - QEMU generates a pseudo-id internally which
> > > > > > > libvirt has to infer. The latter is the same problem we had with
> > > > > > > devices before '-device' was introduced allowing 'id' naming.
> > > > > > >
> > > > > > > IMHO we should take the same approach with CPUs and start modelling
> > > > > > > the individual CPUs as something we can explicitly create with -object
> > > > > > > or -device. That way libvirt can assign names and does not have to
> > > > > > > care about CPU index values, and it all works just the same way as
> > > > > > > any other device / object we create.
> > > > > > >
> > > > > > > ie instead of:
> > > > > > >
> > > > > > >   -smp 8,sockets=4,cores=2,threads=1
> > > > > > >   -numa node,nodeid=0,cpus=0-3
> > > > > > >   -numa node,nodeid=1,cpus=4-7
> > > > > > >
> > > > > > > we could do:
> > > > > > >
> > > > > > >   -object numa-node,id=numa0
> > > > > > >   -object numa-node,id=numa1
> > > > > > >   -object cpu,id=cpu0,node=numa0,socket=0,core=0,thread=0
> > > > > > >   -object cpu,id=cpu1,node=numa0,socket=0,core=1,thread=0
> > > > > > >   -object cpu,id=cpu2,node=numa0,socket=1,core=0,thread=0
> > > > > > >   -object cpu,id=cpu3,node=numa0,socket=1,core=1,thread=0
> > > > > > >   -object cpu,id=cpu4,node=numa1,socket=2,core=0,thread=0
> > > > > > >   -object cpu,id=cpu5,node=numa1,socket=2,core=1,thread=0
> > > > > > >   -object cpu,id=cpu6,node=numa1,socket=3,core=0,thread=0
> > > > > > >   -object cpu,id=cpu7,node=numa1,socket=3,core=1,thread=0
> > > > > > The follow-up question would be where "socket=3,core=1,thread=0"
> > > > > > comes from; currently these options are a function of
> > > > > > (-M foo -smp ...) and can be queried via query-hotpluggable-cpus at
> > > > > > runtime after qemu parses the -M and -smp options.
> > > > >
> > > > > NB, I realize my example was open to mis-interpretation. The values I'm
> > > > > illustrating here for socket=3,core=1,thread=0 are *not* ID values, they
> > > > > are a plain enumeration of values. ie this is saying the 4th socket, the
> > > > > 2nd core and the 1st thread. Internally QEMU might have the 2nd core
> > > > > with a core-id of 8, or 7038, or whatever architecture specific numbering
> > > > > scheme makes sense, but that's not what the mgmt app gives at the CLI
> > > > > level.
> > > > Even though the simplicity of fixed properties/values is tempting, and it
> > > > might even work for what we have implemented in qemu currently (well, SPAPR
> > > > will need refactoring (if possible) to meet the requirements, plus compat
> > > > stuff for current machines with sparse IDs), I have to disagree here and
> > > > try to oppose it.
> > > >
> > > > QEMU models concrete platforms/hw with certain non-abstract properties,
> > > > and it's libvirt's domain to translate platform specific devices into
> > > > 'spherical' devices with abstract properties.
> > > >
> > > > Now back to cpus and the suggestion to fix the set of 'address' properties
> > > > and their values into a continuous enumeration range [0..N). That would:
> > > > 1. put the burden of hiding platform/device details on QEMU
> > > >    (which is already bad, as QEMU's job is to emulate them)
> > > > 2. with abstract 'address' properties and values, the user won't have
> > > >    a clue as to where a device is being attached (as qemu would magically
> > > >    remap that to fit specific machine needs)
> > > > 2.1. with abstract 'address' properties and values we could do away with
> > > >    socket/core/thread/whatnot, since they won't mean the same thing when
> > > >    considered from the platform point of view, so we could just drop all
> > > >    of that and go back to cpu-index, which has all the properties you've
> > > >    suggested /abstract, [0..N]/.
> > > > 3. we currently stopped with socket|core|thread-id properties as they are
> > > >    applicable to machines that support -device cpu, but it's up to the
> > > >    machine to pick which of these to use (x86: uses all, spapr: uses
> > > >    core-id only), and the current property set is open for extension if
> > > >    the need arises, without having to redefine the interface. So a fixed
> > > >    list of properties [even ignoring the values impact] doesn't scale.
> > > >
> > > Note from the libvirt POV, we don't expose socket-id/core-id/thread-id in our
> > > guest XML, we just provide an overall count of sockets/cores/threads which is
> > > portable. The only arch specific thing we would have to do is express
> > > constraints about the ratios of these - eg indicate in some way that ppc
> > > doesn't allow multiple threads per core, for example.
> > >
> > > > We even have the cpu-add command, which takes cpu-index as an argument, and
> > > > the -numa node,cpus=0..X CLI option; good luck figuring out which cpu goes
> > > > where and whether it makes any sense from the platform point of view.
> > > >
> > > > That's why, when designing hot plug for the 'device_add cpu' interface, we
> > > > ended up with the new query-hotpluggable-cpus QMP command, which is
> > > > currently used by libvirt for hot-plug.
> > > >
> > > > The approach allows:
> > > > 1: the machine to publish properties/values that make sense from the
> > > >    emulated platform point of view but are still understandable by a user
> > > >    of the given hw.
> > > > 2: the user to treat them as opaque mandatory properties to create a cpu
> > > >    device if he/she doesn't care about where it's plugged.
> > > > 3: if the user cares about which cpu goes where, the properties defined by
> > > >    the machine provide that info from the emulated hw point of view,
> > > >    including platform specific details.
> > > > 4: extending the set of properties/values easily if the need arises without
> > > >    breaking users (provided the user puts them all in -device/device_add
> > > >    options as they are supposed to).
> > > >
> > > > But the current approach has a drawback: to call query-hotpluggable-cpus,
> > > > the machine has to be started first, which is fine for hot plug but not for
> > > > specifying CLI options.
> > > >
> > > > Currently that could be solved by starting qemu twice when 'defining' a
> > > > domain, where on the first run mgmt queries the board layout and caches it
> > > > for all the next times the defined machine is started (a change in
> > > > machine/version/-smp/-cpu will invalidate the cache).
> > > >
> > > > This series allows avoiding that first-time restart: when creating a domain
> > > > for the first time, mgmt can query the layout and then specify the numa
> > > > mapping without restarting; it can cache the defined mapping, as the
> > > > commands exactly match the corresponding CLI options, and reuse the cached
> > > > options on the next domain starts.
> > > >
> > > > This approach could be extended further with the "device_add cpu" command,
> > > > so it would be possible to start qemu with -smp 0,... and allow mgmt to
> > > > create cpus with explicit IDs controlled by mgmt, and again mgmt may cache
> > > > these commands and reuse them on the CLI the next time the machine is
> > > > started.
> > > >
> > > > I think Eduardo's work on query-slots is a superset of
> > > > query-hotpluggable-cpus, but working towards the same goal: to allow mgmt
> > > > to discover which hw is provided by a specific machine and where/which hw
> > > > could be plugged (like which slot supports which kind of device and which
> > > > 'address' should be used to attach a device (socket|core... - for cpus,
> > > > bus/function - for pci, ...)).
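For reference, the query-hotpluggable-cpus flow described above looks roughly
like this on an x86 machine; the CPU type, property values and qom-path below
are illustrative only, and the exact set of props varies by machine type:

  -> { "execute": "query-hotpluggable-cpus" }
  <- { "return": [
         { "type": "qemu64-x86_64-cpu", "vcpus-count": 1,
           "props": { "socket-id": 1, "core-id": 0, "thread-id": 0 } },
         { "type": "qemu64-x86_64-cpu", "vcpus-count": 1,
           "props": { "socket-id": 0, "core-id": 0, "thread-id": 0 },
           "qom-path": "/machine/unattached/device[0]" } ] }

A later hotplug then passes the advertised props back unchanged, e.g.:

  (qemu) device_add qemu64-x86_64-cpu,id=cpu1,socket-id=1,core-id=0,thread-id=0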
> > >
> > > As mentioned elsewhere in the thread, the approach of defining the VM config
> > > incrementally via the monitor has significant downsides, by making the config
> > > invisible in any logs of the ARGV, and has a likely performance impact when
> > > starting up QEMU, particularly if it is used for more things going forward. To
> > > me these downsides are enough to make the suggested approach for CPUs impractical
> > > for libvirt to use.
> >
> > Those downsides do exist, but we should weigh them against the
> > downsides of not allowing any information at all to flow from
> > QEMU to libvirt when starting a VM.
> >
> > I believe the code in libvirt/src/qemu/qemu_domain_address.c is
> > a good illustration of those downsides.
>
> Right, but for this NUMA / CPU scenario I don't think we're going to end up
> with complexity like this. I still believe we are able to come up with a
> way to represent it at the CLI without so much architecture specific
> knowledge.

In the case of NUMA/CPU, I'm inclined to agree.

> Even if that is not possible though, from libvirt POV the extra complexity
> is worth it, if that is what we need to preserve fast startup time. The
> time to start a guest is very important to apps like libguestfs and libvirt
> sandbox, so going down a direction which is likely to add 100's or even 1000's
> of milliseconds to the startup time is not desirable, even if it makes libvirt
> simpler

I don't believe this is likely to add 100's or 1000's of milliseconds to
startup time, but I agree we need to keep an eye on startup time while
introducing new interfaces.

-- 
Eduardo
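For completeness, a rough sketch of the startup flow the RFC series would
enable. Only '-paused', set-numa-node and query-hotpluggable-cpus are named in
the thread; the set-numa-node argument shapes below (mirroring the existing
-numa node/cpu options) and the final 'cont' step are assumptions, not the
actual interface from the patches:

  $ qemu-system-x86_64 -paused -smp 8,sockets=4,cores=2,threads=1 -qmp stdio ...
  -> { "execute": "query-hotpluggable-cpus" }
  <- { "return": [ ...props for each possible CPU, as in the example above... ] }
  -> { "execute": "set-numa-node",
       "arguments": { "type": "node", "nodeid": 0 } }
  -> { "execute": "set-numa-node",
       "arguments": { "type": "cpu", "node-id": 0,
                      "socket-id": 0, "core-id": 0, "thread-id": 0 } }
  -> { "execute": "cont" }

The management application could then cache the same parameters and replay
them as -numa CLI options on subsequent starts, as described earlier in the
thread.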