On Mon, Jan 19, 2015 at 10:23:22PM -0500, John Ferlan wrote:
On 01/13/2015 02:57 AM, Martin Kletzander wrote:In order for QEMU vCPU (and other) threads to run with RT scheduler, libvirt needs to take care of that so QEMU doesn't have to run privileged. Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1178986 Signed-off-by: Martin Kletzander <mkletzan@xxxxxxxxxx> --- docs/formatdomain.html.in | 13 +++ docs/schemas/domaincommon.rng | 33 ++++++ src/conf/domain_conf.c | 137 +++++++++++++++++++++++- src/conf/domain_conf.h | 24 +++++ tests/qemuxml2argvdata/qemuxml2argv-cputune.xml | 4 + 5 files changed, 210 insertions(+), 1 deletion(-)You'll need to rebase - I couldn't get this to apply... So review is strictly visual and memory based.diff --git a/docs/formatdomain.html.in b/docs/formatdomain.html.in index d8c31e0..60dad76 100644 --- a/docs/formatdomain.html.in +++ b/docs/formatdomain.html.in @@ -550,6 +550,8 @@ <quota>-1</quota> <emulator_period>1000000</emulator_period> <emulator_quota>-1</emulator_quota> + <vcpusched scheduler='fifo' priority='1'/> + <iothreadsched scheduler='idle'/> </cputune> ... </domain> @@ -656,6 +658,17 @@ <span class="since">Only QEMU driver support since 0.10.0</span> </dd> + <dt><code>vcpusched</code> and <code>iothreadsched</code></dt> + <dd> + The optional <code>vcpusched</code> element specifies the scheduler + (values <code>batch</code>, <code>idle</code>, <code>fifo</code>, + <code>rr</code> and <code>other</code>, which is default and ignored) + for all vCPU/IOThread threads. For real-time schedulers + (<code>fifo</code>, <code>rr</code>), priority can be specified as + well. The value range for the priority depends on the host kernel. + <span class="since">Since 1.2.12</span> + </dd> +Why would someone use batch or idle? They're not described.
I just took all the options form sched_setscheduler() and made them available. But I have no problem removing some of them as we can always add them later on.
Upon first reading, I assumed one setting was good for *all* vcpu's and iothread's; however, later on there's a 'vcpuid' and 'iothreadid' defined in the rng file. Does it even make sense to have 2 vCPUs using 'fifo' while another 2 are using 'ff'? If not, then why allow choosing the id? (it's the "for all vCPU/IOThread threads" language that gave me the impression).
Yes, that's my fault. What you understood fine (and I described) was a first version. Then, before posting it, I enhanced that based on response from QEMU guys who say that someone might need per-thread specifics.
If you did desire different models, then why not model after the 'cpuset' with respect to being able to choose a range. If someone has 16 vCPU's are they going to be happy having to make 16 entries or just one that indicates 0-15. You should be able to borrow/reuse existing code for that parsing cpuset... Would listing priority with batch/idle result in error or be ignored? Testing will try it and they will flag it unless you say it'll be ignored.
It would be ignored. I went with all-optional approach, so it's easiest to use. And that takes me to the fact that one thing is still missing from this. I'd suggest (myself) to make the vcpu/iothread ID also optional and that one element without that particular ID would be default for all threads not specified in other elements. I'm just not sure that goes fine with our current approach (or rather all our current approaches as we're *so* inconsistent).
I wouldn't list "other" - it's confusing. I don't think other XML lists other. Why would someone list "other" - they would only have this if they wanted one of the models.</dl> diff --git a/docs/schemas/domaincommon.rng b/docs/schemas/domaincommon.rng index d250e6f..6653737 100644 --- a/docs/schemas/domaincommon.rng +++ b/docs/schemas/domaincommon.rng @@ -815,10 +815,43 @@ </attribute> </element> </zeroOrMore> + <zeroOrMore> + <element name="vcpusched"> + <attribute name="vcpu"> + <ref name="vcpuid"/> + </attribute> + <ref name="schedparam"/> + </element> + </zeroOrMore> + <zeroOrMore> + <element name="iothreadsched"> + <attribute name="iothread"> + <ref name="unsignedInt"/> + </attribute> + <ref name="schedparam"/> + </element> + </zeroOrMore>so essentially : <vcpusched vcpuid=n scheduler='string' priority=n/> and <iothreadsched iothread=n scheduler='string' priority=n/></interleave> </element> </define> + <define name="schedparam"> + <attribute name="scheduler"> + <choice> + <value>other</value> + <value>batch</value> + <value>idle</value> + <value>fifo</value> + <value>rr</value> + </choice> + </attribute> + <optional> + <attribute name="priority"> + <ref name="unsignedShort"/> + </attribute> + </optional> + </define> +leaving other here is OK... could also go with "none"
I took other as SCHED_OTHER from sched_setscheduler(). If I'd add "none", I'd add it before "other" and "other" would then mean that the user wants to specifically call sched_setscheduler() with SCHED_OTHER. That might be useful if you want to use some policy for all vCPUs except one particular one.
<!-- All the NUMA related tunables would go in the numatune --> <define name="numatune"> <element name="numatune"> diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c index 96d80a9..16adbf1 100644 --- a/src/conf/domain_conf.c +++ b/src/conf/domain_conf.c @@ -795,6 +795,13 @@ VIR_ENUM_IMPL(virDomainLoader, "rom", "pflash") +VIR_ENUM_IMPL(virDomainThreadSched, VIR_DOMAIN_THREAD_SCHED_LAST, + "other", /* default */ + "batch", + "idle", + "fifo", + "rr") +other a/k/a none/* Internal mapping: subset of block job types that can be present in * <mirror> XML (remaining types are not two-phase). */ VIR_ENUM_DECL(virDomainBlockJob) @@ -2260,6 +2267,9 @@ void virDomainDefFree(virDomainDefPtr def) virDomainVcpuPinDefArrayFree(def->cputune.iothreadspin, def->cputune.niothreadspin); + VIR_FREE(def->cputune.vcpusched); + VIR_FREE(def->cputune.iothreadsched); + virDomainNumatuneFree(def->numatune); virSysinfoDefFree(def->sysinfo); @@ -12653,6 +12663,65 @@ virDomainLoaderDefParseXML(xmlNodePtr node, return ret; } +static int +virDomainThreadSchedParse(xmlNodePtr node, + unsigned int maxid, + const char *name, + virDomainThreadSchedParamPtr sp) +{ + char *tmp = NULL; + int sched = 0; + + tmp = virXMLPropString(node, name); + if (!tmp) { + virReportError(VIR_ERR_XML_ERROR, + _("invalid value for attribute %s"), name); + goto error; + } + + if (virStrToLong_uip(tmp, NULL, 10, &sp->id) < 0 || + sp->id >= maxid) { + virReportError(VIR_ERR_XML_ERROR, + _("vcpu must be positive integer smaller than " + "maximum of vcpus, not '%s'"), + tmp);Hmm.... this is challenging... vcpus are numbered 0..n-1; however, iothreads are numbered 1..n
Yes, but I remember you had a reason for that.
Use the 'name' instead of 'vcpu' _("Invalid %ssched value '%s'", name, tmp).
That should be OK for translators. Or at least I hope so.
However, I think the range check should be in the caller. Then again, why are we allowing someone to choose per cpu?+ goto error; + } + VIR_FREE(tmp); + + tmp = virXMLPropString(node, "scheduler"); + if (tmp) { + if ((sched = virDomainThreadSchedTypeFromString(tmp)) < 0) {Why accept "other"? If you choose to not accept "other", then this would be <= 0... That way on output the only way it gets written is if it's been set to something.+ virReportError(VIR_ERR_INTERNAL_ERROR, + _("invalid vcpu scheduler: '%s'"), tmp);change 'vcpu scheduler: '%s' to %ssched scheduler='%s' value, where the first %s is the 'name' argument.+ goto error; + } + sp->scheduler = sched; + + VIR_FREE(tmp); + if (sp->scheduler >= VIR_DOMAIN_THREAD_SCHED_FIFO) {Oh - this is dangerous... Today you've set them up that way, but if some day something is added after RR that doesn't support priority, then there'll be an error. Be specific here.
OK, that'll be better.
+ tmp = virXMLPropString(node, "priority"); + if (!tmp) { + virReportError(VIR_ERR_XML_ERROR, "%s", + _("missing scheduler priority")); + goto error; + }Priority is listed as optional in the rng, but this implies it must be provided. Also you'll need to document that it's only parsed for fifo/rr...+ if (virStrToLong_i(tmp, NULL, 10, &sp->priority) < 0) { + virReportError(VIR_ERR_XML_ERROR, "%s", + _("invalid value for element priority")); + goto error; + } + VIR_FREE(tmp); + } + } + + return 0; + + error: + VIR_FREE(tmp); + return -1; +} + static virDomainDefPtr virDomainDefParseXML(xmlDocPtr xml, xmlNodePtr root, @@ -13201,6 +13270,51 @@ virDomainDefParseXML(xmlDocPtr xml, } VIR_FREE(nodes); + if ((n = virXPathNodeSet("./cputune/vcpusched", ctxt, &nodes)) < 0) { + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("cannot extract vcpusched nodes")); + goto error; + } + if (n) { + if (n > def->maxvcpus) { + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("too many vcpusched nodes in cputune")); + goto error; + } + + if (VIR_ALLOC_N(def->cputune.vcpusched, n) < 0) + goto error; + def->cputune.nvcpusched = n; + + for (i = 0; i < def->cputune.nvcpusched; i++) { + if (virDomainThreadSchedParse(nodes[i], def->maxvcpus, "vcpu", + &def->cputune.vcpusched[i]) < 0) + goto error; + }linear algorithm, but not linear number to vcpusched... Also, could follow the 'cpuset' algorithm...
My idea would be to allocate the parameters per each cpu and just fill it in like I did with memnode. And for the cpuset=, I was under the impression that it would just confuse people. Look at the vcpupin and iothreadpin, we don't put it there... But I'm not against that.
+ }VIR_FREE(nodes); ?
Yes!
+ + if ((n = virXPathNodeSet("./cputune/iothreadsched", ctxt, &nodes)) < 0) { + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("cannot extract iothreadsched nodes")); + goto error; + } + if (n) { + if (n > def->iothreads) { + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("too many iothreadsched nodes in cputune")); + goto error; + } + + if (VIR_ALLOC_N(def->cputune.iothreadsched, n) < 0) + goto error; + def->cputune.niothreadsched = n; + + for (i = 0; i < def->cputune.niothreadsched; i++) { + if (virDomainThreadSchedParse(nodes[i], def->iothreads + 1, "iothread", + &def->cputune.iothreadsched[i]) < 0) + goto error; + } + }VIR_FREE(nodes); ?
And yes.
/* analysis of cpu handling */ if ((node = virXPathNode("./cpu[1]", ctxt)) != NULL) { @@ -19519,7 +19633,8 @@ virDomainDefFormatInternal(virDomainDefPtr def, def->cputune.period || def->cputune.quota || def->cputune.emulatorpin || def->cputune.emulator_period || def->cputune.emulator_quota || - def->cputune.niothreadspin) { + def->cputune.niothreadspin || + def->cputune.vcpusched || def->cputune.iothreadsched) { virBufferAddLit(buf, "<cputune>\n"); cputune = true; } @@ -19592,6 +19707,26 @@ virDomainDefFormatInternal(virDomainDefPtr def, VIR_FREE(cpumask); } + for (i = 0; i < def->cputune.nvcpusched; i++) { + virDomainThreadSchedParamPtr sp = &def->cputune.vcpusched[i]; + virBufferAsprintf(buf, "<vcpusched vcpu='%d' scheduler='%s'", + sp->id, + virDomainThreadSchedTypeToString(sp->scheduler)); + if (sp->priority) + virBufferAsprintf(buf, " priority='%d'", sp->priority); + virBufferAddLit(buf, "/>\n"); + } + + for (i = 0; i < def->cputune.niothreadsched; i++) { + virDomainThreadSchedParamPtr sp = &def->cputune.iothreadsched[i]; + virBufferAsprintf(buf, "<iothreadsched iothread='%d' scheduler='%s'", + sp->id, + virDomainThreadSchedTypeToString(sp->scheduler)); + if (sp->priority) + virBufferAsprintf(buf, " priority='%d'", sp->priority); + virBufferAddLit(buf, "/>\n"); + } + virBufferAdjustIndent(buf, -2); if (cputune) virBufferAddLit(buf, "</cputune>\n"); diff --git a/src/conf/domain_conf.h b/src/conf/domain_conf.h index 5cc42d1..636e2d1 100644 --- a/src/conf/domain_conf.h +++ b/src/conf/domain_conf.h @@ -1815,6 +1815,24 @@ typedef enum { VIR_DOMAIN_CPU_PLACEMENT_MODE_LAST } virDomainCpuPlacementMode; +typedef enum { + VIR_DOMAIN_THREAD_SCHED_OTHER = 0, + VIR_DOMAIN_THREAD_SCHED_BATCH, + VIR_DOMAIN_THREAD_SCHED_IDLE, + VIR_DOMAIN_THREAD_SCHED_FIFO, + VIR_DOMAIN_THREAD_SCHED_RR, + + VIR_DOMAIN_THREAD_SCHED_LAST +} virDomainThreadSched; + +typedef struct _virDomainThreadSchedParam virDomainThreadSchedParam; +typedef virDomainThreadSchedParam *virDomainThreadSchedParamPtr; +struct _virDomainThreadSchedParam { + unsigned int id; + virDomainThreadSched scheduler; + int priority; +}; + typedef struct _virDomainTimerCatchupDef virDomainTimerCatchupDef; typedef virDomainTimerCatchupDef *virDomainTimerCatchupDefPtr; struct _virDomainTimerCatchupDef { @@ -2002,6 +2020,11 @@ struct _virDomainCputune { virDomainVcpuPinDefPtr emulatorpin; size_t niothreadspin; virDomainVcpuPinDefPtr *iothreadspin; + + size_t nvcpusched; + virDomainThreadSchedParamPtr vcpusched; + size_t niothreadsched; + virDomainThreadSchedParamPtr iothreadsched; }; typedef struct _virDomainBlkiotune virDomainBlkiotune; @@ -2807,6 +2830,7 @@ VIR_ENUM_DECL(virDomainRNGModel) VIR_ENUM_DECL(virDomainRNGBackend) VIR_ENUM_DECL(virDomainTPMModel) VIR_ENUM_DECL(virDomainTPMBackend) +VIR_ENUM_DECL(virDomainThreadSched) /* from libvirt.h */ VIR_ENUM_DECL(virDomainState) VIR_ENUM_DECL(virDomainNostateReason) diff --git a/tests/qemuxml2argvdata/qemuxml2argv-cputune.xml b/tests/qemuxml2argvdata/qemuxml2argv-cputune.xml index 813d201..243092b 100644 --- a/tests/qemuxml2argvdata/qemuxml2argv-cputune.xml +++ b/tests/qemuxml2argvdata/qemuxml2argv-cputune.xml @@ -4,6 +4,7 @@ <memory unit='KiB'>219136</memory> <currentMemory unit='KiB'>219136</currentMemory> <vcpu placement='static'>2</vcpu> + <iothreads>1</iothreads> <cputune> <shares>2048</shares> <period>1000000</period> @@ -11,6 +12,9 @@ <vcpupin vcpu='0' cpuset='0'/> <vcpupin vcpu='1' cpuset='1'/> <emulatorpin cpuset='1'/> + <vcpusched vcpu='0' scheduler='fifo' priority='1'/> + <vcpusched vcpu='1' scheduler='batch'/> + <iothreadsched iothread='0' scheduler='idle'/>iothread='0' - no such beast. 1..n read the iothreadpin description
Ouch, what a beast it had become...
</cputune> <os> <type arch='i686' machine='pc'>hvm</type>I think this should be a new file - eg qemuxml2argv-cpusched.xml with the accompanying qemuxml2argv-cpusched.args file... That way you're testing the results with and without your new XML Since priority is listed as optional for 'fifo' and 'rr', you'll also need one 'cpusched' with priority defined and one without it defined using 'fifo'/'rr'. So that's either two more files or more vcpus in your XML.
OK, I'll do that. Thanks for the review, Martin
Attachment:
pgpI76SbF8UYf.pgp
Description: PGP signature
-- libvir-list mailing list libvir-list@xxxxxxxxxx https://www.redhat.com/mailman/listinfo/libvir-list