On Thu, Oct 29, 2015 at 02:02:29PM +0800, Qiaowei Ren wrote:
> One RFC in
> https://www.redhat.com/archives/libvir-list/2015-June/msg01509.html
>
> CMT (Cache Monitoring Technology) can be used to measure the
> usage of cache by VM running on the host. This patch will
> extend the bulk stats API (virDomainListGetStats) to add this
> field. Applications based on libvirt can use this API to achieve
> cache usage of VM. Because CMT implementation in Linux kernel
> is based on perf mechanism, this patch will enable perf event
> for CMT when VM is created and disable it when VM is destroyed.
>
> Signed-off-by: Qiaowei Ren <qiaowei.ren@xxxxxxxxx>

Thanks for re-sending this patchset, it has reminded me of the
concerns / questions I had around this previously.

Just ignoring the code for a minute, IIUC the design is

 - Open a file handle to the kernel perf system for each running VM
 - Associate that perf event file handle with the QEMU VM PID
 - Enable recording of the CMT perf event on that file handle
 - Report the CMT event values in the virDomainGetStats() API call
   when VIR_DOMAIN_STATS_CACHE is requested

My two primary concerns are

 1. Do we want to have a perf event FD open for every running VM
    all the time?
 2. Is the virDomainGetStats() integration the right API approach?

For item 1, my concern is that the CMT event is only ever going to be
consumed by OpenStack, and even then, only by OpenStack installs which
have the scheduler plugin that cares about the CMT event data. It feels
undesirable to have this perf system enabled for all libvirt VMs, when
perhaps < 1% of libvirt users actually want this data. It feels like we
need some mechanism to decide when this event is enabled.

For item 2, my concern is whether virDomainGetStats is the right API.
I think it probably *is* the right API, since I can't think of a better
way. Should we, however, have a very special case VIR_DOMAIN_STATS_CACHE
group, or should we have something more generic?

For example, if I run 'perf list' I see

  List of pre-defined events (to be used in -e):

    branch-instructions OR branches                  [Hardware event]
    branch-misses                                    [Hardware event]
    bus-cycles                                       [Hardware event]
    cache-misses                                     [Hardware event]
    cache-references                                 [Hardware event]
    cpu-cycles OR cycles                             [Hardware event]
    instructions                                     [Hardware event]
    ref-cycles                                       [Hardware event]
    stalled-cycles-frontend OR idle-cycles-frontend  [Hardware event]

    alignment-faults                                 [Software event]
    context-switches OR cs                           [Software event]
    cpu-clock                                        [Software event]
    cpu-migrations OR migrations                     [Software event]
    dummy                                            [Software event]
    emulation-faults                                 [Software event]
    major-faults                                     [Software event]
    minor-faults                                     [Software event]

    ...and many many more...

Does it make sense to extend the virDomainStats API to *only* deal with
reporting of the one specific perf event that you care about right now?
It feels like it might be better if we did something more general
purpose, e.g. what if something wants to get 'major-faults' data in
future? Do we then add a VIR_DOMAIN_STATS_MAJOR_FAULT enum item, and so
on for every event?

Combining these two concerns, I think we might need two things:

 - A new API to turn on/off collection of specific perf events.
   This could be something like

       virDomainGetPerfEvents(virDomainPtr dom,
                              virTypedParameterPtr params);

   This would fill virTypedParameters with one entry for each perf
   event, using the VIR_TYPED_PARAM_BOOLEAN type. A 'true' value
   would indicate that event is enabled for the VM.

   A corresponding

       virDomainSetPerfEvents(virDomainPtr dom,
                              virTypedParameterPtr params);

   would enable you to toggle the flag, to enable/disable the
   particular list of perf events you care about.

 - With that, we could have a 'VIR_DOMAIN_STATS_PERF_EVENT' enum item
   for virDomainStats which causes reporting of all previously enabled
   perf events.

This would avoid us needing to have the perf event enabled for all VMs
all the time. Only applications using libvirt which actually need the
data would turn it on. It would also be scalable to all types of perf
event, instead of just one specific event.
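
To make that proposal a little more concrete, here is a rough usage
sketch from the application side. This is only illustrative - the
signature follows my sketch above with an extra nparams argument
assumed, and the "cmt" event name is just a hypothetical example of
how an event might be identified:

    #include <libvirt/libvirt.h>

    /* Hypothetical usage of the proposed API. virDomainSetPerfEvents()
     * does not exist yet, the extra nparams argument is assumed, and
     * the "cmt" event name is only an illustration. */
    static int
    enableCacheMonitoring(virDomainPtr dom)
    {
        virTypedParameterPtr params = NULL;
        int nparams = 0;
        int maxparams = 0;
        int ret = -1;

        /* Request collection of just the one event this app cares about */
        if (virTypedParamsAddBoolean(&params, &nparams, &maxparams,
                                     "cmt", 1) < 0)
            goto cleanup;

        /* Proposed API, not an existing libvirt entry point */
        if (virDomainSetPerfEvents(dom, params, nparams) < 0)
            goto cleanup;

        ret = 0;
     cleanup:
        virTypedParamsFree(params, nparams);
        return ret;
    }
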
> diff --git a/src/qemu/qemu_domain.h b/src/qemu/qemu_domain.h
> index 54e1e7b..31bce33 100644
> --- a/src/qemu/qemu_domain.h
> +++ b/src/qemu/qemu_domain.h
> @@ -196,6 +196,9 @@ struct _qemuDomainObjPrivate {
>
>      bool hookRun; /* true if there was a hook run over this domain */
>
> +    int cmt_fd; /* perf handler for CMT */
> +
> +
>      /* Bitmaps below hold data from the auto NUMA feature */
>      virBitmapPtr autoNodeset;
>      virBitmapPtr autoCpuset;
> diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
> index 4cfae03..8c678c9 100644
> --- a/src/qemu/qemu_driver.c
> +++ b/src/qemu/qemu_driver.c
> @@ -19320,6 +19320,53 @@ qemuDomainGetStatsBlock(virQEMUDriverPtr driver,
>
>  #undef QEMU_ADD_COUNT_PARAM
>
> +static int
> +qemuDomainGetStatsCache(virQEMUDriverPtr driver ATTRIBUTE_UNUSED,
> +                        virDomainObjPtr dom,
> +                        virDomainStatsRecordPtr record,
> +                        int *maxparams,
> +                        unsigned int privflags ATTRIBUTE_UNUSED)
> +{
> +    qemuDomainObjPrivatePtr priv = dom->privateData;
> +    FILE *fd;
> +    unsigned long long cache = 0;
> +    int scaling_factor = 0;
> +
> +    if (priv->cmt_fd <= 0)
> +        return -1;
> +
> +    if (read(priv->cmt_fd, &cache, sizeof(uint64_t)) < 0) {
> +        virReportSystemError(errno, "%s",
> +                             _("Unable to read cache data"));
> +        return -1;
> +    }
> +
> +    fd = fopen("/sys/devices/intel_cqm/events/llc_occupancy.scale", "r");
> +    if (!fd) {
> +        virReportSystemError(errno, "%s",
> +                             _("Unable to open CMT scale file"));
> +        return -1;
> +    }
> +    if (fscanf(fd, "%d", &scaling_factor) != 1) {
> +        virReportSystemError(errno, "%s",
> +                             _("Unable to read CMT scale file"));
> +        VIR_FORCE_FCLOSE(fd);
> +        return -1;
> +    }
> +    VIR_FORCE_FCLOSE(fd);
> +
> +    cache *= scaling_factor;
> +
> +    if (virTypedParamsAddULLong(&record->params,
> +                                &record->nparams,
> +                                maxparams,
> +                                "cache.current",
> +                                cache) < 0)
> +        return -1;
> +
> +    return 0;
> +}
> +
>  typedef int
>  (*qemuDomainGetStatsFunc)(virQEMUDriverPtr driver,
>                            virDomainObjPtr dom,
> @@ -19340,6 +19387,7 @@ static struct qemuDomainGetStatsWorker qemuDomainGetStatsWorkers[] = {
>      { qemuDomainGetStatsVcpu, VIR_DOMAIN_STATS_VCPU, false },
>      { qemuDomainGetStatsInterface, VIR_DOMAIN_STATS_INTERFACE, false },
>      { qemuDomainGetStatsBlock, VIR_DOMAIN_STATS_BLOCK, true },
> +    { qemuDomainGetStatsCache, VIR_DOMAIN_STATS_CACHE, false },
>      { NULL, 0, false }
>  };
>
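
As an aside on the consumer side of this hunk, an application would
presumably end up pulling the value out of the bulk stats API roughly
as in the sketch below (assuming the VIR_DOMAIN_STATS_CACHE flag and
the "cache.current" field this patch proposes, neither of which exists
upstream yet):

    #include <stdio.h>
    #include <libvirt/libvirt.h>

    /* Sketch only: VIR_DOMAIN_STATS_CACHE and the "cache.current"
     * field are the ones proposed by this patch, not upstream API. */
    static void
    dumpCacheOccupancy(virConnectPtr conn)
    {
        virDomainStatsRecordPtr *records = NULL;
        int nrecords;
        int i;
        unsigned long long cache;

        nrecords = virConnectGetAllDomainStats(conn,
                                               VIR_DOMAIN_STATS_CACHE,
                                               &records, 0);
        if (nrecords < 0)
            return;

        for (i = 0; i < nrecords; i++) {
            /* Each record carries the typed params for one domain */
            if (virTypedParamsGetULLong(records[i]->params,
                                        records[i]->nparams,
                                        "cache.current", &cache) == 1)
                printf("%s: cache.current=%llu\n",
                       virDomainGetName(records[i]->dom), cache);
        }

        virDomainStatsRecordListFree(records);
    }
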
> diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
> index ba84182..00b889d 100644
> --- a/src/qemu/qemu_process.c
> +++ b/src/qemu/qemu_process.c
> @@ -25,8 +25,11 @@
>  #include <unistd.h>
>  #include <signal.h>
>  #include <sys/stat.h>
> +#include <sys/ioctl.h>
> +#include <sys/syscall.h>
>  #if defined(__linux__)
>  # include <linux/capability.h>
> +# include <linux/perf_event.h>
>  #elif defined(__FreeBSD__)
>  # include <sys/param.h>
>  # include <sys/cpuset.h>
> @@ -4274,6 +4277,75 @@ qemuLogOperation(virDomainObjPtr vm,
>      goto cleanup;
>  }
>
> +/*
> + * Enable CMT (Cache Monitoring Technology) to measure the usage of
> + * cache by a VM running on the node.
> + *
> + * Because the hypervisor implements CMT support based on the perf
> + * mechanism, we should enable a perf event for CMT. The function
> + * 'sys_perf_event_open' is a perf syscall wrapper.
> + */
> +#ifdef __linux__
> +static long sys_perf_event_open(struct perf_event_attr *hw_event,
> +                                pid_t pid, int cpu, int group_fd,
> +                                unsigned long flags)
> +{
> +    return syscall(__NR_perf_event_open, hw_event, pid, cpu,
> +                   group_fd, flags);
> +}
> +static int qemuCmtEnable(virDomainObjPtr vm)
> +{
> +    qemuDomainObjPrivatePtr priv = vm->privateData;
> +    struct perf_event_attr cmt_attr;
> +    int event_type;
> +    FILE *fp;
> +
> +    fp = fopen("/sys/devices/intel_cqm/type", "r");
> +    if (!fp) {
> +        virReportSystemError(errno, "%s",
> +                             _("CMT is not available on this host"));
> +        return -1;
> +    }
> +    if (fscanf(fp, "%d", &event_type) != 1) {
> +        virReportSystemError(errno, "%s",
> +                             _("Unable to read event type file."));
> +        VIR_FORCE_FCLOSE(fp);
> +        return -1;
> +    }
> +    VIR_FORCE_FCLOSE(fp);
> +
> +    memset(&cmt_attr, 0, sizeof(struct perf_event_attr));
> +    cmt_attr.size = sizeof(struct perf_event_attr);
> +    cmt_attr.type = event_type;
> +    cmt_attr.config = 1;
> +    cmt_attr.inherit = 1;
> +    cmt_attr.disabled = 1;
> +    cmt_attr.enable_on_exec = 0;
> +
> +    priv->cmt_fd = sys_perf_event_open(&cmt_attr, vm->pid, -1, -1, 0);
> +    if (priv->cmt_fd < 0) {
> +        virReportSystemError(errno,
> +                             _("Unable to open perf type=%d for pid=%d"),
> +                             event_type, vm->pid);
> +        return -1;
> +    }
> +
> +    if (ioctl(priv->cmt_fd, PERF_EVENT_IOC_ENABLE) < 0) {
> +        virReportSystemError(errno, "%s",
> +                             _("Unable to enable perf event for CMT"));
> +        return -1;
> +    }
> +
> +    return 0;
> +}
> +#else
> +static int qemuCmtEnable(virDomainObjPtr vm)
> +{
> +    virReportUnsupportedError();
> +    return -1;
> +}
> +#endif
> +
>  int qemuProcessStart(virConnectPtr conn,
>                       virQEMUDriverPtr driver,
>                       virDomainObjPtr vm,
> @@ -4954,6 +5026,11 @@ int qemuProcessStart(virConnectPtr conn,
>      if (virDomainSaveStatus(driver->xmlopt, cfg->stateDir, vm) < 0)
>          goto cleanup;
>
> +    VIR_DEBUG("Setting CMT perf counter");
> +    if (qemuCmtEnable(vm) < 0)
> +        virReportSystemError(errno, "%s",
> +                             _("CMT is not available on this host"));
> +
>      /* finally we can call the 'started' hook script if any */
>      if (virHookPresent(VIR_HOOK_DRIVER_QEMU)) {
>          char *xml = qemuDomainDefFormatXML(driver, vm->def, 0);
> @@ -5122,6 +5199,15 @@ void qemuProcessStop(virQEMUDriverPtr driver,
>      virPortAllocatorRelease(driver->migrationPorts, priv->nbdPort);
>      priv->nbdPort = 0;
>
> +    /* Disable CMT */
> +    if (priv->cmt_fd > 0) {
> +        if (ioctl(priv->cmt_fd, PERF_EVENT_IOC_DISABLE) < 0) {
> +            virReportSystemError(errno, "%s",
> +                                 _("Unable to disable perf event for CMT"));
> +        }
> +        VIR_FORCE_CLOSE(priv->cmt_fd);
> +    }
> +
>      if (priv->agent) {
>          qemuAgentClose(priv->agent);
>          priv->agent = NULL;

Regards,
Daniel
-- 
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

--
libvir-list mailing list
libvir-list@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/libvir-list