On Thu, Oct 29, 2015 at 02:02:29PM +0800, Qiaowei Ren wrote:
> One RFC in
> https://www.redhat.com/archives/libvir-list/2015-June/msg01509.html
>
> CMT (Cache Monitoring Technology) can be used to measure the
> usage of cache by VM running on the host. This patch will
> extend the bulk stats API (virDomainListGetStats) to add this
> field. Applications based on libvirt can use this API to achieve
> cache usage of VM. Because CMT implementation in Linux kernel
> is based on perf mechanism, this patch will enable perf event
> for CMT when VM is created and disable it when VM is destroyed.
>
> Signed-off-by: Qiaowei Ren <qiaowei.ren@xxxxxxxxx>

Thanks for re-sending this patchset, it has reminded me of the
concerns / questions I had around this previously.

Just ignoring the code for a minute, IIUC the design is

 - Open a file handle to the kernel perf system for each running VM
 - Associate that perf event file handle with the QEMU VM PID
 - Enable recording of the CMT perf event on that file handle
 - Report the CMT event values in the virDomainGetStats() API call
   when VIR_DOMAIN_STATS_CACHE is requested

My two primary concerns are

 1. Do we want to have a perf event FD open for every running VM
    all the time?
 2. Is the virDomainGetStats() integration the right API approach?

For item 1, my concern is that the CMT event is only ever going to be
consumed by OpenStack, and even then, only by OpenStack installs which
have the scheduler plugin that cares about the CMT event data. It feels
undesirable to have this perf system enabled for all libvirt VMs, when
perhaps < 1% of libvirt users actually want this data. It feels like we
need some mechanism to decide when this event is enabled.

For item 2, my concern is whether virDomainGetStats is the right API.
I think it probably *is* the right API, since I can't think of a better
way. Should we, however, have a very special case VIR_DOMAIN_STATS_CACHE
group, or should we have something more generic?

For example, if I run 'perf list' I see

  List of pre-defined events (to be used in -e):

    branch-instructions OR branches                  [Hardware event]
    branch-misses                                    [Hardware event]
    bus-cycles                                       [Hardware event]
    cache-misses                                     [Hardware event]
    cache-references                                 [Hardware event]
    cpu-cycles OR cycles                             [Hardware event]
    instructions                                     [Hardware event]
    ref-cycles                                       [Hardware event]
    stalled-cycles-frontend OR idle-cycles-frontend  [Hardware event]

    alignment-faults                                 [Software event]
    context-switches OR cs                           [Software event]
    cpu-clock                                        [Software event]
    cpu-migrations OR migrations                     [Software event]
    dummy                                            [Software event]
    emulation-faults                                 [Software event]
    major-faults                                     [Software event]
    minor-faults                                     [Software event]

    ...and many many more...

Does it make sense to extend the virDomainStats API to *only* deal with
reporting of the one specific perf event that you care about right now?
It feels like it might be better if we did something more general
purpose, e.g. what if something wants to get 'major-faults' data in
future? Do we then add a VIR_DOMAIN_STATS_MAJOR_FAULT enum item, and so
on for every event?

Combining these two concerns, I think we might need two things:

 - A new API to turn on/off collection of specific perf events.
   This could be something like

       virDomainGetPerfEvents(virDomainPtr dom,
                              virTypedParameterPtr params);

   This would fill virTypedParameters with one entry for each perf
   event, using the VIR_TYPED_PARAM_BOOLEAN type. A 'true' value
   would indicate that event is enabled for the VM.

   A corresponding

       virDomainSetPerfEvents(virDomainPtr dom,
                              virTypedParameterPtr params);

   would enable you to toggle the flag, to enable/disable the
   particular list of perf events you care about.

 - With that, we could have a 'VIR_DOMAIN_STATS_PERF_EVENT' enum item
   for virDomainStats which causes reporting of all previously enabled
   perf events.

This would avoid us needing to have the perf event enabled for all VMs
all the time. Only applications using libvirt which actually need the
data would turn it on. It would also be scalable to all types of perf
event, instead of just one specific event.
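
To make that proposal a little more concrete, here is a rough usage
sketch from the application side. This is only illustrative - the
signature follows my sketch above with an extra nparams argument
assumed, and the "cmt" event name is just a hypothetical example of
how an event might be identified:

    #include <libvirt/libvirt.h>

    /* Hypothetical usage of the proposed API. virDomainSetPerfEvents()
     * does not exist yet, the extra nparams argument is assumed, and
     * the "cmt" event name is only an illustration. */
    static int
    enableCacheMonitoring(virDomainPtr dom)
    {
        virTypedParameterPtr params = NULL;
        int nparams = 0;
        int maxparams = 0;
        int ret = -1;

        /* Request collection of just the one event this app cares about */
        if (virTypedParamsAddBoolean(&params, &nparams, &maxparams,
                                     "cmt", 1) < 0)
            goto cleanup;

        /* Proposed API, not an existing libvirt entry point */
        if (virDomainSetPerfEvents(dom, params, nparams) < 0)
            goto cleanup;

        ret = 0;
     cleanup:
        virTypedParamsFree(params, nparams);
        return ret;
    }
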
> diff --git a/src/qemu/qemu_domain.h b/src/qemu/qemu_domain.h
> index 54e1e7b..31bce33 100644
> --- a/src/qemu/qemu_domain.h
> +++ b/src/qemu/qemu_domain.h
> @@ -196,6 +196,9 @@ struct _qemuDomainObjPrivate {
>
>      bool hookRun; /* true if there was a hook run over this domain */
>
> +    int cmt_fd; /* perf handler for CMT */
> +
> +
>      /* Bitmaps below hold data from the auto NUMA feature */
>      virBitmapPtr autoNodeset;
>      virBitmapPtr autoCpuset;
> diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
> index 4cfae03..8c678c9 100644
> --- a/src/qemu/qemu_driver.c
> +++ b/src/qemu/qemu_driver.c
> @@ -19320,6 +19320,53 @@ qemuDomainGetStatsBlock(virQEMUDriverPtr driver,
>
>  #undef QEMU_ADD_COUNT_PARAM
>
> +static int
> +qemuDomainGetStatsCache(virQEMUDriverPtr driver ATTRIBUTE_UNUSED,
> +                        virDomainObjPtr dom,
> +                        virDomainStatsRecordPtr record,
> +                        int *maxparams,
> +                        unsigned int privflags ATTRIBUTE_UNUSED)
> +{
> +    qemuDomainObjPrivatePtr priv = dom->privateData;
> +    FILE *fd;
> +    unsigned long long cache = 0;
> +    int scaling_factor = 0;
> +
> +    if (priv->cmt_fd <= 0)
> +        return -1;
> +
> +    if (read(priv->cmt_fd, &cache, sizeof(uint64_t)) < 0) {
> +        virReportSystemError(errno, "%s",
> +                             _("Unable to read cache data"));
> +        return -1;
> +    }
> +
> +    fd = fopen("/sys/devices/intel_cqm/events/llc_occupancy.scale", "r");
> +    if (!fd) {
> +        virReportSystemError(errno, "%s",
> +                             _("Unable to open CMT scale file"));
> +        return -1;
> +    }
> +    if (fscanf(fd, "%d", &scaling_factor) != 1) {
> +        virReportSystemError(errno, "%s",
> +                             _("Unable to read CMT scale file"));
> +        VIR_FORCE_FCLOSE(fd);
> +        return -1;
> +    }
> +    VIR_FORCE_FCLOSE(fd);
> +
> +    cache *= scaling_factor;
> +
> +    if (virTypedParamsAddULLong(&record->params,
> +                                &record->nparams,
> +                                maxparams,
> +                                "cache.current",
> +                                cache) < 0)
> +        return -1;
> +
> +    return 0;
> +}
> +
>  typedef int
>  (*qemuDomainGetStatsFunc)(virQEMUDriverPtr driver,
>                            virDomainObjPtr dom,
> @@ -19340,6 +19387,7 @@ static struct qemuDomainGetStatsWorker qemuDomainGetStatsWorkers[] = {
>      { qemuDomainGetStatsVcpu, VIR_DOMAIN_STATS_VCPU, false },
>      { qemuDomainGetStatsInterface, VIR_DOMAIN_STATS_INTERFACE, false },
>      { qemuDomainGetStatsBlock, VIR_DOMAIN_STATS_BLOCK, true },
> +    { qemuDomainGetStatsCache, VIR_DOMAIN_STATS_CACHE, false },
>      { NULL, 0, false }
>  };
>
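
As an aside on the consumer side of this hunk, an application would
presumably end up pulling the value out of the bulk stats API roughly
as in the sketch below (assuming the VIR_DOMAIN_STATS_CACHE flag and
the "cache.current" field this patch proposes, neither of which exists
upstream yet):

    #include <stdio.h>
    #include <libvirt/libvirt.h>

    /* Sketch only: VIR_DOMAIN_STATS_CACHE and the "cache.current"
     * field are the ones proposed by this patch, not upstream API. */
    static void
    dumpCacheOccupancy(virConnectPtr conn)
    {
        virDomainStatsRecordPtr *records = NULL;
        int nrecords;
        int i;
        unsigned long long cache;

        nrecords = virConnectGetAllDomainStats(conn,
                                               VIR_DOMAIN_STATS_CACHE,
                                               &records, 0);
        if (nrecords < 0)
            return;

        for (i = 0; i < nrecords; i++) {
            /* Each record carries the typed params for one domain */
            if (virTypedParamsGetULLong(records[i]->params,
                                        records[i]->nparams,
                                        "cache.current", &cache) == 1)
                printf("%s: cache.current=%llu\n",
                       virDomainGetName(records[i]->dom), cache);
        }

        virDomainStatsRecordListFree(records);
    }
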
> diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
> index ba84182..00b889d 100644
> --- a/src/qemu/qemu_process.c
> +++ b/src/qemu/qemu_process.c
> @@ -25,8 +25,11 @@
>  #include <unistd.h>
>  #include <signal.h>
>  #include <sys/stat.h>
> +#include <sys/ioctl.h>
> +#include <sys/syscall.h>
>  #if defined(__linux__)
>  # include <linux/capability.h>
> +# include <linux/perf_event.h>
>  #elif defined(__FreeBSD__)
>  # include <sys/param.h>
>  # include <sys/cpuset.h>
> @@ -4274,6 +4277,75 @@ qemuLogOperation(virDomainObjPtr vm,
>      goto cleanup;
>  }
>
> +/*
> + * Enable CMT (Cache Monitoring Technology) to measure the usage of
> + * cache by a VM running on the node.
> + *
> + * Because the hypervisor implements CMT support based on the perf
> + * mechanism, we should enable a perf event for CMT. The function
> + * 'sys_perf_event_open' is a perf syscall wrapper.
> + */
> +#ifdef __linux__
> +static long sys_perf_event_open(struct perf_event_attr *hw_event,
> +                                pid_t pid, int cpu, int group_fd,
> +                                unsigned long flags)
> +{
> +    return syscall(__NR_perf_event_open, hw_event, pid, cpu,
> +                   group_fd, flags);
> +}
> +static int qemuCmtEnable(virDomainObjPtr vm)
> +{
> +    qemuDomainObjPrivatePtr priv = vm->privateData;
> +    struct perf_event_attr cmt_attr;
> +    int event_type;
> +    FILE *fp;
> +
> +    fp = fopen("/sys/devices/intel_cqm/type", "r");
> +    if (!fp) {
> +        virReportSystemError(errno, "%s",
> +                             _("CMT is not available on this host"));
> +        return -1;
> +    }
> +    if (fscanf(fp, "%d", &event_type) != 1) {
> +        virReportSystemError(errno, "%s",
> +                             _("Unable to read event type file."));
> +        VIR_FORCE_FCLOSE(fp);
> +        return -1;
> +    }
> +    VIR_FORCE_FCLOSE(fp);
> +
> +    memset(&cmt_attr, 0, sizeof(struct perf_event_attr));
> +    cmt_attr.size = sizeof(struct perf_event_attr);
> +    cmt_attr.type = event_type;
> +    cmt_attr.config = 1;
> +    cmt_attr.inherit = 1;
> +    cmt_attr.disabled = 1;
> +    cmt_attr.enable_on_exec = 0;
> +
> +    priv->cmt_fd = sys_perf_event_open(&cmt_attr, vm->pid, -1, -1, 0);
> +    if (priv->cmt_fd < 0) {
> +        virReportSystemError(errno,
> +                             _("Unable to open perf type=%d for pid=%d"),
> +                             event_type, vm->pid);
> +        return -1;
> +    }
> +
> +    if (ioctl(priv->cmt_fd, PERF_EVENT_IOC_ENABLE) < 0) {
> +        virReportSystemError(errno, "%s",
> +                             _("Unable to enable perf event for CMT"));
> +        return -1;
> +    }
> +
> +    return 0;
> +}
> +#else
> +static int qemuCmtEnable(virDomainObjPtr vm)
> +{
> +    virReportUnsupportedError();
> +    return -1;
> +}
> +#endif
> +
>  int qemuProcessStart(virConnectPtr conn,
>                       virQEMUDriverPtr driver,
>                       virDomainObjPtr vm,
> @@ -4954,6 +5026,11 @@ int qemuProcessStart(virConnectPtr conn,
>      if (virDomainSaveStatus(driver->xmlopt, cfg->stateDir, vm) < 0)
>          goto cleanup;
>
> +    VIR_DEBUG("Setting CMT perf counter");
> +    if (qemuCmtEnable(vm) < 0)
> +        virReportSystemError(errno, "%s",
> +                             _("CMT is not available on this host"));
> +
>      /* finally we can call the 'started' hook script if any */
>      if (virHookPresent(VIR_HOOK_DRIVER_QEMU)) {
>          char *xml = qemuDomainDefFormatXML(driver, vm->def, 0);
> @@ -5122,6 +5199,15 @@ void qemuProcessStop(virQEMUDriverPtr driver,
>      virPortAllocatorRelease(driver->migrationPorts, priv->nbdPort);
>      priv->nbdPort = 0;
>
> +    /* Disable CMT */
> +    if (priv->cmt_fd > 0) {
> +        if (ioctl(priv->cmt_fd, PERF_EVENT_IOC_DISABLE) < 0) {
> +            virReportSystemError(errno, "%s",
> +                                 _("Unable to disable perf event for CMT"));
> +        }
> +        VIR_FORCE_CLOSE(priv->cmt_fd);
> +    }
> +
>      if (priv->agent) {
>          qemuAgentClose(priv->agent);
>          priv->agent = NULL;

Regards,
Daniel
-- 
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

--
libvir-list mailing list
libvir-list@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/libvir-list