Re: [PATCH v9 0/5] KVM statistics data fd-based binary interface

Paolo Bonzini <pbonzini@xxxxxxxxxx> · Tue, 15 Jun 2021 13:31:10 +0200

On 15/06/21 10:37, Enrico Weigelt, metux IT consult wrote:
> * why is it binary instead of text ? is it so very high volume that
>    it really matters ?

The main reason to have a binary format is actually not the high volume
(though that also plays a part).  Rather, we would really like to include
the schema to make the statistics self-describing.  This includes stuff 
like whether the unit of measure of a statistic is clock cycles, 
nanoseconds, pages or whatnot; having this kind of information in text 
leads to awkwardness in the parsers.  trace-cmd is another example where 
the data consists of a schema followed by binary data.
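
To illustrate the "schema followed by binary data" idea, here is a minimal
Python sketch.  The layout (a 32-bit descriptor count, fixed-size
descriptors holding a name, a unit code and a data offset, then 64-bit
values) is invented for this example and is not the layout used by the
patch series; the unit codes, the statistic names and the helpers
pack_stats() and parse_stats() are all hypothetical.

```python
import struct

# Hypothetical unit codes, invented for this sketch.
UNITS = {0: "none", 1: "cycles", 2: "nanoseconds", 3: "pages"}
UNIT_CODES = {name: code for code, name in UNITS.items()}

HEADER = struct.Struct("<I")      # number of descriptors
DESC = struct.Struct("<32sII")    # name, unit code, offset into data area
VALUE = struct.Struct("<Q")       # each statistic is one unsigned 64-bit value


def pack_stats(stats):
    """Serialize [(name, unit, value), ...] as header + schema + data."""
    descs = b""
    data = b""
    for name, unit, value in stats:
        # The descriptor records where this statistic's value lives.
        descs += DESC.pack(name.encode(), UNIT_CODES[unit], len(data))
        data += VALUE.pack(value)
    return HEADER.pack(len(stats)) + descs + data


def parse_stats(blob):
    """Return {name: (value, unit)} using only what the blob describes."""
    (count,) = HEADER.unpack_from(blob, 0)
    data_start = HEADER.size + count * DESC.size
    result = {}
    for i in range(count):
        raw_name, unit_code, offset = DESC.unpack_from(
            blob, HEADER.size + i * DESC.size
        )
        name = raw_name.rstrip(b"\0").decode()
        (value,) = VALUE.unpack_from(blob, data_start + offset)
        result[name] = (value, UNITS.get(unit_code, "unknown"))
    return result
```

The parser needs no built-in list of statistics: names, units and
offsets all come from the descriptors, which is the property that is
awkward to get from a text format.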

Text format could certainly be added if there's a use case, but for 
developer use debugfs is usually a suitable replacement.

Last year we tried the opposite direction: we built a one-value-per-file 
filesystem with a common API that any subsystem could use (e.g. 
providing ethtool stats, /proc/interrupts, etc. in addition to KVM 
stats).  We started with text, similar to sysfs, with the plan of 
extending it to a binary format later.  However, other subsystems 
expressed very little interest in this, so instead we decided to go with 
something that is designed around KVM needs.

Still, the binary format that KVM uses is designed not to be 
KVM-specific.  If other subsystems want to publish high-volume, 
self-describing statistics, they are welcome to share the 
binary format and the code.  Perhaps it may make sense in some cases to 
have them in sysfs, even (e.g. /sys/kernel/slab/*/.stats).  As Greg said, 
sysfs is currently one value per file, but perhaps that could be changed 
if the binary format is an additional way to access the information and 
not the only one (not that I'm planning to do it).

> * how will possible future extensions of the telemetry packets work ?

The format includes a schema, so it's possible to add more statistics in 
the future.  The exact list of statistics varies per architecture and is 
not part of the userspace API (obvious caveat: https://xkcd.com/1172/).
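
A rough sketch of what this means for a reader, in Python: if userspace
looks statistics up by the name given in the schema rather than by
position, adding a statistic does not break it.  The schema is reduced
here to a plain list of names, and read_selected() and the statistic
names are hypothetical.

```python
def read_selected(schema, values, wanted):
    """Look statistics up by name, never by position.

    schema: list of statistic names, in the order the data is laid out
    values: list of values in the same order
    wanted: names the caller cares about; missing ones map to None
    """
    by_name = dict(zip(schema, values))
    return {name: by_name.get(name) for name in wanted}


# An older kernel publishes two statistics ...
old = read_selected(["exits", "wait_ns"], [10, 500], ["exits", "wait_ns"])

# ... a newer one inserts a third in the middle; the reader is unaffected.
new = read_selected(
    ["exits", "new_counter", "wait_ns"], [10, 3, 500], ["exits", "wait_ns"]
)
```

A statistic that is absent on a given architecture simply comes back as
None, so the caller has to be prepared for that case.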

> * aren't there other means to get this fd instead of an ioctl() on the
>    VM fd ? something more from the outside (eg. sysfs/procfs)

Not yet, but if there's a need it can be added.  It'd be plausible to 
publish system-wide statistics via an ioctl on /dev/kvm, for example. 
We'd have to check how this compares with stuff that is world-readable 
in procfs and sysfs, but I don't think there are security concerns in 
exposing that.

There's also pidfd_getfd(2) which can be used to pull a VM file 
descriptor from another running process.  That can be used to avoid the 
issue of KVM file descriptors being unnamed.

> * how will that relate to other hypervisors ?

Other hypervisors do not run as part of the Linux kernel (at least they 
are not upstream).  These statistics only apply to Linux *hosts*, not 
guests.

As far as I know, there is no standard that Xen or the proprietary 
hypervisors use to communicate their telemetry info to monitoring tools, 
and also no standard binary format used by exporters to talk to 
monitoring tools.  If this format is adopted by other hypervisors 
or any random software, I will be happy.

> Some notes from the operations perspective:
>
> In typical datacenters we've got various monitoring tools that are able
> to pick up lots of data from different sources (especially files). If
> an operator e.g. is interested in something happening in some file
> (e.g. in /proc or /sys), it's quite trivial - just configure yet another
> probe (maybe some regex for parsing) and done. Automatically fed into his
> $monitoring_solution (e.g. nagios, ELK, Splunk, whatnot)

... but in practice what you do is you have prebuilt exporters that 
talk to $monitoring_solution.  Monitoring individual files is the 
exception, not the rule.  But indeed Libvirt already has I/O and network 
statistics and there is an exporter for Prometheus, so we should add 
support for this new method as well to both QEMU (exporting the file 
descriptor) and Libvirt.

I hope this helps clarify your doubts!

Paolo

> With your approach, it's not that simple: now the operator needs to
> create (and deploy and manage) a separate agent that somehow receives
> that fd from the VMM, reads and parses that specific binary stream
> and finally pushes it into the monitoring infrastructure. Or the VMM
> writes it into some file, where some monitoring agent can pick it up.
> In any case, not actually trivial from an ops perspective.