Re: [PATCH v9 3/5] KVM: stats: Add documentation for statistics data binary interface

Paolo Bonzini <pbonzini@xxxxxxxxxx> · Wed, 16 Jun 2021 18:59:15 +0200

On 16/06/21 17:21, Greg KH wrote:
I forgot to comment on this one, sorry for the delay.

Why are you "inventing" your own schema format here for this?  Why not
use a well-known or at least well-designed/implemented one that we have
in userspace already?

There are a few that I would love to see in the kernel, varlink being
the best example.  We have kernel examples of this and I would consider
using that as a transport for sysfs-like data in the future, but never
got around to it.

Thanks, that's a good observation, and it's a problem that the rationale 
and the design process didn't end up in either the documentation or the 
commit message (only the outcome did).  In order to fix that, this is 
going to be quite a long message.

Varlink comprises both a schema and an encoding, and each has its own 
problems.  For the encoding of the data, varlink is really just using 
JSON and it is really just the wrong tool here.  The first few problems 
that come to mind are:

- varlink structs are encoded as JSON dictionaries.  Therefore, every 
time userspace reads the fields, the kernel has to include the field 
names as JSON dictionary keys.  This means that a lot of time is spent 
writing buffers, and on the receiving side parsing them.

- because numeric data has to be converted to ASCII the output doesn't 
have fixed offsets, so it is not possible to make an efficient 
implementation of pread.

- even though Varlink specifies that int is "usually int64", a 
little-known gem is that JSON behavior for numbers not representable as 
a double (i.e. exceeding 2^53) is implementation-defined; 
implementations can and will mess up values outside that range.  This 
problem is not specific to this KVM stats usecase; varlink's schema 
specification really is at odds with its encoding specification.

For the schema, there are some specific problems with varlink, but also 
a more generic issue.  The specific problems are:

- the schema doesn't include the length of arrays.  This makes it hard 
to compute in advance lengths and offsets of fields (even ignoring the 
fact that data is not binary, which I'll get to later)

- the schema also is not extensible with user annotations.  In our case 
that would be mostly the unit in which the value is expressed.

The main problem with the various available serialization formats is 
that their schema languages are designed to be compiled with a code 
generator.  Here instead the schema is transmitted from the kernel to 
userspace.  Userspace is not really supposed to know what a value means 
or even that if it exists.  Userspace takes care of collecting the data 
from the kernel, but ultimately there will be a human that knows what 
e.g. "io_exits" or "req_event" means, and they will ask for the current 
value, or a plot over time, of a specific statistic.

Now, unlike most other schema languages, varlink does not require 
precompiling the schema in its C bindings.  However, this was mostly 
just the authors not bothering to write a C code generator---they did 
write one for Rust, for example.  When using the "official" varlink C 
bindings, you hardly use the schema at all.

Something similar to the above issues is quite common in other formats. 
 For example, FlatBuffers[2]'s schema language[1] does have annotations 
(which it calls metadata), but then we would still have to invent some 
standard annotations and teach programs about them.  However, the deal 
breaker is that again there is no way to transmit the schema from the 
server to the client, and that users are supposed to precompile the 
schema using a code generator.

All that said, what we _could_ do is serialize the schema as JSON 
instead of using a binary format, like so:

  [{
    "name": "exits",
    "kind": "cumulative",
    "count": 1
  }, {
    "name": "halt_poll_fail_ns",
    "kind": "cumulative",
    "unit": "seconds",
    "scale": {"base": "pow2", exponent: -9},
    "count": 1
  }, ...]

while keeping the actual statistics as an array of u64 values just like 
in these patches.  The JSON representation of the schema would be always 
the same, so it could be treated as fixed-size and an efficient 
implementation of pread would be possible.  And once the schema is JSON, 
there could be a *meta*-schema expressed using the varlink language:

  # compare to struct kvm_stats_desc, included after my sig
  # for reference
  type StatsDescriptor {
    name: string,
    kind: (cumulative, instant, peak),
    unit: ?(bytes, seconds, cycles),
    scale: ?(base: (pow2, pow10), exponent: int),
    count: int
  }

  type StatsSchema {
    name: string,
    stats: []StatsDescriptor
  }

Varlink would *not* be used in the kernel, because Varlink isreally 
just JSON.  The above ten lines (plus comments) would be dropped in 
Documentation/ as a .varlink file; they would be a convenient way to 
describe (to either a human or a program) how to parse the JSON schema.

There would then be another problem, namely how to include the varlink 
schema in the generated documentation, since right now there are for 
example no sphinx bindings for varlink.  Nevertheless, if you think it's 
better to have the schema as JSON instead of binary, according to the 
above varlink meta-schema, then we can look into it.

Thanks,

Paolo

+	struct kvm_stats_desc {
+		__u32 flags;
+		__s16 exponent;
+		__u16 size;
+		__u32 offset;
+		__u32 unused;
+		char name[0];
+	};