On Mon, May 4, 2020 at 4:05 AM Emanuele Giuseppe Esposito <eesposit@xxxxxxxxxx> wrote: ... > Statsfs offers a generic and stable API, allowing any kind of > directory/file organization and supporting multiple kind of aggregations > (not only sum, but also average, max, min and count_zero) and data types > (all unsigned and signed types plus boolean). The implementation, which is > a generalization of KVM’s debugfs statistics code, takes care of gathering > and displaying information at run time; users only need to specify the > values to be included in each source. > > Statsfs would also be a different mountpoint from debugfs, and would not > suffer from limited access due to the security lock down patches. Its main > function is to display each statistics as a file in the desired folder > hierarchy defined through the API. Statsfs files can be read, and possibly > cleared if their file mode allows it. > > Statsfs has two main components: the public API defined by > include/linux/statsfs.h, and the virtual file system which should end up > in /sys/kernel/stats. This is good work. As David Rientjes mentioned, I'm currently investigating a similar project, based on a google-internal debugfs-based FS we call "metricfs". It's designed in a slightly different fashion than statsfs here is, and the statistics exported are mostly fed into our OpenTelemetry-like system. We're motivated by wanting an upstreamed solution, so that we can upstream the metrics we create that are of general interest, and lower the overall rebasing burden for our tree. Some feedback on your design as proposed: - the 8/16/32/64 signed/unsigned integers seems like a wart, and the built-in support to grab any offset from a structure doesn't seem like much of an advantage. A simpler interface would be to just support an "integer" (possibly signed/unsigned) type, which is always 64-bit, and allow the caller to provide a function pointer to retrieve the value, with one or two void *s cbargs. Then the framework could provide an offset-based callback (or callbacks) similar to the existing functionality, and a similar one for per-CPU based statistics. A second "clear" callback could be optionally provided to allow for statistics to be cleared, as in your current proposal. - A callback-style interface also allows for a lot more flexibility in sourcing values, and doesn't lock your callers into one way of storing them. You would, of course, have to be clear about locking rules etc. for the callbacks. - Beyond the statistic's type, one *very* useful piece of metadata for telemetry tools is knowing whether a given statistic is "cumulative" (an unsigned counter which is only ever increased), as opposed to a floating value (like "amount of memory used"). I agree with the folks asking for a binary interface to read statistics, but I also agree that it can be added on later. I'm more concerned with getting the statistics model and capabilities right from the beginning, because those are harder to adjust later. Would you be open to collaborating on the statsfs design? As background for this discussion, here are some details of how our metricfs implementation approaches statistics: 1. Each metricfs metric can have one or two string or integer "keys". If these exist, they expand the metric from a single value into a multi-dimensional table. For example, we use this to report a hash table we keep of functions calling "WARN()", in a 'warnings' statistic: % cat .../warnings/values x86_pmu_stop 1 % Indicates that the x86_pmu_stop() function has had a WARN() fire once since the system was booted. If multiple functions have fired WARN()s, they are listed in this table with their own counts. [1] We also use these to report per-CPU counters on a CPU-by-CPU basis: % cat .../irq_x86/NMI/values 0 42 1 18 ... one line per cpu % 2. We also export some metadata about each statistic. For example, the metadata for the NMI counter above looks like: % cat .../NMI/annotations DESCRIPTION Non-maskable\ interrupts CUMULATIVE % cat .../NMI/fields cpu value int int % (Describing the statistic, marking it as "cumulative", and saying the fields are "cpu" and "value", both ints). The metadata doesn't change much, so having separate files allows the user-space agent to read them once and then the values multiple times. 3. We have a (very few) statistics where the value itself is a string, usually for device statuses. For our use cases, we generally don't both output a statistic and it's aggregation from the kernel; either we sum up things in the kernel (e.g. over a bunch of per-cpu or per-memcg counters) and only have the result statistic, or we expect user-space to sum up the data if it's interested. The tabular form makes it pretty easy to do so (i.e. you can use awk(1) to sum all of the per-cpu NMI counters). We don't generally reset statistics, except as a side effect of removing a device. Thanks again for the patchset, and for pointing out that KVM also needs statistics sent out; it's great that there is interest in this. Cheers, - jonathan P.S. I also have a couple (non-critical) high-level notes: * It's not clear what tree your patches are against, or their dependencies; I was able to get them to apply to linux-next master with a little massaging, but then they failed to compile because they're built on top of your "libfs: group and simplify linux fs code" patch series you sent out in late april. Including a git link or at least a baseline tree and a list of the patch series you rely upon is helpful for anyone wanting to try out your changes. * The main reason I was trying to try out your patches was to get a sense of the set of directories and things the KVM example generates; while it's apparently the same as the existing KVM debugfs tree, it's useful to know how this ends up looking on a real system, and I'm not familiar with the KVM stats. Since this patch is intended slightly more broadly than just KVM, it might have been useful to include sample output for those not familiar with how things are today. [1] We also use this to export various network/storage statistics on a per-device basis. e.g. network bytes received counts: % cat .../rx_bytes/values lo 501360681 eth0 1457631256 ... %