[Answering for Emanuele because he's not available until Monday]

On 07/05/20 19:45, Jonathan Adams wrote:
> This is good work. As David Rientjes mentioned, I'm currently
> investigating a similar project, based on a google-internal
> debugfs-based FS we call "metricfs". It's designed in a slightly
> different fashion than statsfs here is, and the statistics exported
> are mostly fed into our OpenTelemetry-like system. We're motivated by
> wanting an upstreamed solution, so that we can upstream the metrics
> we create that are of general interest, and lower the overall
> rebasing burden for our tree.

Cool. We included a public reading API exactly so that there could be
other "frontends". I was mostly thinking of BPF as an in-tree user,
but your metricfs could definitely use the reading API.

> - the 8/16/32/64 signed/unsigned integers seem like a wart, and the
>   built-in support to grab any offset from a structure doesn't seem
>   like much of an advantage. A simpler interface would be to just
>   support an "integer" (possibly signed/unsigned) type, which is
>   always 64-bit, and allow the caller to provide a function pointer
>   to retrieve the value, with one or two void *s cbargs. Then the
>   framework could provide an offset-based callback (or callbacks)
>   similar to the existing functionality, and a similar one for
>   per-CPU based statistics. A second "clear" callback could be
>   optionally provided to allow for statistics to be cleared, as in
>   your current proposal.

Ok, so basically splitting get_simple_value into many separate
callbacks.
The callbacks would be in a struct like

	struct stats_fs_type {
		uint64_t (*get)(struct stats_fs_value *, void *);
		void (*clear)(struct stats_fs_value *, void *);
		bool is_signed;
	};

	static uint64_t stats_fs_get_u8(struct stats_fs_value *val,
					void *base)
	{
		return *((uint8_t *)(base + (uintptr_t)val->arg));
	}

	static void stats_fs_clear_u8(struct stats_fs_value *val,
				      void *base)
	{
		*((uint8_t *)(base + (uintptr_t)val->arg)) = 0;
	}

	struct stats_fs_type stats_fs_type_u8 = {
		stats_fs_get_u8,
		stats_fs_clear_u8,
		false
	};

and custom types can be defined using "&(struct stats_fs_type) {...}".

> - Beyond the statistic's type, one *very* useful piece of metadata
>   for telemetry tools is knowing whether a given statistic is
>   "cumulative" (an unsigned counter which is only ever increased),
>   as opposed to a floating value (like "amount of memory used").

Good idea. Also, clearing does not make sense for a floating value,
so we can use cumulative/floating to get a default for the mode: KVM
statistics for example are mostly cumulative and mode 644, except a
few that are floating, and those are all mode 444. Therefore it makes
sense to add cumulative/floating even before outputting it as
metadata.

> I'm more concerned with getting the statistics model and
> capabilities right from the beginning, because those are harder to
> adjust later.

Agreed.

> 1. Each metricfs metric can have one or two string or integer
>    "keys". If these exist, they expand the metric from a single
>    value into a multi-dimensional table. For example, we use this to
>    report a hash table we keep of functions calling "WARN()", in a
>    'warnings' statistic:
>
>    % cat .../warnings/values
>    x86_pmu_stop 1
>    %
>
>    Indicates that the x86_pmu_stop() function has had a WARN() fire
>    once since the system was booted. If multiple functions have
>    fired WARN()s, they are listed in this table with their own
>    counts.
> [1] We also use these to report per-CPU counters on a CPU-by-CPU
>     basis:
>
>     % cat .../irq_x86/NMI/values
>     0 42
>     1 18
>     ... one line per cpu
>     % cat .../rx_bytes/values
>     lo 501360681
>     eth0 1457631256

These seem like two different things.

The per-CPU and per-interface values are best represented as
subordinate sources, one per CPU and one per interface. For
interfaces I would just use a separate directory, but that doesn't
really make sense for CPUs. So if we can cater for it in the model,
it's better. For example:

- add a new argument to statsfs_create_source and
  statsfs_create_values that makes it not create directories and
  files respectively;

- add a new "aggregate function" STATS_FS_LIST that directs the
  parent to build a table of all the simple values below it.

We can also add a helper statsfs_add_values_percpu that creates a new
source for each CPU, I think.

The warnings one instead is a real hash table. It should be possible
to implement it as some kind of customized aggregation that is
implemented in the client instead of coming from subordinate sources.
The presentation can then just use STATS_FS_LIST. I don't see
anything in the design that is a blocker.

> 2. We also export some metadata about each statistic. For example,
>    the metadata for the NMI counter above looks like:
>
>    % cat .../NMI/annotations
>    DESCRIPTION Non-maskable\ interrupts
>    CUMULATIVE
>    % cat .../NMI/fields
>    cpu value
>    int int
>    %

Good idea. I would prefer per-directory dot-named files for this. For
example, a hypothetical statsfs version of /proc/interrupts could
look like this:

	$ cat /sys/kernel/stats/interrupts/.schema
	0                          // Name
	CUMULATIVE                 // Flags
	int:int                    // Type(s)
	IR-IO-APIC 2-edge timer    // Description
	...
	LOC
	CUMULATIVE
	int:int
	Local timer interrupts
	...
	$ cat /sys/kernel/stats/interrupts/LOC
	0 4286815
	1 4151572
	2 4199361
	3 4229248

> 3. We have a (very few) statistics where the value itself is a
>    string, usually for device statuses.
Maybe in addition to CUMULATIVE and FLOATING we can have ENUM
properties, and a table to convert those enums to strings.
Aggregation could also be used to make a histogram out of enums in
subordinate sources, e.g.

	$ cat /sys/kernel/stats/kvm/637-1/vcpu_state
	running 12
	uninitialized 0
	halted 4

So in general I'd say the sources/values model holds up. We certainly
want to:

- switch immediately to callbacks instead of the type constants (so
  that core statsfs code only does signed/unsigned);

- add a field to distinguish cumulative and floating properties (and
  use it to determine the default file mode);

- add a new argument to statsfs_create_source and
  statsfs_create_values that makes it not create directories and
  files respectively;

- add a new API to look for a statsfs_value recursively in all the
  subordinate sources, and pass the source/value pair to a callback
  function; and reimplement recursive aggregation and clear in terms
  of this function.

> For our use cases, we generally don't both output a statistic and
> its aggregation from the kernel; either we sum up things in the
> kernel (e.g. over a bunch of per-cpu or per-memcg counters) and only
> have the result statistic, or we expect user-space to sum up the
> data if it's interested. The tabular form makes it pretty easy to do
> so (i.e. you can use awk(1) to sum all of the per-cpu NMI counters).

Yep, the above "not create a dentry" flag would handle the case where
you sum things up in the kernel because the more fine-grained
counters would be overwhelming.

Paolo