On Wed, Nov 06, 2019 at 04:56:25PM +0100, Paolo Bonzini wrote:
> Hi all,
>
> statsfs is a proposal for a new Linux kernel synthetic filesystem, to be
> mounted in /sys/kernel/stats, which exposes subsystem-level statistics
> in sysfs.  Reading need not be particularly lightweight, but writing
> must be fast.  Therefore, statistics are gathered at a fine-grain level
> in order to avoid locking or atomic operations, and then aggregated by
> statsfs until the desired granularity.

Wait, reading a statistic from userspace can be slow, but writing to it
from userspace has to be fast?  Or do you mean the speed is all for
reading/writing the value within the kernel?

> The first user of statsfs would be KVM, which is currently exposing its
> stats in debugfs.  However, debugfs access is now limited by the
> security lockdown patches, and in addition statsfs aims to be a
> more-or-less stable API, hence the idea of making it a separate
> filesystem and mount point.

Nice, I've had people ask about something like this for a while now.
For the most part they just dump stuff in sysfs instead (see the recent
DRM patches for people attempting to do that for debugfs values as
well.)

> A few people have already expressed interest in this.  Christian
> Borntraeger presented on the kvm_stat tool recently at KVM Forum and
> was also thinking about using some high-level API in debugfs.  Google
> has KVM patches to gather statistics in a binary format; it may be
> useful to add this kind of functionality (and some kind of
> introspection similar to what tracing does) to statsfs too in the
> future, but this is independent from the kernel API.  I'm also CCing
> Alex Williamson, in case VFIO is interested in something similar, and
> Steven Rostedt because apparently he has enough free time to write
> poetry in addition to code.
>
> There are just two concepts in statsfs, namely "values" (aka files)
> and "sources" (directories).
> A value represents a single quantity that is gathered by the statsfs
> client.  It could be the number of vmexits of a given kind, the amount
> of memory used by some data structure, the length of the longest hash
> table chain, or anything like that.
>
> Values are described by a struct like this one:
>
>     struct statsfs_value {
>         const char *name;
>         enum stat_type type;    /* STAT_TYPE_{BOOL,U64,...} */
>         u16 aggr_kind;          /* Bitmask with zero or more of
>                                  * STAT_AGGR_{MIN,MAX,SUM,...}
>                                  */
>         u16 mode;               /* File mode */
>         int offset;             /* Offset from base address
>                                  * to field containing the value
>                                  */
>     };
>
> As you can see, values are basically integers stored somewhere in a
> struct.  The statsfs_value struct also includes information on which
> operations (for example sum, min, max, average, count nonzero) it
> makes sense to expose when the values are aggregated.

What can userspace do with that info?

> Sources form the bulk of the statsfs API.  They can include two kinds
> of elements:
>
> - values as described above.  The common case is to have many values
>   with the same base address, which are represented by an array of
>   struct statsfs_value
>
> - subordinate sources
>
> Adding a subordinate source has two effects:
>
> - it creates a subdirectory for each subordinate source
>
> - for each value in the subordinate sources which has aggr_kind != 0,
>   corresponding values will be created in the parent directory too.
>   If multiple subordinate sources are backed by the same array of
>   struct statsfs_value, values from all those sources will be
>   aggregated.  That is, statsfs will compute these from the values of
>   all items in the list and show them in the parent directory.
>
> Writable values can only be written with a value of zero.  Writing
> zero to an aggregate zeroes all the corresponding values in the
> subordinate sources.
> Sources are manipulated with these four functions:
>
>     struct statsfs_source *statsfs_source_create(const char *fmt, ...);
>     void statsfs_source_add_values(struct statsfs_source *source,
>                                    struct statsfs_value *stat,
>                                    int n, void *ptr);
>     void statsfs_source_add_subordinate(struct statsfs_source *source,
>                                         struct statsfs_source *sub);
>     void statsfs_source_remove_subordinate(struct statsfs_source *source,
>                                            struct statsfs_source *sub);
>
> Sources are reference counted, and for this reason there is also a
> pair of functions in the usual style:
>
>     void statsfs_source_get(struct statsfs_source *);
>     void statsfs_source_put(struct statsfs_source *);
>
> Finally,
>
>     void statsfs_source_register(struct statsfs_source *source);
>
> lets you create a toplevel statsfs directory.
>
> As a practical example, KVM's usage of debugfs could be replaced by
> something like this:
>
>     /* Globals */
>     struct statsfs_value vcpu_stats[] = ...;
>     struct statsfs_value vm_stats[] = ...;
>     static struct statsfs_source *kvm_source;
>
>     /* On module creation */
>     kvm_source = statsfs_source_create("kvm");
>     statsfs_source_register(kvm_source);
>
>     /* On VM creation */
>     kvm->src = statsfs_source_create("%d-%d",
>                                      task_pid_nr(current), fd);
>     statsfs_source_add_values(kvm->src, vm_stats,
>                               ARRAY_SIZE(vm_stats), &kvm->stats);
>     statsfs_source_add_subordinate(kvm_source, kvm->src);
>
>     /* On vCPU creation */
>     vcpu_src = statsfs_source_create("vcpu%d", vcpu->vcpu_id);
>     statsfs_source_add_values(vcpu_src, vcpu_stats,
>                               ARRAY_SIZE(vcpu_stats), &vcpu->stats);
>     statsfs_source_add_subordinate(kvm->src, vcpu_src);
>     /*
>      * No need to keep vcpu_src around since there is no separate
>      * vCPU deletion event; rely on the refcount exclusively.
>      */
>     statsfs_source_put(vcpu_src);
>
>     /* On VM deletion */
>     statsfs_source_remove_subordinate(kvm_source, kvm->src);
>     statsfs_source_put(kvm->src);
>
>     /* On KVM exit */
>     statsfs_source_put(kvm->src);
>
> How does this look?
Where do the actual values get changed that are then reflected in the
filesystem?

I have some old notes somewhere about what people really want when it
comes to a good "statistics" datatype, which I was thinking of building
off of, but that seems independent of what you are doing here, right?
This is just exporting existing values to userspace in a semi-sane way?

Anyway, I like the idea, but what about how this is exposed to
userspace?  The criticism of sysfs for statistics is that it is too
slow to open/read/close lots of files, and that it is tough to easily
get "at this moment in time these are all the different values"
snapshots.  How will this be addressed here?

thanks,

greg k-h