Hello,

On Wed, May 03, 2023 at 02:07:26PM -0400, Johannes Weiner wrote:
...
> > * Because tracking starts when the script starts running, it doesn't know
> >   anything which has happened up to that point, so you gotta pay attention to
> >   handling e.g. frees which don't match allocs. It's kinda annoying
> >   but not a huge problem usually. There are ways to build in BPF progs into
> >   the kernel and load it early but I haven't experimented with it yet
> >   personally.
>
> Yeah, early loading is definitely important, especially before module
> loading etc.
>
> One common usecase is that we see a machine in the wild with a high
> amount of kernel memory disappearing somewhere that isn't voluntarily
> reported in vmstat/meminfo. Reproducing it isn't always
> practical. Something that records early and always (with acceptable
> runtime overhead) would be the holy grail.
>
> Matching allocs to frees is doable using the pfn as the key for pages,
> and virtual addresses for slab objects.
>
> The biggest issue I had when I tried with bpf was losing updates to
> the map. IIRC there is some trylocking going on to avoid deadlocks
> from nested contexts (alloc interrupted, interrupt frees). It doesn't
> sound like an unsolvable problem, though.

(cc'ing Alexei and Andrii)

This is the same thing that I hit with sched_ext. BPF plugged it for
struct_ops but I wonder whether it can be done for specific maps / progs -
ie. just declare that a given map or prog is not to be accessed from NMI and
bypass the trylock deadlock avoidance mechanism. But, yeah, this should be
addressed from BPF side.

> Another minor thing was the stack trace map exploding on a basically
> infinite number of unique interrupt stacks. This could probably also
> be solved by extending the trace extraction API to cut the frames off
> at the context switch boundary.
>
> Taking a step back though, given the multitude of allocation sites in
> the kernel, it's a bit odd that the only accounting we do is the tiny
> fraction of voluntary vmstat/meminfo reporting. We try to cover the
> biggest consumers with this of course, but it's always going to be
> incomplete and is maintenance overhead too. There are on average
> several gigabytes in unknown memory (total - known vmstats) on our
> machines. It's difficult to detect regressions easily. And it's per
> definition the unexpected cornercases that are the trickiest to track
> down. So it might be doable with BPF, but it does feel like the kernel
> should do a better job of tracking out of the box and without
> requiring too much plumbing and somewhat fragile kernel allocation API
> tracking and probing from userspace.

Yeah, easy / default visibility argument does make sense to me.

Thanks.

--
tejun
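
As a footnote to the alloc/free matching discussed in the thread, here is a
minimal userspace sketch in Python of the bookkeeping only. It is not BPF
code and not what either poster ran. The class name `AllocTracker`, the
`("page", pfn)` / `("slab", addr)` key tuples and the call-site strings are
all hypothetical, chosen for illustration. It assumes alloc and free events
arrive as plain function calls, and it tolerates frees whose allocation
happened before tracking started.

```python
from collections import defaultdict


class AllocTracker:
    """Toy model of alloc/free matching (hypothetical, illustration only)."""

    def __init__(self):
        # key -> (size, site). Keys are ("page", pfn) or ("slab", address),
        # mirroring the pfn / virtual-address split suggested in the thread.
        self.live = {}
        # Outstanding bytes attributed to each allocation site.
        self.bytes_by_site = defaultdict(int)
        # Frees with no recorded alloc (e.g. allocated before tracking began).
        self.unmatched_frees = 0
        # Allocs at a key we still considered live, i.e. a free we never saw.
        self.missed_frees = 0

    def on_alloc(self, key, size, site):
        stale = self.live.pop(key, None)
        if stale is not None:
            # The free event for the previous owner was lost; drop its bytes.
            self.bytes_by_site[stale[1]] -= stale[0]
            self.missed_frees += 1
        self.live[key] = (size, site)
        self.bytes_by_site[site] += size

    def on_free(self, key):
        entry = self.live.pop(key, None)
        if entry is None:
            # Nothing to subtract; just count it instead of going negative.
            self.unmatched_frees += 1
            return
        size, site = entry
        self.bytes_by_site[site] -= size
```

Usage, again with made-up values: feed it `on_alloc(("page", 7), 4096,
"site_a")` and `on_free(("page", 7))`, then read `bytes_by_site` for the
per-site outstanding totals. The two counters give a rough measure of how
many events fell outside the tracking window or were lost.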