On Wed, May 3, 2023 at 9:35 AM Tejun Heo <tj@xxxxxxxxxx> wrote:
>
> Hello, Kent.
>
> On Wed, May 03, 2023 at 04:05:08AM -0400, Kent Overstreet wrote:
> > No, we're still waiting on the tracing people to _demonstrate_, not
> > claim, that this is at all possible in a comparable way with tracing.
>
> So, we (Meta) happen to do stuff like this all the time in the fleet to hunt
> down tricky persistent problems like memory leaks, ref leaks, what-have-you.
> In recent kernels, with kprobe and BPF, our ability to debug these sorts of
> problems has improved a great deal. Below, I'm attaching a bcc script I used
> to hunt down, IIRC, a double vfree. It's not exactly for a leak but leaks
> can follow the same pattern.

Thanks for sharing, Tejun!

>
> There are of course some pros and cons to this approach:
>
> Pros:
>
> * The framework doesn't really have any runtime overhead, so we can have it
>   deployed in the entire fleet and debug wherever the problem is.

Do you mean it has no runtime overhead when disabled?
If so, do you know what the overhead is when enabled? I want to
understand if that's truly a viable solution to track all allocations
(including slab) all the time.
Thanks,
Suren.

>
> * It's fully flexible and programmable which enables non-trivial filtering
>   and summarizing to be done inside kernel w/ BPF as necessary, which is
>   pretty handy for tracking high frequency events.
>
> * BPF is pretty performant. Dedicated built-in kernel code can do better of
>   course but BPF's jit compiled code & its data structures are fast enough.
>   I don't remember any time this was a problem.
>
> Cons:
>
> * BPF has some learning curve. Also the fact that what it provides is a wide
>   open field rather than something scoped out for a specific problem can
>   make it seem a bit daunting at the beginning.
>
> * Because tracking starts when the script starts running, it doesn't know
>   anything which has happened up to that point, so you gotta pay attention
>   to handling e.g. frees which don't match allocs. It's kinda annoying
>   but not a huge problem usually. There are ways to build BPF progs into
>   the kernel and load them early but I haven't experimented with it yet
>   personally.
>
> I'm not necessarily against adding a dedicated memory debugging mechanism but
> do wonder whether the extra benefits would be enough to justify the code and
> maintenance overhead.
>
> Oh, a bit of delta but for anyone who's more interested in debugging
> problems like this: I tend to go for bcc
> (https://github.com/iovisor/bcc) for this sort of problem, while others
> prefer to write against libbpf directly or use bpftrace
> (https://github.com/iovisor/bpftrace).
>
> Thanks.
>
> #!/usr/bin/env bcc-py
>
> import bcc
> import time
> import datetime
> import argparse
> import os
> import sys
> import errno
>
> description = """
> Record vmalloc/vfrees and trigger on unmatched vfree
> """
>
> bpf_source = """
> #include <uapi/linux/ptrace.h>
> #include <linux/vmalloc.h>
>
> struct vmalloc_rec {
>         unsigned long           ptr;
>         int                     last_alloc_stkid;
>         int                     last_free_stkid;
>         int                     this_stkid;
>         bool                    allocated;
> };
>
> BPF_STACK_TRACE(stacks, 8192);
> BPF_HASH(vmallocs, unsigned long, struct vmalloc_rec, 131072);
> BPF_ARRAY(dup_free, struct vmalloc_rec, 1);
>
> int kpret_vmalloc_node_range(struct pt_regs *ctx)
> {
>         unsigned long ptr = PT_REGS_RC(ctx);
>         uint32_t zkey = 0;
>         struct vmalloc_rec rec_init = { };
>         struct vmalloc_rec *rec;
>         int stkid;
>
>         if (!ptr)
>                 return 0;
>
>         stkid = stacks.get_stackid(ctx, 0);
>
>         rec_init.ptr = ptr;
>         rec_init.last_alloc_stkid = -1;
>         rec_init.last_free_stkid = -1;
>         rec_init.this_stkid = -1;
>
>         rec = vmallocs.lookup_or_init(&ptr, &rec_init);
>         rec->allocated = true;
>         rec->last_alloc_stkid = stkid;
>         return 0;
> }
>
> int kp_vfree(struct pt_regs *ctx, const void *addr)
> {
>         unsigned long ptr = (unsigned long)addr;
>         uint32_t zkey = 0;
>         struct vmalloc_rec rec_init = { };
>         struct vmalloc_rec *rec;
>         int stkid;
>
>         stkid = stacks.get_stackid(ctx, 0);
>
>         rec_init.ptr = ptr;
>         rec_init.last_alloc_stkid = -1;
>         rec_init.last_free_stkid = -1;
>         rec_init.this_stkid = -1;
>
>         rec = vmallocs.lookup_or_init(&ptr, &rec_init);
>         if (!rec->allocated && rec->last_alloc_stkid >= 0) {
>                 rec->this_stkid = stkid;
>                 dup_free.update(&zkey, rec);
>         }
>
>         rec->allocated = false;
>         rec->last_free_stkid = stkid;
>         return 0;
> }
> """
>
> bpf = bcc.BPF(text=bpf_source)
> bpf.attach_kretprobe(event="__vmalloc_node_range", fn_name="kpret_vmalloc_node_range");
> bpf.attach_kprobe(event="vfree", fn_name="kp_vfree");
> bpf.attach_kprobe(event="vfree_atomic", fn_name="kp_vfree");
>
> stacks = bpf["stacks"]
> vmallocs = bpf["vmallocs"]
> dup_free = bpf["dup_free"]
> last_dup_free_ptr = dup_free[0].ptr
>
> def print_stack(stkid):
>     for addr in stacks.walk(stkid):
>         sym = bpf.ksym(addr)
>         print(' {}'.format(sym))
>
> def print_dup(dup):
>     print('allocated={} ptr={}'.format(dup.allocated, hex(dup.ptr)))
>     if (dup.last_alloc_stkid >= 0):
>         print('last_alloc_stack: ')
>         print_stack(dup.last_alloc_stkid)
>     if (dup.last_free_stkid >= 0):
>         print('last_free_stack: ')
>         print_stack(dup.last_free_stkid)
>     if (dup.this_stkid >= 0):
>         print('this_stack: ')
>         print_stack(dup.this_stkid)
>
> while True:
>     time.sleep(1)
>
>     if dup_free[0].ptr != last_dup_free_ptr:
>         print('\nDUP_FREE:')
>         print_dup(dup_free[0])
>         last_dup_free_ptr = dup_free[0].ptr
>
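
For anyone reading the script without a bcc setup at hand, here is a minimal
user-space sketch in plain Python of the per-pointer bookkeeping the two
probes appear to implement. It is not BPF and not part of the script above:
`VmallocTracker`, `on_alloc`, `on_free`, and the integer stack ids are
hypothetical stand-ins for the kretprobe on `__vmalloc_node_range`, the
kprobes on `vfree`/`vfree_atomic`, and `stacks.get_stackid()`. Unlike the
script, which keeps only the latest hit in a one-slot array, this sketch
keeps every hit in a list.

```python
from dataclasses import dataclass


@dataclass
class Rec:
    """Mirror of struct vmalloc_rec, minus this_stkid (reported separately)."""
    ptr: int
    allocated: bool = False
    last_alloc_stkid: int = -1   # -1 means "never seen an alloc for this ptr"
    last_free_stkid: int = -1


class VmallocTracker:
    """Hypothetical user-space model of the alloc/free matching logic."""

    def __init__(self):
        self.recs = {}        # stands in for the BPF_HASH keyed by pointer
        self.dup_frees = []   # every flagged free: (ptr, alloc, prev_free, this)

    def _rec(self, ptr):
        # Same idea as lookup_or_init(): fetch the record or create a blank one.
        return self.recs.setdefault(ptr, Rec(ptr))

    def on_alloc(self, ptr, stkid):
        if not ptr:           # failed allocation, nothing to track
            return
        rec = self._rec(ptr)
        rec.allocated = True
        rec.last_alloc_stkid = stkid

    def on_free(self, ptr, stkid):
        rec = self._rec(ptr)
        # Flag only if we saw this pointer allocated at some point and it is
        # already marked free. Frees of memory allocated before tracking
        # started have last_alloc_stkid == -1 and are silently accepted.
        if not rec.allocated and rec.last_alloc_stkid >= 0:
            self.dup_frees.append(
                (ptr, rec.last_alloc_stkid, rec.last_free_stkid, stkid))
        rec.allocated = False
        rec.last_free_stkid = stkid


if __name__ == "__main__":
    t = VmallocTracker()
    t.on_free(0x1000, 7)      # allocated before tracing began: not flagged
    t.on_alloc(0x2000, 1)
    t.on_free(0x2000, 2)
    t.on_free(0x2000, 3)      # second free of the same pointer: flagged
    for ptr, alloc, prev_free, this in t.dup_frees:
        print(hex(ptr), alloc, prev_free, this)
```

The `last_alloc_stkid >= 0` check is what handles frees that don't match any
alloc seen since the script started, which is the "Cons" point above.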