On Thu, Sep 23, 2021 at 3:44 PM Marco Elver <elver@xxxxxxxxxx> wrote: > > On Thu, 23 Sept 2021 at 15:24, Alexander Potapenko <glider@xxxxxxxxxx> wrote: > > > > On Thu, Sep 23, 2021 at 1:19 PM Dmitry Vyukov <dvyukov@xxxxxxxxxx> wrote: > > > > > > On Thu, 23 Sept 2021 at 12:48, Marco Elver <elver@xxxxxxxxxx> wrote: > > > > > > > > One of KFENCE's main design principles is that with increasing uptime, > > > > allocation coverage increases sufficiently to detect previously > > > > undetected bugs. > > > > > > > > We have observed that frequent long-lived allocations of the same > > > > source (e.g. pagecache) tend to permanently fill up the KFENCE pool > > > > with increasing system uptime, thus breaking the above requirement. > > > > The workaround thus far had been increasing the sample interval and/or > > > > increasing the KFENCE pool size, but is no reliable solution. > > > > > > > > To ensure diverse coverage of allocations, limit currently covered > > > > allocations of the same source once pool utilization reaches 75% > > > > (configurable via `kfence.skip_covered_thresh`) or above. The effect is > > > > retaining reasonable allocation coverage when the pool is close to full. > > > > > > > > A side-effect is that this also limits frequent long-lived allocations > > > > of the same source filling up the pool permanently. > > > > > > > > Uniqueness of an allocation for coverage purposes is based on its > > > > (partial) allocation stack trace (the source). A Counting Bloom filter > > > > is used to check if an allocation is covered; if the allocation is > > > > currently covered, the allocation is skipped by KFENCE. > > > > > > > > Testing was done using: > > > > > > > > (a) a synthetic workload that performs frequent long-lived > > > > allocations (default config values; sample_interval=1; > > > > num_objects=63), and > > > > > > > > (b) normal desktop workloads on an otherwise idle machine where > > > > the problem was first reported after a few days of uptime > > > > (default config values). > > > > > > > > In both test cases the sampled allocation rate no longer drops to zero > > > > at any point. In the case of (b) we observe (after 2 days uptime) 15% > > > > unique allocations in the pool, 77% pool utilization, with 20% "skipped > > > > allocations (covered)". > > > > > > > > Signed-off-by: Marco Elver <elver@xxxxxxxxxx> > > > > > > Reviewed-by: Dmitry Vyukov <dvyukov@xxxxxxxxxx> > > Acked-by: Alexander Potapenko <glider@xxxxxxxxxx> > > Thank you both! > > > > > --- > > > > v3: > > > > * Remove unneeded !alloc_stack_hash checks. > > > > * Remove unneeded meta->alloc_stack_hash=0 in kfence_guarded_free(). > > > > > > > > v2: > > > > * Switch to counting bloom filter to guarantee currently covered > > > > allocations being skipped. > > > > * Use a module param for skip_covered threshold. > > > > * Use kfence pool address as hash entropy. > > > > * Use filter_irq_stacks(). > > > > --- > > > > mm/kfence/core.c | 103 ++++++++++++++++++++++++++++++++++++++++++++- > > > > mm/kfence/kfence.h | 2 + > > > > 2 files changed, 103 insertions(+), 2 deletions(-) > > > > > > > > diff --git a/mm/kfence/core.c b/mm/kfence/core.c > > > > index db01814f8ff0..58a0f6f1acc5 100644 > > > > --- a/mm/kfence/core.c > > > > +++ b/mm/kfence/core.c > > > > @@ -11,11 +11,13 @@ > > > > #include <linux/bug.h> > > > > #include <linux/debugfs.h> > > > > #include <linux/irq_work.h> > > > > +#include <linux/jhash.h> > > > > #include <linux/kcsan-checks.h> > > > > #include <linux/kfence.h> > > > > #include <linux/kmemleak.h> > > > > #include <linux/list.h> > > > > #include <linux/lockdep.h> > > > > +#include <linux/log2.h> > > > > #include <linux/memblock.h> > > > > #include <linux/moduleparam.h> > > > > #include <linux/random.h> > > > > @@ -82,6 +84,10 @@ static const struct kernel_param_ops sample_interval_param_ops = { > > > > }; > > > > module_param_cb(sample_interval, &sample_interval_param_ops, &kfence_sample_interval, 0600); > > > > > > > > +/* Pool usage% threshold when currently covered allocations are skipped. */ > > > > +static unsigned long kfence_skip_covered_thresh __read_mostly = 75; > > > > +module_param_named(skip_covered_thresh, kfence_skip_covered_thresh, ulong, 0644); > > > > + > > > > /* The pool of pages used for guard pages and objects. */ > > > > char *__kfence_pool __ro_after_init; > > > > EXPORT_SYMBOL(__kfence_pool); /* Export for test modules. */ > > > > @@ -105,6 +111,25 @@ DEFINE_STATIC_KEY_FALSE(kfence_allocation_key); > > > > /* Gates the allocation, ensuring only one succeeds in a given period. */ > > > > atomic_t kfence_allocation_gate = ATOMIC_INIT(1); > > > > > > > > +/* > > > > + * A Counting Bloom filter of allocation coverage: limits currently covered > > > > + * allocations of the same source filling up the pool. > > > > + * > > > > + * Assuming a range of 15%-85% unique allocations in the pool at any point in > > > > Where do these 85% come from? > > An imaginary worst case, just to illustrate the range of the false > positive probabilities (in the case of 85% it'd be 0.33). I expect > unique allocations to be around 10-15% on a freshly booted system (on > my real-system-experiment it stayed below 15%), but other workloads > may produce other unique allocations%. > > > > > + * time, the below parameters provide a probablity of 0.02-0.33 for false > > > > + * positive hits respectively: > > > > + * > > > > + * P(alloc_traces) = (1 - e^(-HNUM * (alloc_traces / SIZE)) ^ HNUM > > > > + */ > > > > +#define ALLOC_COVERED_HNUM 2 > > > > +#define ALLOC_COVERED_SIZE (1 << (const_ilog2(CONFIG_KFENCE_NUM_OBJECTS) + 2)) > > > > +#define ALLOC_COVERED_HNEXT(h) (1664525 * (h) + 1013904223) > > > > Unless we are planning to change these primes, can you use > > next_pseudo_random32() instead? > > I'm worried about next_pseudo_random32() changing their implementation > to longer be deterministic or change in other ways that break our > usecase. In this case we want pseudorandomness, but we're not > implementing a PRNG. > > Open-coding the constants (given they are from "Numerical Recipes") is > more reliable and doesn't introduce unwanted reliance on > next_pseudo_random32()'s behaviour. Okay, fair enough. > > Thanks, > -- Marco -- Alexander Potapenko Software Engineer Google Germany GmbH Erika-Mann-Straße, 33 80636 München Geschäftsführer: Paul Manicle, Halimah DeLaine Prado Registergericht und -nummer: Hamburg, HRB 86891 Sitz der Gesellschaft: Hamburg