On Tue, 7 Jan 2020, Coly Li wrote:

> Now that most of the obvious deadlock and panic problems in bcache are
> fixed, bcache can survive under very high I/O load until ... it consumes
> all system memory for the bcache btree node cache, and the system hangs
> or panics.
>
> The bcache btree node cache is used to cache bcache btree nodes from the
> SSD in memory; when determining whether a data block is cached or not by
> indexing the btree, an in-memory search is much faster.
>
> Before this patch, there was no limit on btree node cache memory, just a
> slab shrinker callback registered with the slab memory manager. If I/O
> requests come in faster than the kernel memory management code can shrink
> the bcache btree node cache memory, it is possible for bcache to consume
> all available system memory for its btree node cache and make the whole
> system hang or panic. On high performance machines with many CPU cores,
> large memory size and large SSD capacity, the whole system is often
> observed to hang or panic after 12+ hours of heavy small random I/O load
> (e.g. 30K IOPS with 512 byte random reads and writes).
>
> This patch tries to limit bcache btree node cache memory consumption by
> I/O throttling. The idea is:
> 1) Add a kernel thread c->btree_cache_shrink_thread to shrink in-memory
>    cached btree nodes.
> 2) Add a threshold c->btree_cache_threshold to limit the number of
>    in-memory cached btree nodes; if c->btree_cache_used reaches the
>    threshold, wake up the shrink kernel thread to shrink in-memory cached
>    btree nodes.
> 3) In the shrink kernel thread, call __bch_mca_scan() with the reap_flush
>    parameter set to true, so that all candidate clean and dirty (flushed
>    to SSD before reaping) btree nodes will be shrunk.
> 4) In the shrink kernel thread main loop, try to shrink 100 btree nodes
>    per call to __bch_mca_scan(). Inside __bch_mca_scan(), c->bucket_lock
>    is held while all 100 btree nodes are shrunk. The shrinking only stops
>    once the number of in-memory cached btree nodes is less than
>    c->btree_cache_threshold - MCA_SHRINK_DEFAULT_HYSTERESIS.
>    MCA_SHRINK_DEFAULT_HYSTERESIS is used to avoid waking the shrink
>    kernel thread too frequently; by default its value is 1000.
>
> The I/O throttling happens when __bch_mca_scan() is called in the
> while-loop inside the shrink kernel thread main loop.
> - Every time __bch_mca_scan() is called, about 100 btree nodes are
>   shrunk. While __bch_mca_scan() is shrinking these btree nodes,
>   c->bucket_lock is held.
> - When a write request comes in, bcache needs to allocate SSD space for
>   the cached data with c->bucket_lock held. If __bch_mca_scan() is
>   executing to shrink btree node memory, the allocation has to wait for
>   c->bucket_lock.
> - When allocating an in-memory btree node for a new I/O request, mutex
>   c->bucket_lock is also required. If __bch_mca_scan() is running, the
>   new I/O request has to wait until the above 100 btree nodes are shrunk
>   before it gets a chance to compete for c->bucket_lock.
> Once c->bucket_lock is acquired inside the shrink kernel thread, all
> other I/Os have to wait until the 100 in-memory cached btree nodes are
> shrunk. The I/O requests are thus throttled by the shrink kernel thread
> until the number of in-memory cached btree nodes is less than
> c->btree_cache_threshold - MCA_SHRINK_DEFAULT_HYSTERESIS.
>
> This is a simple but working method to limit bcache btree node cache
> memory consumption by I/O throttling.
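
[ For a quick restatement, the throttle boils down to roughly the sketch
  below. This is only a paraphrase of the description, not the patch code;
  reap_batch() is a made-up name for illustration (in the patch the work is
  done by __bch_mca_scan()): ]

	/* mca_alloc() path: wake the shrinker once the threshold is reached */
	if (c->btree_cache_used >= c->btree_cache_threshold)
		wake_up_process(c->btree_cache_shrink_thread);

	/*
	 * Shrink thread: reap ~100 btree nodes per pass, holding
	 * c->bucket_lock for each pass, until usage drops below
	 * threshold - hysteresis.
	 */
	while (c->btree_cache_used + MCA_SHRINK_DEFAULT_HYSTERESIS >=
	       c->btree_cache_threshold)
		reap_batch(c);	/* hypothetical wrapper around __bch_mca_scan() */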
>
> By default c->btree_cache_threshold is 15000; if the btree node size is
> 2MB (the default bucket size), that is around 30GB. It means that if the
> bcache btree node cache occupies around 30GB of memory, the shrink kernel
> thread will wake up and start to shrink the bcache in-memory btree node
> cache.
>
> A 30GB in-memory btree node cache is already big enough, so it is a
> reasonable default threshold. If there are reports in the future that
> people want to offer much more memory for the bcache in-memory btree node
> cache, a sysfs interface for such configuration will be added later.

Is there already a sysfs file that shows what is currently used by the
cache?

--
Eric Wheeler

>
> Signed-off-by: Coly Li <colyli@xxxxxxx>
> ---
>  drivers/md/bcache/bcache.h |  3 +++
>  drivers/md/bcache/btree.c  | 63 ++++++++++++++++++++++++++++++++++++++++++++--
>  drivers/md/bcache/btree.h  |  3 +++
>  drivers/md/bcache/super.c  |  3 +++
>  4 files changed, 70 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
> index def15a6b4f7b..cff62084271b 100644
> --- a/drivers/md/bcache/bcache.h
> +++ b/drivers/md/bcache/bcache.h
> @@ -589,6 +589,9 @@ struct cache_set {
>  	struct task_struct *btree_cache_alloc_lock;
>  	spinlock_t btree_cannibalize_lock;
>  
> +	unsigned int btree_cache_threshold;
> +	struct task_struct *btree_cache_shrink_thread;
> +
>  	/*
>  	 * When we free a btree node, we increment the gen of the bucket the
>  	 * node is in - but we can't rewrite the prios and gens until we
> diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
> index b37405aedf6e..ada17113482f 100644
> --- a/drivers/md/bcache/btree.c
> +++ b/drivers/md/bcache/btree.c
> @@ -835,13 +835,64 @@ void bch_btree_cache_free(struct cache_set *c)
>  	mutex_unlock(&c->bucket_lock);
>  }
>  
> +#define MCA_SHRINK_DEFAULT_HYSTERESIS 1000
> +static int bch_mca_shrink_thread(void *arg)
> +{
> +
> +	struct cache_set *c = arg;
> +
> +	while (!kthread_should_stop() &&
> +	       !test_bit(CACHE_SET_IO_DISABLE, &c->flags)) {
> +
> +		set_current_state(TASK_INTERRUPTIBLE);
> +		if (c->btree_cache_used < c->btree_cache_threshold ||
> +		    c->shrinker_disabled ||
> +		    c->btree_cache_alloc_lock) {
> +			/* quit main loop if kthread should stop */
> +			if (kthread_should_stop() ||
> +			    test_bit(CACHE_SET_IO_DISABLE, &c->flags)) {
> +				set_current_state(TASK_RUNNING);
> +				break;
> +			}
> +			schedule_timeout(30*HZ);
> +			continue;
> +		}
> +		set_current_state(TASK_RUNNING);
> +
> +		/* Now shrink mca cache memory */
> +		while (!kthread_should_stop() &&
> +		       !test_bit(CACHE_SET_IO_DISABLE, &c->flags) &&
> +		       ((c->btree_cache_used + MCA_SHRINK_DEFAULT_HYSTERESIS) >=
> +			c->btree_cache_threshold)) {
> +			struct shrink_control sc;
> +
> +			sc.gfp_mask = GFP_KERNEL;
> +			sc.nr_to_scan = 100 * c->btree_pages;
> +			__bch_mca_scan(&c->shrink, &sc, true);
> +			cond_resched();
> +		} /* shrink loop done */
> +	} /* kthread loop done */
> +
> +	wait_for_kthread_stop();
> +	return 0;
> +}
> +
>  int bch_btree_cache_alloc(struct cache_set *c)
>  {
>  	unsigned int i;
>  
> -	for (i = 0; i < mca_reserve(c); i++)
> -		if (!mca_bucket_alloc(c, &ZERO_KEY, GFP_KERNEL))
> +	c->btree_cache_shrink_thread =
> +		kthread_create(bch_mca_shrink_thread, c, "bcache_mca_shrink");
> +	if (IS_ERR_OR_NULL(c->btree_cache_shrink_thread))
> +		return -ENOMEM;
> +	c->btree_cache_threshold = BTREE_CACHE_THRESHOLD_DEFAULT;
> +
> +	for (i = 0; i < mca_reserve(c); i++) {
> +		if (!mca_bucket_alloc(c, &ZERO_KEY, GFP_KERNEL)) {
> +			kthread_stop(c->btree_cache_shrink_thread);
>  			return -ENOMEM;
> +		}
> +	}
>  
>  	list_splice_init(&c->btree_cache,
>  			 &c->btree_cache_freeable);
> @@ -961,6 +1012,14 @@ static struct btree *mca_alloc(struct cache_set *c, struct btree_op *op,
>  	if (mca_find(c, k))
>  		return NULL;
>  
> +	/*
> +	 * If too many btree node cache allocated, wake up
> +	 * the cache shrink thread to release btree node
> +	 * cache memory.
> +	 */
> +	if (c->btree_cache_used >= c->btree_cache_threshold)
> +		wake_up_process(c->btree_cache_shrink_thread);
> +
>  	/* btree_free() doesn't free memory; it sticks the node on the end of
>  	 * the list. Check if there's any freed nodes there:
>  	 */
> diff --git a/drivers/md/bcache/btree.h b/drivers/md/bcache/btree.h
> index f4dcca449391..3b1a423fb593 100644
> --- a/drivers/md/bcache/btree.h
> +++ b/drivers/md/bcache/btree.h
> @@ -102,6 +102,9 @@
>  #include "bset.h"
>  #include "debug.h"
>  
> +/* For 2MB bucket, 15000 is around 30GB memory */
> +#define BTREE_CACHE_THRESHOLD_DEFAULT 15000
> +
>  struct btree_write {
>  	atomic_t *journal;
>  
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index 047d9881f529..dc9455ae81b9 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -1662,6 +1662,9 @@ static void cache_set_flush(struct closure *cl)
>  	if (!IS_ERR_OR_NULL(c->gc_thread))
>  		kthread_stop(c->gc_thread);
>  
> +	if (!IS_ERR_OR_NULL(c->btree_cache_shrink_thread))
> +		kthread_stop(c->btree_cache_shrink_thread);
> +
>  	if (!IS_ERR_OR_NULL(c->root))
>  		list_add(&c->root->list, &c->btree_cache);
>  
> --
> 2.16.4
>
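
For the sysfs interface mentioned in the description, a minimal show/store
pair for the threshold could look roughly like the sketch below. The handler
names and the exact wiring into bcache's sysfs.c are assumptions for
illustration only, not part of the posted patch:

/* Sketch only: hypothetical sysfs handlers, not taken from the patch. */
static ssize_t show_btree_cache_threshold(struct cache_set *c, char *buf)
{
	/* current limit; c->btree_cache_used could be exposed the same way */
	return snprintf(buf, PAGE_SIZE, "%u\n", c->btree_cache_threshold);
}

static ssize_t store_btree_cache_threshold(struct cache_set *c,
					   const char *buf, size_t size)
{
	unsigned int v;

	if (kstrtouint(buf, 10, &v))
		return -EINVAL;
	/* presumably should also stay above the 1000-node hysteresis window */
	if (!v)
		return -EINVAL;
	c->btree_cache_threshold = v;
	return size;
}

Either way, exposing c->btree_cache_used read-only would answer the usage
question above even before the threshold becomes configurable.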