Re: [RFC PATCH 5/7] bcache: limit bcache btree node cache memory consumption by I/O throttle

On Tue, 7 Jan 2020, Coly Li wrote:

> Now that most of the obvious deadlock and panic problems in bcache are
> fixed, bcache can survive under very high I/O load until ... it consumes
> all system memory for the btree node cache, and the system hangs or
> panics.
> 
> The bcache btree node cache keeps btree nodes from the SSD in memory;
> when indexing the btree to determine whether a data block is cached,
> an in-memory search makes the lookup much faster.
> 
> Before this patch, there was no limit on btree node cache memory, just
> a slab shrinker callback registered with the slab memory manager. If
> I/O requests arrive faster than the kernel memory management code can
> shrink the bcache btree node cache, it is possible for bcache to consume
> all available system memory for its btree node cache and make the whole
> system hang or panic. On high performance machines with many CPU cores,
> large memory and large SSD capacity, it is often observed that the
> whole system hangs or panics after 12+ hours of heavy small random I/O
> load (e.g. 30K IOPS of 512 byte random reads and writes).
> 
> This patch tries to limit bcache btree node cache memory consumption by
> I/O throttling. The idea is:
> 1) Add a kernel thread c->btree_cache_shrink_thread to shrink in-memory
>    cached btree nodes.
> 2) Add a threshold c->btree_cache_threshold to limit the number of
>    in-memory cached btree nodes; if c->btree_cache_used reaches the
>    threshold, wake up the shrink kernel thread to shrink in-memory
>    cached btree nodes.
> 3) In the shrink kernel thread, call __bch_mca_scan() with the
>    reap_flush parameter set to true, so that all candidate clean and
>    dirty (flushed to SSD before reaping) btree nodes will be shrunk.
> 4) In the shrink kernel thread main loop, try to shrink 100 btree nodes
>    per call to __bch_mca_scan(). Inside __bch_mca_scan(), c->bucket_lock
>    is held while shrinking all 100 btree nodes. The shrinking stops once
>    the number of in-memory cached btree nodes drops below
> 	c->btree_cache_threshold - MCA_SHRINK_DEFAULT_HYSTERESIS
>    MCA_SHRINK_DEFAULT_HYSTERESIS keeps the shrink kernel thread from
>    being woken up too frequently; by default its value is 1000.
> 
> The I/O throttling happens when __bch_mca_scan() is called in the
> while-loop inside the shrink kernel thread main loop.
> - Every time __bch_mca_scan() is called, 100 btree nodes are about to
>   be shrunk. While __bch_mca_scan() shrinks these btree nodes,
>   c->bucket_lock is held.
> - When a write request arrives, bcache needs to allocate SSD space for
>   the cached data with c->bucket_lock held. If __bch_mca_scan() is
>   executing to shrink btree node memory, the allocation operation has to
>   wait for c->bucket_lock.
> - When allocating an in-memory btree node for a new I/O request, mutex
>   c->bucket_lock is also required. If __bch_mca_scan() is running, the
>   new I/O request has to wait until the above 100 btree nodes are
>   shrunk; only then can the new I/O request compete for c->bucket_lock.
> Once c->bucket_lock is acquired inside the shrink kernel thread, all
> other I/Os have to wait until the 100 in-memory cached btree nodes are
> shrunk. The I/O requests are thus throttled by the shrink kernel thread
> until the number of in-memory cached btree nodes is less than
> 	c->btree_cache_threshold - MCA_SHRINK_DEFAULT_HYSTERESIS
> 
> This is a simple but working method to limit bcache btree node cache
> memory consumption by I/O throttle.
> 
> By default c->btree_cache_threshold is 15000; if the btree node size is
> 2MB (the default bucket size), that is around 30GB. It means that if
> the bcache btree node cache occupies around 30GB of memory, the shrink
> kernel thread will wake up and start to shrink the bcache in-memory
> btree node cache.
> 
> A 30GB in-memory btree node cache is already big enough, so it is a
> reasonable default threshold. If reports in the future show that people
> want to devote much more memory to the bcache in-memory btree node
> cache, a sysfs interface will be added later for such configuration.

Is there already a sysfs that shows what is currently used by the cache?
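If not, exposing the current count next to the existing cache_set
attributes looks cheap. A hypothetical sketch in the style of
drivers/md/bcache/sysfs.c (the attribute name is made up, and the
existing SHOW() body is elided):

```c
/* Hypothetical sketch only; not part of this patch */
read_attribute(btree_cache_used);

SHOW(bch_cache_set)
{
	struct cache_set *c = container_of(kobj, struct cache_set, kobj);

	sysfs_print(btree_cache_used, "%u", c->btree_cache_used);
	/* ... existing attributes ... */
	return 0;
}
```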

--
Eric Wheeler



> 
> Signed-off-by: Coly Li <colyli@xxxxxxx>
> ---
>  drivers/md/bcache/bcache.h |  3 +++
>  drivers/md/bcache/btree.c  | 63 ++++++++++++++++++++++++++++++++++++++++++++--
>  drivers/md/bcache/btree.h  |  3 +++
>  drivers/md/bcache/super.c  |  3 +++
>  4 files changed, 70 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
> index def15a6b4f7b..cff62084271b 100644
> --- a/drivers/md/bcache/bcache.h
> +++ b/drivers/md/bcache/bcache.h
> @@ -589,6 +589,9 @@ struct cache_set {
>  	struct task_struct	*btree_cache_alloc_lock;
>  	spinlock_t		btree_cannibalize_lock;
>  
> +	unsigned int		btree_cache_threshold;
> +	struct task_struct	*btree_cache_shrink_thread;
> +
>  	/*
>  	 * When we free a btree node, we increment the gen of the bucket the
>  	 * node is in - but we can't rewrite the prios and gens until we
> diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
> index b37405aedf6e..ada17113482f 100644
> --- a/drivers/md/bcache/btree.c
> +++ b/drivers/md/bcache/btree.c
> @@ -835,13 +835,64 @@ void bch_btree_cache_free(struct cache_set *c)
>  	mutex_unlock(&c->bucket_lock);
>  }
>  
> +#define MCA_SHRINK_DEFAULT_HYSTERESIS 1000
> +static int bch_mca_shrink_thread(void *arg)
> +{
> +
> +	struct cache_set *c = arg;
> +
> +	while (!kthread_should_stop() &&
> +	       !test_bit(CACHE_SET_IO_DISABLE, &c->flags)) {
> +
> +		set_current_state(TASK_INTERRUPTIBLE);
> +		if (c->btree_cache_used < c->btree_cache_threshold ||
> +		    c->shrinker_disabled ||
> +		    c->btree_cache_alloc_lock) {
> +			/* quit main loop if kthread should stop */
> +			if (kthread_should_stop() ||
> +			    test_bit(CACHE_SET_IO_DISABLE, &c->flags)) {
> +				set_current_state(TASK_RUNNING);
> +				break;
> +			}
> +			schedule_timeout(30*HZ);
> +			continue;
> +		}
> +		set_current_state(TASK_RUNNING);
> +
> +		/* Now shrink mca cache memory */
> +		while (!kthread_should_stop() &&
> +		       !test_bit(CACHE_SET_IO_DISABLE, &c->flags) &&
> +		       ((c->btree_cache_used + MCA_SHRINK_DEFAULT_HYSTERESIS) >=
> +			c->btree_cache_threshold)) {
> +				struct shrink_control sc;
> +
> +				sc.gfp_mask = GFP_KERNEL;
> +				sc.nr_to_scan = 100 * c->btree_pages;
> +				__bch_mca_scan(&c->shrink, &sc, true);
> +				cond_resched();
> +		} /* shrink loop done */
> +	} /* kthread loop done */
> +
> +	wait_for_kthread_stop();
> +	return 0;
> +}
> +
>  int bch_btree_cache_alloc(struct cache_set *c)
>  {
>  	unsigned int i;
>  
> -	for (i = 0; i < mca_reserve(c); i++)
> -		if (!mca_bucket_alloc(c, &ZERO_KEY, GFP_KERNEL))
> +	c->btree_cache_shrink_thread =
> +		kthread_create(bch_mca_shrink_thread, c, "bcache_mca_shrink");
> +	if (IS_ERR_OR_NULL(c->btree_cache_shrink_thread))
> +		return -ENOMEM;
> +	c->btree_cache_threshold = BTREE_CACHE_THRESHOLD_DEFAULT;
> +
> +	for (i = 0; i < mca_reserve(c); i++) {
> +		if (!mca_bucket_alloc(c, &ZERO_KEY, GFP_KERNEL)) {
> +			kthread_stop(c->btree_cache_shrink_thread);
>  			return -ENOMEM;
> +		}
> +	}
>  
>  	list_splice_init(&c->btree_cache,
>  			 &c->btree_cache_freeable);
> @@ -961,6 +1012,14 @@ static struct btree *mca_alloc(struct cache_set *c, struct btree_op *op,
>  	if (mca_find(c, k))
>  		return NULL;
>  
> +	/*
> +	 * If too many btree node cache allocated, wake up
> +	 * the cache shrink thread to release btree node
> +	 * cache memory.
> +	 */
> +	if (c->btree_cache_used >= c->btree_cache_threshold)
> +		wake_up_process(c->btree_cache_shrink_thread);
> +
>  	/* btree_free() doesn't free memory; it sticks the node on the end of
>  	 * the list. Check if there's any freed nodes there:
>  	 */
> diff --git a/drivers/md/bcache/btree.h b/drivers/md/bcache/btree.h
> index f4dcca449391..3b1a423fb593 100644
> --- a/drivers/md/bcache/btree.h
> +++ b/drivers/md/bcache/btree.h
> @@ -102,6 +102,9 @@
>  #include "bset.h"
>  #include "debug.h"
>  
> +/* For 2MB bucket, 15000 is around 30GB memory */
> +#define BTREE_CACHE_THRESHOLD_DEFAULT 15000
> +
>  struct btree_write {
>  	atomic_t		*journal;
>  
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index 047d9881f529..dc9455ae81b9 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -1662,6 +1662,9 @@ static void cache_set_flush(struct closure *cl)
>  	if (!IS_ERR_OR_NULL(c->gc_thread))
>  		kthread_stop(c->gc_thread);
>  
> +	if (!IS_ERR_OR_NULL(c->btree_cache_shrink_thread))
> +		kthread_stop(c->btree_cache_shrink_thread);
> +
>  	if (!IS_ERR_OR_NULL(c->root))
>  		list_add(&c->root->list, &c->btree_cache);
>  
> -- 
> 2.16.4
> 
> 


