[RFC PATCH 5/7] bcache: limit bcache btree node cache memory consumption by I/O throttle

Now that most of the obvious deadlock and panic problems in bcache are
fixed, bcache can survive under very high I/O load, until it consumes
all system memory for its btree node cache and the system hangs or
panics.

The bcache btree node cache keeps copies of on-SSD btree nodes in
memory. When the btree is indexed to determine whether a data block is
cached or not, the in-memory search is much faster than reading the
btree nodes from the SSD.

Before this patch there was no limit on btree node cache memory, only
a slab shrinker callback registered with the slab memory manager. If
I/O requests arrive faster than the kernel memory management code can
shrink the btree node cache, bcache may consume all available system
memory for its btree node cache and hang or panic the whole system. On
high performance machines with many CPU cores, large memory and large
SSD capacity, the whole system is often observed to hang or panic
after 12+ hours of heavy small random I/O load (e.g. 30K IOPS of
512-byte random reads and writes).

This patch limits bcache btree node cache memory consumption by I/O
throttling. The idea is the following (a condensed sketch of the
resulting control flow is given after this list):
1) Add a kernel thread c->btree_cache_shrink_thread to shrink the
   in-memory cached btree nodes.
2) Add a threshold c->btree_cache_threshold to limit the number of
   in-memory cached btree nodes; when c->btree_cache_used reaches the
   threshold, wake up the shrink kernel thread to shrink the in-memory
   cached btree nodes.
3) In the shrink kernel thread, call __bch_mca_scan() with the
   reap_flush parameter set to true, so that all candidate btree
   nodes, clean and dirty alike (dirty nodes are flushed to SSD before
   being reaped), are shrunk.
4) In the shrink kernel thread's main loop, shrink 100 btree nodes per
   call to __bch_mca_scan(). Inside __bch_mca_scan(), c->bucket_lock
   is held while all 100 btree nodes are shrunk. The shrinking
   continues until the number of in-memory cached btree nodes drops
   below
	c->btree_cache_threshold - MCA_SHRINK_DEFAULT_HYSTERESIS
   MCA_SHRINK_DEFAULT_HYSTERESIS prevents the shrink kernel thread
   from being woken up too frequently; its default value is 1000.
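
For reference, here is a condensed sketch of the shrink thread's main
loop, simplified from the patch further down (the CACHE_SET_IO_DISABLE
checks, the shrinker_disabled/btree_cache_alloc_lock early-outs and
the explicit task state handling are omitted for readability):

	static int bch_mca_shrink_thread(void *arg)
	{
		struct cache_set *c = arg;

		while (!kthread_should_stop()) {
			/* Below the threshold: sleep and re-check later */
			if (c->btree_cache_used < c->btree_cache_threshold) {
				schedule_timeout_interruptible(30 * HZ);
				continue;
			}

			/*
			 * Shrink in batches of 100 nodes until the cache
			 * drops below threshold - hysteresis.
			 */
			while (c->btree_cache_used +
			       MCA_SHRINK_DEFAULT_HYSTERESIS >=
			       c->btree_cache_threshold) {
				struct shrink_control sc = {
					.gfp_mask   = GFP_KERNEL,
					.nr_to_scan = 100 * c->btree_pages,
				};

				__bch_mca_scan(&c->shrink, &sc, true);
				cond_resched();
			}
		}
		return 0;
	}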

The I/O throttling happens while __bch_mca_scan() is called in the
while-loop inside the shrink kernel thread's main loop.
- Every time __bch_mca_scan() is called, 100 btree nodes are about to
  be shrunk, and c->bucket_lock is held for the whole time
  __bch_mca_scan() is shrinking them.
- When a write request arrives, bcache needs to allocate SSD space for
  the cached data with c->bucket_lock held. If __bch_mca_scan() is
  executing to shrink btree node memory, the allocation has to wait
  for c->bucket_lock.
- Allocating an in-memory btree node for a new I/O request also
  requires the c->bucket_lock mutex. If __bch_mca_scan() is running,
  the new I/O request has to wait until the above 100 btree nodes are
  shrunk before it gets a chance to compete for c->bucket_lock.
Once c->bucket_lock is acquired inside the shrink kernel thread, all
other I/Os have to wait until the 100 in-memory cached btree nodes are
shrunk. The I/O requests are thus throttled by the shrink kernel
thread (see the schematic below) until the number of in-memory cached
btree nodes is less than
	c->btree_cache_threshold - MCA_SHRINK_DEFAULT_HYSTERESIS
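
To make the serialization concrete, here is a schematic timeline of
the contention (not real code; the right-hand column stands for any of
the allocation paths described above):

	shrink kernel thread                 I/O request path
	--------------------                 ----------------
	mutex_lock(&c->bucket_lock)
	__bch_mca_scan() reaps 100 nodes     mutex_lock(&c->bucket_lock)
	...                                  /* sleeps on the mutex */
	mutex_unlock(&c->bucket_lock)        /* wakes up, allocates a
	                                        bucket or btree node */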

This is a simple but effective method of limiting bcache btree node
cache memory consumption by I/O throttling.

By default c->btree_cache_threshold is 15000; with a 2MB btree node
size (the default bucket size) that is around 30GB. That is, once the
bcache btree node cache occupies around 30GB of memory, the shrink
kernel thread wakes up and starts to shrink the in-memory btree node
cache.
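
For reference, the arithmetic behind the default (reading the 2MB node
size above as 2 MiB):

	15000 nodes * 2 MiB/node = 30000 MiB ~= 29.3 GiB

which is the "around 30GB" figure quoted above.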

A 30GB in-memory btree node cache is already quite large, so this is a
reasonable default threshold. If people report in the future that they
want to dedicate much more memory to the bcache in-memory btree node
cache, a sysfs interface for this configuration can be added later.

Signed-off-by: Coly Li <colyli@xxxxxxx>
---
 drivers/md/bcache/bcache.h |  3 +++
 drivers/md/bcache/btree.c  | 63 ++++++++++++++++++++++++++++++++++++++++++++--
 drivers/md/bcache/btree.h  |  3 +++
 drivers/md/bcache/super.c  |  3 +++
 4 files changed, 70 insertions(+), 2 deletions(-)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index def15a6b4f7b..cff62084271b 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -589,6 +589,9 @@ struct cache_set {
 	struct task_struct	*btree_cache_alloc_lock;
 	spinlock_t		btree_cannibalize_lock;
 
+	unsigned int		btree_cache_threshold;
+	struct task_struct	*btree_cache_shrink_thread;
+
 	/*
 	 * When we free a btree node, we increment the gen of the bucket the
 	 * node is in - but we can't rewrite the prios and gens until we
diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index b37405aedf6e..ada17113482f 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -835,13 +835,64 @@ void bch_btree_cache_free(struct cache_set *c)
 	mutex_unlock(&c->bucket_lock);
 }
 
+#define MCA_SHRINK_DEFAULT_HYSTERESIS 1000
+static int bch_mca_shrink_thread(void *arg)
+{
+
+	struct cache_set *c = arg;
+
+	while (!kthread_should_stop() &&
+	       !test_bit(CACHE_SET_IO_DISABLE, &c->flags)) {
+
+		set_current_state(TASK_INTERRUPTIBLE);
+		if (c->btree_cache_used < c->btree_cache_threshold ||
+		    c->shrinker_disabled ||
+		    c->btree_cache_alloc_lock) {
+			/* quit main loop if kthread should stop */
+			if (kthread_should_stop() ||
+			    test_bit(CACHE_SET_IO_DISABLE, &c->flags)) {
+				set_current_state(TASK_RUNNING);
+				break;
+			}
+			schedule_timeout(30*HZ);
+			continue;
+		}
+		set_current_state(TASK_RUNNING);
+
+		/* Now shrink mca cache memory */
+		while (!kthread_should_stop() &&
+		       !test_bit(CACHE_SET_IO_DISABLE, &c->flags) &&
+		       ((c->btree_cache_used + MCA_SHRINK_DEFAULT_HYSTERESIS) >=
+			c->btree_cache_threshold)) {
+				struct shrink_control sc;
+
+				sc.gfp_mask = GFP_KERNEL;
+				sc.nr_to_scan = 100 * c->btree_pages;
+				__bch_mca_scan(&c->shrink, &sc, true);
+				cond_resched();
+		} /* shrink loop done */
+	} /* kthread loop done */
+
+	wait_for_kthread_stop();
+	return 0;
+}
+
 int bch_btree_cache_alloc(struct cache_set *c)
 {
 	unsigned int i;
 
-	for (i = 0; i < mca_reserve(c); i++)
-		if (!mca_bucket_alloc(c, &ZERO_KEY, GFP_KERNEL))
+	c->btree_cache_shrink_thread =
+		kthread_create(bch_mca_shrink_thread, c, "bcache_mca_shrink");
+	if (IS_ERR_OR_NULL(c->btree_cache_shrink_thread))
+		return -ENOMEM;
+	c->btree_cache_threshold = BTREE_CACHE_THRESHOLD_DEFAULT;
+
+	for (i = 0; i < mca_reserve(c); i++) {
+		if (!mca_bucket_alloc(c, &ZERO_KEY, GFP_KERNEL)) {
+			kthread_stop(c->btree_cache_shrink_thread);
 			return -ENOMEM;
+		}
+	}
 
 	list_splice_init(&c->btree_cache,
 			 &c->btree_cache_freeable);
@@ -961,6 +1012,14 @@ static struct btree *mca_alloc(struct cache_set *c, struct btree_op *op,
 	if (mca_find(c, k))
 		return NULL;
 
+	/*
+	 * If too many btree node cache allocated, wake up
+	 * the cache shrink thread to release btree node
+	 * cache memory.
+	 */
+	if (c->btree_cache_used >= c->btree_cache_threshold)
+		wake_up_process(c->btree_cache_shrink_thread);
+
 	/* btree_free() doesn't free memory; it sticks the node on the end of
 	 * the list. Check if there's any freed nodes there:
 	 */
diff --git a/drivers/md/bcache/btree.h b/drivers/md/bcache/btree.h
index f4dcca449391..3b1a423fb593 100644
--- a/drivers/md/bcache/btree.h
+++ b/drivers/md/bcache/btree.h
@@ -102,6 +102,9 @@
 #include "bset.h"
 #include "debug.h"
 
+/* For 2MB bucket, 15000 is around 30GB memory */
+#define BTREE_CACHE_THRESHOLD_DEFAULT 15000
+
 struct btree_write {
 	atomic_t		*journal;
 
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 047d9881f529..dc9455ae81b9 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -1662,6 +1662,9 @@ static void cache_set_flush(struct closure *cl)
 	if (!IS_ERR_OR_NULL(c->gc_thread))
 		kthread_stop(c->gc_thread);
 
+	if (!IS_ERR_OR_NULL(c->btree_cache_shrink_thread))
+		kthread_stop(c->btree_cache_shrink_thread);
+
 	if (!IS_ERR_OR_NULL(c->root))
 		list_add(&c->root->list, &c->btree_cache);
 
-- 
2.16.4



