[RFC PATCH] bcache: avoid unnecessary btree nodes flushing in btree_flush_write()

colyli@xxxxxxx · Sun, 24 Nov 2019 16:29:06 +0800

From: Coly Li <colyli@xxxxxxx>

Commit 91be66e1318f ("bcache: performance improvement for
btree_flush_write()") was an effort to flushing btree node with oldest
btree node faster in following methods,
- Only iterate dirty btree nodes in c->btree_cache, avoid scanning a lot
  of clean btree nodes.
- Take c->btree_cache as a LRU-like list, aggressively flushing all
  dirty nodes from tail of c->btree_cache util the btree node with
  oldest journal entry is flushed. This is to reduce the time of holding
  c->bucket_lock.

Guoju Fang and Shuang Li reported that they observe unexptected extra
write I/Os on cache device after applying Commit 91be66e1318f. Guoju
Fang provideed more detailed diagnose information that the aggressive
btree nodes flushing may cause 10x more btree nodes to flush in his
workload. He points out when system memory is large enough to hold all
btree nodes in memory, c->btree_cache is not a LRU-like list any more.
Then the btree node with oldest journal entry is very probably not-
close to the tail of c->btree_cache list. In such situation much more
dirty btree nodes will be aggressively flushed before the target node
is flushed. When slow SATA SSD is used as cache device, such over-
aggressive flushing behavior will cause performance regression.

This patch tries to fix the unnecessary btree nodes flushhing by the
following methods,
- Get oldest journal entry pointer from c->journal.pin.front.
- Iterate c->btree_cache list from tail, to chck whether journal
  entry of the btree node (a.k.a btree_current_write(b)->journal) is
  equal to the address of previously fetched oldest journal entry
  pointer. If matches, quit the list iteration and flush this btree
  node.
- Still try to do aggressive btree node flush but more restrainedly.
  Before encounter the btree node with oldest journal entry, if a
  dirty btree node's journal entry is very close to the oldest journal
  entry (index distance < BTREE_FLUSH_NR), also flush it.

Now the extra unnecessary btree nodes flushing can be avoided, and
the best btree nodes can be found in O(n) time relative to how large
c->btree_cache list is.

Reported-by: Guoju Fang <fangguoju@xxxxxxxxx>
Reported-by: Shuang Li <psymon@xxxxxxxxxxxxx>
Signed-off-by: Coly Li <colyli@xxxxxxx>
---
 drivers/md/bcache/journal.c | 43 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 42 insertions(+), 1 deletion(-)

diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index be2a2a201603..43925f7afc3f 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -417,10 +417,15 @@ int bch_journal_replay(struct cache_set *s, struct list_head *list)
 
 /* Journalling */
 
+#define nr_to_fifo_front(front_p, p, mask)	(((p) - (front_p)) & (mask))
+
 static void btree_flush_write(struct cache_set *c)
 {
 	struct btree *b, *t, *btree_nodes[BTREE_FLUSH_NR];
 	unsigned int i, n;
+	unsigned int oldest_found;
+	atomic_t *fifo_front_p;
+	size_t mask;
 
 	if (c->journal.btree_flushing)
 		return;
@@ -435,7 +440,14 @@ static void btree_flush_write(struct cache_set *c)
 
 	atomic_long_inc(&c->flush_write);
 	memset(btree_nodes, 0, sizeof(btree_nodes));
+
+	spin_lock(&c->journal.lock);
+	fifo_front_p = &fifo_front(&c->journal.pin);
+	spin_unlock(&c->journal.lock);
+	mask = c->journal.pin.mask;
+
 	n = 0;
+	oldest_found = 0;
 
 	mutex_lock(&c->bucket_lock);
 	list_for_each_entry_safe_reverse(b, t, &c->btree_cache, list) {
@@ -454,16 +466,45 @@ static void btree_flush_write(struct cache_set *c)
 			continue;
 		}
 
+		/*
+		 * To avoid c->journal.lock inside the loop, fifo_front_p and
+		 * mask are fetched before the loop, and the distance to fifo
+		 * front is calculated by nr_to_fifo_front().
+		 * NOTE: the real journal.pin.front might be increased, it is
+		 * possible that journal.pin.front goes beyond (fifo_front_p +
+		 * BTREE_FLUSH_NR) and nothing selected in the loop. But it
+		 * doesn't hurt because the oldest journal entry is reclaimed
+		 * already.
+		 */
+		if (nr_to_fifo_front(btree_current_write(b)->journal,
+				     fifo_front_p,
+				     mask) >=
+		    BTREE_FLUSH_NR) {
+			mutex_unlock(&b->write_lock);
+			continue;
+		}
+
+		/* For debug purpose in RFC patch */
+		if (btree_current_write(b)->journal == fifo_front_p) {
+			oldest_found = 1;
+		}
+
 		set_btree_node_journal_flush(b);
 
 		mutex_unlock(&b->write_lock);
 
 		btree_nodes[n++] = b;
-		if (n == BTREE_FLUSH_NR)
+		if (n == BTREE_FLUSH_NR || oldest_found)
 			break;
 	}
 	mutex_unlock(&c->bucket_lock);
 
+	/* For debug purpose in RFC patch */
+	if (oldest_found)
+		pr_info_ratelimited("oldest journal entry found, %d bnode to flush.", n);
+	else
+		pr_info_ratelimited"oldest journal entry not found, %d bnode to flush.", n);
+
 	for (i = 0; i < n; i++) {
 		b = btree_nodes[i];
 		if (!b) {
-- 
2.16.4