Re: [RFC PATCH v2 04/16] bcache: fix journal deadlock during jouranl replay

Hannes Reinecke <hare@xxxxxxx> · Tue, 23 Apr 2019 08:59:38 +0200

On 4/19/19 6:04 PM, Coly Li wrote:
A deadlock of bcache jouranling may happen during journal replay. Such
deadlock happens when,
- Journal space is totally full (no any free blocks) and system crashes
   or reboots.
- After reboot, the first journal entry handled by jouranl replay causes
   btree split and jouranl_meta() is called to write an empty jset to
   journal space.
- There is no journal space to write and journal_reclaim() fails to get
   any available bucket because this is the first replayed journal entry
   to be blocked.
Then the whole cache set is blocked from running.

This patch is an effort to fix such journal replay deadlock in a simpler
way,
- Add a bool varialbe 'in_replay' in struct journal, set it to true when
   journal replay starts, and set it to false when journal replay
   completed. in_replay is initialized to be false.
- Reserve 6 sectors in journal bucket, do not use them in normal bcache
   runtime. These sectors are only permitted to use during journal
   replay (when c->jouranl.in_replay is true)

Then in normal bcache runtime, journal space won't be totally full and
there are 6 sectors are always reserved for journal replay time. After
system reboots, if bch_btree_insert() in bch_journal_replay() causes
btree split and bch_journal_beta() gets called to require 1 sector
from journal buckets to write an empty jset, there are enough reserved
space to serve.

The reason to reserve 6 sectors is, we should choose a number that won't
fix into a bucket size. If the reserved space happens to be a whole
bucket, more logic has to be added in journal_replay() to handle
journal.blocks_free with reserved spaces in journal replay time. This is
why 6 sectors is choosed, it is 3KB and won't be any proper block size
or bucket size.

The bcache btree node size is quite large, so btree node split won't be
a frequent event. And when btree node split happens, new added key will
be insert directly into uppper level or neighbor nodes and won't go into
journal again, only bch_journal_meta() is called to write jset metadata
which occupies 1 block in journal space. If blocksize is set to 4K size,
reserve 6 sectors indeed is 2 blocks, so there can be two continuously
btree splitting happen during journal replay, this is very very rare in
practice. As default blocksize is set to sector size, that equals to
6 blocks reserved. Contiously splitting the btree for 6 times in journal
replay is almost impossible, so the reserved space seems to be enough
in my humble opinion.

If in future the reserved space turns out to be not enough, let's extend
it then.

Signed-off-by: Coly Li <colyli@xxxxxxx>
---
  drivers/md/bcache/journal.c | 100 ++++++++++++++++++++++++++++++++++++++++----
  drivers/md/bcache/journal.h |   4 ++
  2 files changed, 97 insertions(+), 7 deletions(-)

diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index a6deb16c15c8..c60a702f53a9 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -415,6 +415,8 @@ int bch_journal_replay(struct cache_set *s, struct list_head *list)
  	uint64_t start = i->j.last_seq, end = i->j.seq, n = start;
  	struct keylist keylist;
  
+	s->journal.in_replay = true;
+
  	list_for_each_entry(i, list, list) {
  		BUG_ON(i->pin && atomic_read(i->pin) != 1);
  
@@ -448,6 +450,7 @@ int bch_journal_replay(struct cache_set *s, struct list_head *list)
  	pr_info("journal replay done, %i keys in %i entries, seq %llu",
  		keys, entries, end);
  err:
+	s->journal.in_replay = false;
  	while (!list_empty(list)) {
  		i = list_first_entry(list, struct journal_replay, list);
  		list_del(&i->list);

What about locking here?
Doesn't 'in_replay' need to be guarded by a spinlock or something?

Cheers,

Hannes
--
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@xxxxxxx			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)