On 2019/4/23 2:59 下午, Hannes Reinecke wrote: > On 4/19/19 6:04 PM, Coly Li wrote: >> A deadlock of bcache jouranling may happen during journal replay. Such >> deadlock happens when, >> - Journal space is totally full (no any free blocks) and system crashes >> or reboots. >> - After reboot, the first journal entry handled by jouranl replay causes >> btree split and jouranl_meta() is called to write an empty jset to >> journal space. >> - There is no journal space to write and journal_reclaim() fails to get >> any available bucket because this is the first replayed journal entry >> to be blocked. >> Then the whole cache set is blocked from running. >> >> This patch is an effort to fix such journal replay deadlock in a simpler >> way, >> - Add a bool varialbe 'in_replay' in struct journal, set it to true when >> journal replay starts, and set it to false when journal replay >> completed. in_replay is initialized to be false. >> - Reserve 6 sectors in journal bucket, do not use them in normal bcache >> runtime. These sectors are only permitted to use during journal >> replay (when c->jouranl.in_replay is true) >> >> Then in normal bcache runtime, journal space won't be totally full and >> there are 6 sectors are always reserved for journal replay time. After >> system reboots, if bch_btree_insert() in bch_journal_replay() causes >> btree split and bch_journal_beta() gets called to require 1 sector >> from journal buckets to write an empty jset, there are enough reserved >> space to serve. >> >> The reason to reserve 6 sectors is, we should choose a number that won't >> fix into a bucket size. If the reserved space happens to be a whole >> bucket, more logic has to be added in journal_replay() to handle >> journal.blocks_free with reserved spaces in journal replay time. This is >> why 6 sectors is choosed, it is 3KB and won't be any proper block size >> or bucket size. >> >> The bcache btree node size is quite large, so btree node split won't be >> a frequent event. And when btree node split happens, new added key will >> be insert directly into uppper level or neighbor nodes and won't go into >> journal again, only bch_journal_meta() is called to write jset metadata >> which occupies 1 block in journal space. If blocksize is set to 4K size, >> reserve 6 sectors indeed is 2 blocks, so there can be two continuously >> btree splitting happen during journal replay, this is very very rare in >> practice. As default blocksize is set to sector size, that equals to >> 6 blocks reserved. Contiously splitting the btree for 6 times in journal >> replay is almost impossible, so the reserved space seems to be enough >> in my humble opinion. >> >> If in future the reserved space turns out to be not enough, let's extend >> it then. >> >> Signed-off-by: Coly Li <colyli@xxxxxxx> >> --- >> drivers/md/bcache/journal.c | 100 >> ++++++++++++++++++++++++++++++++++++++++---- >> drivers/md/bcache/journal.h | 4 ++ >> 2 files changed, 97 insertions(+), 7 deletions(-) >> >> diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c >> index a6deb16c15c8..c60a702f53a9 100644 >> --- a/drivers/md/bcache/journal.c >> +++ b/drivers/md/bcache/journal.c >> @@ -415,6 +415,8 @@ int bch_journal_replay(struct cache_set *s, struct >> list_head *list) >> uint64_t start = i->j.last_seq, end = i->j.seq, n = start; >> struct keylist keylist; >> + s->journal.in_replay = true; >> + >> list_for_each_entry(i, list, list) { >> BUG_ON(i->pin && atomic_read(i->pin) != 1); >> @@ -448,6 +450,7 @@ int bch_journal_replay(struct cache_set *s, >> struct list_head *list) >> pr_info("journal replay done, %i keys in %i entries, seq %llu", >> keys, entries, end); >> err: >> + s->journal.in_replay = false; >> while (!list_empty(list)) { >> i = list_first_entry(list, struct journal_replay, list); >> list_del(&i->list); > > What about locking here? > Doesn't 'in_replay' need to be guarded by a spinlock or something? Hi Hannes, s->journal.in_replay is only set in start up time, where the code is sequentially executed, so it is sure on concurrent access to s->journal.in_replay. After the cache runs, s->journal.in_replay is always false, there is no concurrent access neither. It is only used to tell the journal code that now it is in journal reply and the journal-replay-reserved slots can be used. The set and check sequence is in explicit linear order. I will add code comments for s->journal.in_replay to make it more clear. Thanks. -- Coly Li