Re: [RFC PATCH v2 04/16] bcache: fix journal deadlock during jouranl replay

Coly Li <colyli@xxxxxxx> · Tue, 23 Apr 2019 15:07:41 +0800

On 2019/4/23 2:59 下午, Hannes Reinecke wrote:
> On 4/19/19 6:04 PM, Coly Li wrote:
>> A deadlock of bcache jouranling may happen during journal replay. Such
>> deadlock happens when,
>> - Journal space is totally full (no any free blocks) and system crashes
>>    or reboots.
>> - After reboot, the first journal entry handled by jouranl replay causes
>>    btree split and jouranl_meta() is called to write an empty jset to
>>    journal space.
>> - There is no journal space to write and journal_reclaim() fails to get
>>    any available bucket because this is the first replayed journal entry
>>    to be blocked.
>> Then the whole cache set is blocked from running.
>>
>> This patch is an effort to fix such journal replay deadlock in a simpler
>> way,
>> - Add a bool varialbe 'in_replay' in struct journal, set it to true when
>>    journal replay starts, and set it to false when journal replay
>>    completed. in_replay is initialized to be false.
>> - Reserve 6 sectors in journal bucket, do not use them in normal bcache
>>    runtime. These sectors are only permitted to use during journal
>>    replay (when c->jouranl.in_replay is true)
>>
>> Then in normal bcache runtime, journal space won't be totally full and
>> there are 6 sectors are always reserved for journal replay time. After
>> system reboots, if bch_btree_insert() in bch_journal_replay() causes
>> btree split and bch_journal_beta() gets called to require 1 sector
>> from journal buckets to write an empty jset, there are enough reserved
>> space to serve.
>>
>> The reason to reserve 6 sectors is, we should choose a number that won't
>> fix into a bucket size. If the reserved space happens to be a whole
>> bucket, more logic has to be added in journal_replay() to handle
>> journal.blocks_free with reserved spaces in journal replay time. This is
>> why 6 sectors is choosed, it is 3KB and won't be any proper block size
>> or bucket size.
>>
>> The bcache btree node size is quite large, so btree node split won't be
>> a frequent event. And when btree node split happens, new added key will
>> be insert directly into uppper level or neighbor nodes and won't go into
>> journal again, only bch_journal_meta() is called to write jset metadata
>> which occupies 1 block in journal space. If blocksize is set to 4K size,
>> reserve 6 sectors indeed is 2 blocks, so there can be two continuously
>> btree splitting happen during journal replay, this is very very rare in
>> practice. As default blocksize is set to sector size, that equals to
>> 6 blocks reserved. Contiously splitting the btree for 6 times in journal
>> replay is almost impossible, so the reserved space seems to be enough
>> in my humble opinion.
>>
>> If in future the reserved space turns out to be not enough, let's extend
>> it then.
>>
>> Signed-off-by: Coly Li <colyli@xxxxxxx>
>> ---
>>   drivers/md/bcache/journal.c | 100
>> ++++++++++++++++++++++++++++++++++++++++----
>>   drivers/md/bcache/journal.h |   4 ++
>>   2 files changed, 97 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
>> index a6deb16c15c8..c60a702f53a9 100644
>> --- a/drivers/md/bcache/journal.c
>> +++ b/drivers/md/bcache/journal.c
>> @@ -415,6 +415,8 @@ int bch_journal_replay(struct cache_set *s, struct
>> list_head *list)
>>       uint64_t start = i->j.last_seq, end = i->j.seq, n = start;
>>       struct keylist keylist;
>>   +    s->journal.in_replay = true;
>> +
>>       list_for_each_entry(i, list, list) {
>>           BUG_ON(i->pin && atomic_read(i->pin) != 1);
>>   @@ -448,6 +450,7 @@ int bch_journal_replay(struct cache_set *s,
>> struct list_head *list)
>>       pr_info("journal replay done, %i keys in %i entries, seq %llu",
>>           keys, entries, end);
>>   err:
>> +    s->journal.in_replay = false;
>>       while (!list_empty(list)) {
>>           i = list_first_entry(list, struct journal_replay, list);
>>           list_del(&i->list);
> 
> What about locking here?
> Doesn't 'in_replay' need to be guarded by a spinlock or something?

Hi Hannes,

s->journal.in_replay is only set in start up time, where the code is
sequentially executed, so it is sure on concurrent access to
s->journal.in_replay. After the cache runs, s->journal.in_replay is
always false, there is no concurrent access neither.

It is only used to tell the journal code that now it is in journal reply
and the journal-replay-reserved slots can be used. The set and check
sequence is in explicit linear order.

I will add code comments for s->journal.in_replay to make it more clear.

Thanks.

-- 

Coly Li