Re: [PATCH] bcache: fix for allocator and register thread race

Coly Li <colyli@xxxxxxx> · Tue, 16 Jan 2018 12:46:31 +0800

On 10/01/2018 4:51 PM, tang.junhui@xxxxxxxxxx wrote:
> From: Tang Junhui <tang.junhui@xxxxxxxxxx>
> 
> After a long run of random small-IO writes, we rebooted the machine.
> After power-on, bcache got stuck. The stacks are:
> [root@ceph153 ~]# cat /proc/2510/task/*/stack
> [<ffffffffa06b2455>] closure_sync+0x25/0x90 [bcache]
> [<ffffffffa06b6be8>] bch_journal+0x118/0x2b0 [bcache]
> [<ffffffffa06b6dc7>] bch_journal_meta+0x47/0x70 [bcache]
> [<ffffffffa06be8f7>] bch_prio_write+0x237/0x340 [bcache]
> [<ffffffffa06a8018>] bch_allocator_thread+0x3c8/0x3d0 [bcache]
> [<ffffffff810a631f>] kthread+0xcf/0xe0
> [<ffffffff8164c318>] ret_from_fork+0x58/0x90
> [<ffffffffffffffff>] 0xffffffffffffffff
> [root@ceph153 ~]# cat /proc/2038/task/*/stack
> [<ffffffffa06b1abd>] __bch_btree_map_nodes+0x12d/0x150 [bcache]
> [<ffffffffa06b1bd1>] bch_btree_insert+0xf1/0x170 [bcache]
> [<ffffffffa06b637f>] bch_journal_replay+0x13f/0x230 [bcache]
> [<ffffffffa06c75fe>] run_cache_set+0x79a/0x7c2 [bcache]
> [<ffffffffa06c0cf8>] register_bcache+0xd48/0x1310 [bcache]
> [<ffffffff812f702f>] kobj_attr_store+0xf/0x20
> [<ffffffff8125b216>] sysfs_write_file+0xc6/0x140
> [<ffffffff811dfbfd>] vfs_write+0xbd/0x1e0
> [<ffffffff811e069f>] SyS_write+0x7f/0xe0
> [<ffffffff8164c3c9>] system_call_fastpath+0x16/0x1
> The stacks show that the register thread and the allocator thread
> both got stuck while registering the cache device.
> 
> We rebooted the machine several times, and the issue always
> reproduces on this machine.
> 
> We debugged the code and found the call trace below:
> register_bcache()
>   ==>run_cache_set()
>      ==>bch_journal_replay()
>         ==>bch_btree_insert()
>            ==>__bch_btree_map_nodes()
>               ==>btree_insert_fn()
>                  ==>btree_split() // node needs to split
>                     ==>btree_check_reserve()
> In btree_check_reserve(), the code checks whether there are enough buckets
> of RESERVE_BTREE type. Since the allocator thread is not running yet, no
> buckets of RESERVE_BTREE type have been allocated, so the register thread
> waits on c->btree_cache_wait and goes to sleep.
> 
> Then the allocator thread is initialized and starts to work.
> Its call trace is below:
> bch_allocator_thread()
>   ==>bch_prio_write()
>      ==>bch_journal_meta()
>         ==>bch_journal()
>            ==>journal_wait_for_write()
> In journal_wait_for_write(), the code checks whether the journal is full
> with journal_full(). The long run of random small-IO writes has
> exhausted the journal buckets (journal.blocks_free=0).
> To release journal buckets, the allocator calls btree_flush_write()
> to flush keys to btree nodes, then waits on c->journal.wait until the
> btree node writes finish or some journal bucket space becomes available,
> and goes to sleep. But in btree_flush_write(), since
> bch_journal_replay() has not finished, no btree node has a journal
> (the condition "if (btree_current_write(b)->journal)" is never satisfied),
> so there is no btree node to flush, no journal bucket is released,
> and the allocator sleeps forever.
> 
> From the above analysis, we can see that:
> 1) The register thread waits for the allocator thread to allocate buckets
>    of RESERVE_BTREE type;
> 2) The allocator thread waits for the register thread to replay the
>    journal, so that it can flush btree nodes and get a journal bucket.
>    So both threads are stuck waiting for each other.
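
The two waits above form a cycle. A minimal userspace sketch of that
wait-for relation follows; the enum labels and in_wait_cycle() are
hypothetical names for illustration only, not bcache identifiers.

```c
#include <stdbool.h>

/* Hypothetical labels for the two threads in the report (not kernel names). */
enum waiter { REGISTER_THREAD, ALLOCATOR_THREAD, NR_WAITERS };

#define WAITS_FOR_NOBODY (-1)

/*
 * waits_for[i] names the thread that must make progress before thread i
 * can continue, or WAITS_FOR_NOBODY if thread i is not blocked.
 * Returns true if following the edges from `start` leads back to `start`,
 * i.e. `start` is part of a circular wait.
 */
static bool in_wait_cycle(const int waits_for[NR_WAITERS], int start)
{
	int cur = waits_for[start];
	int steps;

	/* A cycle through `start` needs at most NR_WAITERS hops. */
	for (steps = 0; steps < NR_WAITERS && cur != WAITS_FOR_NOBODY; steps++) {
		if (cur == start)
			return true;
		cur = waits_for[cur];
	}
	return false;
}
```

With { ALLOCATOR_THREAD, REGISTER_THREAD } as the table, both threads are
reported as being in a cycle. Once the register thread no longer depends
on the allocator (its entry becomes WAITS_FOR_NOBODY), neither is.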
> 
> Hua Rui provided a patch for me that allocates some buckets of
> RESERVE_BTREE type in advance, so the register thread can get buckets
> when a btree node splits and does not need to wait for the allocator thread.
> I tested it. It helps, and the register thread gets a step further, but
> it eventually gets stuck again. The reason is that only 8 buckets of
> RESERVE_BTREE type were allocated, and in bch_journal_replay(), after 2
> btree node splits, only 4 buckets of RESERVE_BTREE type are left. Then
> btree_check_reserve() is no longer satisfied, so the register thread goes
> to sleep again. At the same time, the allocator thread has not flushed
> enough btree nodes to release a journal bucket, so both get stuck again.
> 
> So we need to allocate more buckets of RESERVE_BTREE type in advance,
> but how many are enough?  From experience and testing, I think it should be
> as many as the number of journal buckets. I modified the code as in this
> patch, tested it on the machine, and it works.
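
To make the numbers above concrete, here is a rough userspace model of a
pre-filled reserve that nothing refills. It is only a sketch under stated
assumptions: splits_before_stall() and its parameters are hypothetical,
and the values per_split = 2 and min_free = 5 are chosen to match the
report (8 buckets, 4 left after 2 splits, then the check fails), not
taken from the bcache code.

```c
/*
 * Count how many node splits complete before the reserve check fails,
 * assuming the allocator cannot refill the reserve in the meantime.
 *
 * reserve       buckets placed in the reserve in advance
 * min_free      buckets the check wants to see before a split (assumed)
 * per_split     buckets consumed by one split (assumed)
 * splits_wanted splits that replay would need in total
 */
static unsigned int splits_before_stall(unsigned int reserve,
					unsigned int min_free,
					unsigned int per_split,
					unsigned int splits_wanted)
{
	unsigned int done = 0;

	while (done < splits_wanted &&
	       reserve >= min_free && reserve >= per_split) {
		reserve -= per_split;	/* the split takes its buckets */
		done++;
	}
	return done;
}
```

Under these assumptions a reserve of 8 stalls after 2 splits, as
observed. A much larger reserve lets more splits through, but the model
does not prove that any particular size is sufficient; the choice of the
journal bucket count comes from experience and testing, as stated above.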
> 
> This patch is based on Hua Rui’s patch, and allocates more buckets
> of RESERVE_BTREE type in advance to keep the register thread and the
> allocator thread from waiting for each other.
> 
> Signed-off-by: Hua Rui <huarui.dev@xxxxxxxxx>
> Signed-off-by: Tang Junhui <tang.junhui@xxxxxxxxxx>

Hi Junhui,

It took me some time to understand the problem and the fix :-)
The root cause is that the bcache journal does not reserve space in
advance; a journal slot is allocated when a btree node splits while a key
from the journal is being inserted into the btree. If we reserved journal
space in advance, we would need something like a commit record to make sure
the operation is atomic. That is complicated, and I doubt it is worth it for
bcache.

This fix is much simpler and it makes things work. I am OK with it.

Reviewed-by: Coly Li <colyli@xxxxxxx>

> ---
>  drivers/md/bcache/btree.c |  9 ++++++---
>  drivers/md/bcache/super.c | 12 +++++++++++-
>  2 files changed, 17 insertions(+), 4 deletions(-)
[code snipped]

Coly Li