On 10/01/2018 4:51 PM, tang.junhui@xxxxxxxxxx wrote: > From: Tang Junhui <tang.junhui@xxxxxxxxxx> > > After long time run of random small IO writing, > reboot the machine, and after the machine power on, > bcache got stuck, the stack is: > [root@ceph153 ~]# cat /proc/2510/task/*/stack > [<ffffffffa06b2455>] closure_sync+0x25/0x90 [bcache] > [<ffffffffa06b6be8>] bch_journal+0x118/0x2b0 [bcache] > [<ffffffffa06b6dc7>] bch_journal_meta+0x47/0x70 [bcache] > [<ffffffffa06be8f7>] bch_prio_write+0x237/0x340 [bcache] > [<ffffffffa06a8018>] bch_allocator_thread+0x3c8/0x3d0 [bcache] > [<ffffffff810a631f>] kthread+0xcf/0xe0 > [<ffffffff8164c318>] ret_from_fork+0x58/0x90 > [<ffffffffffffffff>] 0xffffffffffffffff > [root@ceph153 ~]# cat /proc/2038/task/*/stack > [<ffffffffa06b1abd>] __bch_btree_map_nodes+0x12d/0x150 [bcache] > [<ffffffffa06b1bd1>] bch_btree_insert+0xf1/0x170 [bcache] > [<ffffffffa06b637f>] bch_journal_replay+0x13f/0x230 [bcache] > [<ffffffffa06c75fe>] run_cache_set+0x79a/0x7c2 [bcache] > [<ffffffffa06c0cf8>] register_bcache+0xd48/0x1310 [bcache] > [<ffffffff812f702f>] kobj_attr_store+0xf/0x20 > [<ffffffff8125b216>] sysfs_write_file+0xc6/0x140 > [<ffffffff811dfbfd>] vfs_write+0xbd/0x1e0 > [<ffffffff811e069f>] SyS_write+0x7f/0xe0 > [<ffffffff8164c3c9>] system_call_fastpath+0x16/0x1 > The stack shows the register thread and allocator thread > were getting stuck when registering cache device. > > we reboot the machine several times, the issue always > exsit in this machine. > > We debug the code, and found the call trace as bellow: > register_bcache() > ==>run_cache_set() > ==>bch_journal_replay() > ==>bch_btree_insert() > ==>__bch_btree_map_nodes() > ==>btree_insert_fn() > ==>btree_split() //node need split > ==>btree_check_reserve() > In btree_check_reserve(), It will check if there is enough buckets > of RESERVE_BTREE type, since allocator thread did not work yet, so > no buckets of RESERVE_BTREE type allocated, so the register thread > waits on c->btree_cache_wait, and goes to sleep. > > Then the allocator thread initialized, and goes to work, > the call trace is bellow: > bch_allocator_thread() > ==>bch_prio_write() > ==>bch_journal_meta() > ==>bch_journal() > ==>journal_wait_for_write() > In journal_wait_for_write(), It will check if journal is full by > journal_full(), but the long time random small IO writing > causes the exhaustion of journal buckets(journal.blocks_free=0), > In order to release the journal buckets, > the allocator calls btree_flush_write() to flush keys to > btree nodes, and waits on c->journal.wait until btree nodes writing over > or there has already some journal buckets space, then the allocator > thread goes to sleep. but in btree_flush_write(), since > bch_journal_replay() is not finished, so no btree nodes have journal > (condition "if (btree_current_write(b)->journal)" never satisfied), > so we got no btree node to flush, no journal bucket released, > and allocator sleep all the times. > > Through the above analysis, we can see that: > 1) Register thread wait for allocator thread to allocate buckets of > RESERVE_BTREE type; > 2) Alloctor thread wait for register thread to replay journal, so it > can flush btree nodes and get journal bucket. > then they are all got stuck by waiting for each other. > > Hua Rui provided a patch for me, by allocating some buckets of > RESERVE_BTREE type in advance, so the register thread can get bucket > when btree node splitting and no need to waiting for the allocator thread. > I tested it, it has effect, and register thread run a step forward, but > finally are still got stuck, the reason is only 8 bucket of RESERVE_BTREE > type were allocated, and in bch_journal_replay(), after 2 btree nodes > splitting, only 4 bucket of RESERVE_BTREE type left, then > btree_check_reserve() is not satisfied anymore, so it goes to sleep again, > and in the same time, alloctor thread did not flush enough btree nodes to > release a journal bucket, so they all got stuck again. > > So we need to allocate more buckets of RESERVE_BTREE type in advance, > but how much is enough? By experience and test, I think it should be > as much as journal buckets. Then I modify the code as this patch, > and test in the machine, and it works. > > This patch modified base on Hua Rui’s patch, and allocate more buckets > of RESERVE_BTREE type in advance to avoid register thread and allocate > thread going to wait for each other. > > Signed-off-by: Hua Rui <huarui.dev@xxxxxxxxx> > Signed-off-by: Tang Junhui <tang.junhui@xxxxxxxxxx> Hi Junhui, It spent some time for me to understand the problem and the fix :-) The root cause is because bcache journal does not reserve space previously, and allocates journal slot when btree node split by insert a node from journal to btree. If we reserve journal space previously, then we will need something like commit record to make sure the operation is atomic. That is complicated and I doubt whether bcache is deserved for it. This fix is much simpler and it makes things work. I am OK with it. Reviewed-by: Coly Li <colyli@xxxxxxx> > --- > drivers/md/bcache/btree.c | 9 ++++++--- > drivers/md/bcache/super.c | 12 +++++++++++- > 2 files changed, 17 insertions(+), 4 deletions(-) [code snipped] Coly Li