Re: bcache register deadlock in bcache_bucket_alloc() between bch_uuid_write and bch_prio_write

Rui Hua <flowerui@xxxxxxxxx> · Mon, 13 Feb 2017 10:23:47 +0800

Thanks for reply.

Yes, the patch you mentioned is already in my code.
There is different between this patch and my problem, the problem
which this patch solves is hanging on RESERVE_PRIO buckets alloc,
but in my environment, the allocator thread hang on RESERVE_BTREE
buckets alloc(bch_journal())

The patch calltrace:
[  405.629730] Call Trace:
[  405.630362]  [<ffffffff817a9f47>] schedule+0x37/0x90
[  405.631230]  [<ffffffffa048e4f0>] bch_bucket_alloc+0x1b0/0x670 [bcache]
[  405.632261]  [<ffffffff81103860>] ? prepare_to_wait_event+0x110/0x110
[  405.633274]  [<ffffffffa04a90d5>] bch_prio_write+0x1b5/0x390 [bcache]
[  405.634362]  [<ffffffffa048e19d>] bch_allocator_thread+0x31d/0x4c0 [bcache]

My calltrace:
[  +0.000002] Call Trace:
[  +0.000003]  [<ffffffff8163a909>] schedule+0x29/0x70
[  +0.000005]  [<ffffffffa0453c05>] closure_sync+0x25/0x90 [bcache]
[  +0.000004]  [<ffffffffa04584b8>] bch_journal+0x128/0x340 [bcache]
[  +0.000005]  [<ffffffffa0458717>] bch_journal_meta+0x47/0x70 [bcache]
[  +0.000002]  [<ffffffff8163a909>] ? schedule+0x29/0x70
[  +0.000004]  [<ffffffffa0453c05>] ? closure_sync+0x25/0x90 [bcache]
[  +0.000005]  [<ffffffffa045ef5b>] ? prio_io+0xfb/0x170 [bcache]
[  +0.000002]  [<ffffffff81638b42>] ? mutex_lock+0x12/0x2f
[  +0.000005]  [<ffffffffa0461407>] bch_prio_write+0x237/0x340 [bcache]
[  +0.000004]  [<ffffffffa0448cf0>] ? invalidate_buckets+0x9c0/0x9c0 [bcache]
[  +0.000004]  [<ffffffffa04490b8>] bch_allocator_thread+0x3c8/0x3d0 [bcache]

2017-02-12 7:49 GMT+08:00 Eric Wheeler <bcache@xxxxxxxxxxxxxxxxxx>:
> On Sat, 11 Feb 2017, Rui Hua wrote:
>
>> Hi Eric,
>> Does this problem fixed?
>> I'm hitting a deadlock which is similar to yours: I backport bcache
>> code from 4.4.24 to kernel 3.10.0-327.el7.x86_64(CentOS7.2), It has
>> been working fine many days (writeback mode, the caching device is a
>> Intel DC S3700 200G, attached 5 HDD disks).
>
> Maybe try this patch:
>
> https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=2d64cbc819b13ebd503780f39552827516f4ce4a
>
> Its the only one I can think of.  However, it might already be in 4.4.24.
> You would need to check and see if you have this or not.
>
>
> --
> Eric Wheeler
>
>
>> Today, I reboot the system  when running some XFS benchmarks over all
>> cachedevs, After the system is startup, the task of register
>> command(echo /dev/sdg > /sys/fs/bcache/register) hang, and the dmesg
>> info as follow (/dev/sdg is my S3700 SSD):
>>
>> [Feb10 20:31] INFO: task bash:2615 blocked for more than 120 seconds.
>> [  +0.000059] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>> disables this message.
>> [  +0.000059] bash            D ffff88022f86a060     0  2615   2611 0x00000080
>> [  +0.000007]  ffff88022e833b28 0000000000000082 ffff88022f985c00
>> ffff88022e833fd8
>> [  +0.000006]  ffff88022e833fd8 ffff88022e833fd8 ffff88022f985c00
>> ffff88022ddc0000
>> [  +0.000004]  00000000fffffffc ffff88022f86a000 ffff88022e833bb0
>> ffff88022f86a060
>> [  +0.000005] Call Trace:
>> [  +0.000013]  [<ffffffff8163a909>] schedule+0x29/0x70
>> [  +0.000016]  [<ffffffffa045701d>] __bch_btree_map_nodes+0x12d/0x150 [bcache]
>> [  +0.000010]  [<ffffffffa0456700>] ?
>> bch_btree_insert_check_key+0x1d0/0x1d0 [bcache]
>> [  +0.000010]  [<ffffffffa0457131>] bch_btree_insert+0xf1/0x170 [bcache]
>> [  +0.000006]  [<ffffffff810a6ae0>] ? wake_up_atomic_t+0x30/0x30
>> [  +0.000011]  [<ffffffffa045bbda>] bch_journal_replay+0x14a/0x260 [bcache]
>> [  +0.000007]  [<ffffffff8101cd69>] ? sched_clock+0x9/0x10
>> [  +0.000008]  [<ffffffff810b8a66>] ? try_to_wake_up+0x1b6/0x300
>> [  +0.000013]  [<ffffffffa04673dd>] run_cache_set+0x56d/0x950 [bcache]
>> [  +0.000012]  [<ffffffffa0468ec3>] register_bcache+0xde3/0x1370 [bcache]
>> [  +0.000008]  [<ffffffff812f4d0f>] kobj_attr_store+0xf/0x20
>> [  +0.000005]  [<ffffffff81259176>] sysfs_write_file+0xc6/0x140
>> [  +0.000006]  [<ffffffff811de5cd>] vfs_write+0xbd/0x1e0
>> [  +0.000003]  [<ffffffff811df06f>] SyS_write+0x7f/0xe0
>> [  +0.000006]  [<ffffffff81645909>] system_call_fastpath+0x16/0x1b
>> [  +0.000006] INFO: task bcache_allocato:2676 blocked for more than 120 seconds.
>> [  +0.000054] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>> disables this message.
>> [  +0.000057] bcache_allocato D 0000000000000001     0  2676      2 0x00000080
>> [  +0.000004]  ffff880220edfc78 0000000000000046 ffff8800b8ed0b80
>> ffff880220edffd8
>> [  +0.000005]  ffff880220edffd8 ffff880220edffd8 ffff8800b8ed0b80
>> ffff880220edfce0
>> [  +0.000004]  ffff8800b8ed0b80 ffff88022ddd8870 ffff8800b9cae320
>> 0000000000000001
>> [  +0.000004] Call Trace:
>> [  +0.000005]  [<ffffffff8163a909>] schedule+0x29/0x70
>> [  +0.000010]  [<ffffffffa0457c05>] closure_sync+0x25/0x90 [bcache]
>> [  +0.000011]  [<ffffffffa045c4b8>] bch_journal+0x128/0x340 [bcache]
>> [  +0.000009]  [<ffffffffa045c717>] bch_journal_meta+0x47/0x70 [bcache]
>> [  +0.000005]  [<ffffffff8163a909>] ? schedule+0x29/0x70
>> [  +0.000009]  [<ffffffffa0457c05>] ? closure_sync+0x25/0x90 [bcache]
>> [  +0.000011]  [<ffffffffa0462f5b>] ? prio_io+0xfb/0x170 [bcache]
>> [  +0.000004]  [<ffffffff81638b42>] ? mutex_lock+0x12/0x2f
>> [  +0.000011]  [<ffffffffa0465407>] bch_prio_write+0x237/0x340 [bcache]
>> [  +0.000008]  [<ffffffffa044ccf0>] ? invalidate_buckets+0x9c0/0x9c0 [bcache]
>> [  +0.000009]  [<ffffffffa044d0b8>] bch_allocator_thread+0x3c8/0x3d0 [bcache]
>> [  +0.000008]  [<ffffffffa044ccf0>] ? invalidate_buckets+0x9c0/0x9c0 [bcache]
>> [  +0.000007]  [<ffffffff810a5aef>] kthread+0xcf/0xe0
>> [  +0.000005]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
>> [  +0.000005]  [<ffffffff81645858>] ret_from_fork+0x58/0x90
>> [  +0.000006]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
>>
>> The trace info as fallow:
>> [root@localhost ~]# perf record -e 'bcache:*' -o /tmp/perf.data -a
>> ^C[ perf record: Woken up 1 times to write data ]
>> [bcache] with build id 4ce18096008788e940370b93dcb21b5711b7bed6 not
>> found, continuing without symbols
>> [ perf record: Captured and wrote 0.207 MB /tmp/perf.data (675 samples) ]
>> [root@localhost ~]#
>> [root@localhost ~]# cd /tmp/
>> [root@localhost tmp]# perf script
>>             bash  2615 [004]    97.000138: bcache:bcache_btree_read:
>> bucket 186634
>>             .......
>>             bash  2615 [004]    98.708044: bcache:bcache_btree_read:
>> bucket 122988
>>             bash  2615 [004]    98.711088: bcache:bcache_btree_read:
>> bucket 270990
>>             bash  2615 [004]    98.741592:
>> bcache:bcache_journal_replay_key: 3:978301904 len 16 dirty 1
>>         kthreadd  2676 [002]    98.756234: bcache:bcache_alloc:
>> allocated 8,98 sector=1024
>>         kthreadd  2676 [002]    98.760511: bcache:bcache_alloc:
>> allocated 8,98 sector=1024
>>         kthreadd  2676 [002]    98.762685: bcache:bcache_journal_full: (nil)U
>>
>> ====
>> Looks like deadlock between the SSD register command and
>> bcache_allocator thread? The trace info tell us the journal is full,
>> so the allocator thread waits on
>> bch_prio_write()->prio_io()->bch_journal_meta(), but there is no
>> RESERVE_BTREE buckets to use for journal replay at this time, so
>> register waits on bch_journal_replay()->bch_btree_insert()
>>
>> 2016-04-04 9:07 GMT+08:00 Eric Wheeler <bcache@xxxxxxxxxxxxxxxxxx>:
>> > Hello all,
>> >
>> > I have a test system that won't register.  I can register either the
>> > caching dev or the cached dev fine, but as soon as I register the
>> > sencond dev, bash hangs when echoing into /sys/fs/bcache/register .
>> > I can register in either order (cache first or cached dev first) and
>> > the deadlock still presents.
>> >
>> > I've narrowed down the problem to these two call stacks:
>> >
>> >
>> > == The allocator thread is one half of the deadlock:
>> >
>> > [  405.619895] INFO: task bcache_allocato:3494 blocked for more than 5 seconds.
>> > [  405.620897]       Tainted: G        W  O    4.1.20+ #5
>> > [  405.621732] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> > [  405.623132] bcache_allocato D ffff88007b78fc98     0  3494      2 0x00000080
>> > [  405.624406]  ffff88007b78fc98 ffff88007b78fc68 0000000000000002 ffff88007c8ddb80
>> > [  405.626241]  ffff8800796e0000 ffff88007b78fc78 ffff88007b790000 0000000000000005
>> > [  405.627890]  ffff880079cf0028 0000000000000001 0000000000000001 ffff88007b78fcb8
>> > [  405.629730] Call Trace:
>> > [  405.630362]  [<ffffffff817a9f47>] schedule+0x37/0x90
>> > [  405.631230]  [<ffffffffa048e4f0>] bch_bucket_alloc+0x1b0/0x670 [bcache]
>> > [  405.632261]  [<ffffffff81103860>] ? prepare_to_wait_event+0x110/0x110
>> > [  405.633274]  [<ffffffffa04a90d5>] bch_prio_write+0x1b5/0x390 [bcache]
>> > [  405.634362]  [<ffffffffa048e19d>] bch_allocator_thread+0x31d/0x4c0 [bcache]
>> > [  405.635496]  [<ffffffffa048de80>] ? invalidate_buckets+0x980/0x980 [bcache]
>> > [  405.636655]  [<ffffffff810d734e>] kthread+0xfe/0x120
>> > [  405.637604]  [<ffffffff817b0440>] ? _raw_spin_unlock_irq+0x30/0x50
>> > [  405.638569]  [<ffffffff810d7250>] ? kthread_create_on_node+0x240/0x240
>> > [  405.639628]  [<ffffffff817b11a2>] ret_from_fork+0x42/0x70
>> > [  405.640570]  [<ffffffff810d7250>] ? kthread_create_on_node+0x240/0x240
>> > [  405.641568] no locks held by bcache_allocato/3494.
>> >
>> > There is a comment in the code inside of bch_allocator_thread() before it
>> > calls bch_prio_write(ca):
>> >         360  /*
>> >         361   * This could deadlock if an allocation with a btree
>> >         362   * node locked ever blocked - having the btree node
>> >         363   * locked would block garbage collection, but here we're
>> >         364   * waiting on garbage collection before we invalidate
>> >         365   * and free anything.
>> >         366   *
>> >         367   * But this should be safe since the btree code always
>> >         368   * uses btree_check_reserve() before allocating now, and
>> >         369   * if it fails it blocks without btree nodes locked.
>> >         370   */
>> >         371   if (!fifo_full(&ca->free_inc))
>> >         372           goto retry_invalidate;
>> >         373
>> >         374   bch_prio_write(ca);
>> >
>> > I think I'm hitting the deadlock which this comment speaks to.
>> >
>> >
>> > == This is the other side of the deadlock caused by registering the cache device:
>> >   echo /dev/sdb > /sys/fs/bcache/register
>> >
>> > [  405.578073] INFO: task bash:3490 blocked for more than 5 seconds:
>> > [  405.580255] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> > [ ... ]
>> > [  405.588984]  [<ffffffff817a9f47>] schedule+0x37/0x90
>> > [  405.589986]  [<ffffffffa048e4f0>] bch_bucket_alloc+0x1b0/0x670 [bcache]
>> > [  405.591084]  [<ffffffff81103860>] ? prepare_to_wait_event+0x110/0x110
>> > [  405.592078]  [<ffffffffa048eb59>] __bch_bucket_alloc_set+0x109/0x1a0 [bcache]
>> > [  405.593113]  [<ffffffffa048ec40>] bch_bucket_alloc_set+0x50/0x70 [bcache]
>> > [  405.594153]  [<ffffffffa04a7927>] __uuid_write+0x67/0x160 [bcache]
>> > [  405.595223]  [<ffffffffa04a8a06>] bch_uuid_write+0x16/0x40 [bcache]
>> > [  405.596273]  [<ffffffffa04a9877>] bch_cached_dev_attach+0x157/0x490 [bcache]
>> > [  405.597384]  [<ffffffffa04a6b68>] ? __write_super+0x148/0x180 [bcache]
>> > [  405.598432]  [<ffffffffa04a8986>] ? bcache_write_super+0x1d6/0x240 [bcache]
>> > [  405.599464]  [<ffffffffa04aa761>] run_cache_set+0x601/0x910 [bcache]
>> > [  405.600548]  [<ffffffffa04ac0ce>] register_bcache+0xeae/0x1430 [bcache]
>> > [...]
>> > [  405.611623] 4 locks held by bash/3490:
>> > [  405.612342]  #0:  (sb_writers#3){.+.+.+}, at: [<ffffffff8126f773>] vfs_write+0x183/0x1b0
>> > [  405.614200]  #1:  (&of->mutex){+.+.+.}, at: [<ffffffff812faaa6>] kernfs_fop_write+0x66/0x1a0
>> > [  405.615994]  #2:  (s_active#194){.+.+.+}, at: [<ffffffff812faaae>] kernfs_fop_write+0x6e/0x1a0
>> > [  405.617925]  #3:  (&bch_register_lock){+.+.+.}, at: [<ffffffffa04abe70>] register_bcache+0xc50/0x1430 [bcache]
>> >
>> >
>> > I'm rather perplexed as to why this is deadlocking because
>> > bch_bucket_alloc_set() locks c->bucket_lock, and the allocator holds
>> > ca->set->bucket_lock before calling bch_prio_write() .  I checked,
>> > this is the same lock (same memory position).
>> >
>> > This implies that the allocator waits on bch_bucket_alloc_set()
>> > which was invoked by bash through register_bcache, and register_bcache
>> > waits on the allocator's call to bch_prio_write().
>> >
>> > Things that I've tried which don't work or make the problem worse:
>> >   * Adding a mutex inside bch_bucket_alloc so that only one may proceed at a time
>> >   * Holding bch_register_lock in the allocator thread before calling bch_prio_write.
>> >
>> > Does anyone else have insight here that might help solve the problem?
>> >
>> > -Eric
>> >
>> > --
>> > Eric Wheeler
>> > --
>> > To unsubscribe from this list: send the line "unsubscribe linux-bcache" in
>> > the body of a message to majordomo@xxxxxxxxxxxxxxx
>> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
--
To unsubscribe from this list: send the line "unsubscribe linux-bcache" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html