Re: bcache register deadlock in bcache_bucket_alloc() between bch_uuid_write and bch_prio_write

Rui Hua <flowerui@xxxxxxxxx> · Sat, 11 Feb 2017 16:09:03 +0800

Hi Eric,
Does this problem fixed?
I'm hitting a deadlock which is similar to yours: I backport bcache
code from 4.4.24 to kernel 3.10.0-327.el7.x86_64(CentOS7.2), It has
been working fine many days (writeback mode, the caching device is a
Intel DC S3700 200G, attached 5 HDD disks).
Today, I reboot the system  when running some XFS benchmarks over all
cachedevs, After the system is startup, the task of register
command(echo /dev/sdg > /sys/fs/bcache/register) hang, and the dmesg
info as follow (/dev/sdg is my S3700 SSD):

[Feb10 20:31] INFO: task bash:2615 blocked for more than 120 seconds.
[  +0.000059] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[  +0.000059] bash            D ffff88022f86a060     0  2615   2611 0x00000080
[  +0.000007]  ffff88022e833b28 0000000000000082 ffff88022f985c00
ffff88022e833fd8
[  +0.000006]  ffff88022e833fd8 ffff88022e833fd8 ffff88022f985c00
ffff88022ddc0000
[  +0.000004]  00000000fffffffc ffff88022f86a000 ffff88022e833bb0
ffff88022f86a060
[  +0.000005] Call Trace:
[  +0.000013]  [<ffffffff8163a909>] schedule+0x29/0x70
[  +0.000016]  [<ffffffffa045701d>] __bch_btree_map_nodes+0x12d/0x150 [bcache]
[  +0.000010]  [<ffffffffa0456700>] ?
bch_btree_insert_check_key+0x1d0/0x1d0 [bcache]
[  +0.000010]  [<ffffffffa0457131>] bch_btree_insert+0xf1/0x170 [bcache]
[  +0.000006]  [<ffffffff810a6ae0>] ? wake_up_atomic_t+0x30/0x30
[  +0.000011]  [<ffffffffa045bbda>] bch_journal_replay+0x14a/0x260 [bcache]
[  +0.000007]  [<ffffffff8101cd69>] ? sched_clock+0x9/0x10
[  +0.000008]  [<ffffffff810b8a66>] ? try_to_wake_up+0x1b6/0x300
[  +0.000013]  [<ffffffffa04673dd>] run_cache_set+0x56d/0x950 [bcache]
[  +0.000012]  [<ffffffffa0468ec3>] register_bcache+0xde3/0x1370 [bcache]
[  +0.000008]  [<ffffffff812f4d0f>] kobj_attr_store+0xf/0x20
[  +0.000005]  [<ffffffff81259176>] sysfs_write_file+0xc6/0x140
[  +0.000006]  [<ffffffff811de5cd>] vfs_write+0xbd/0x1e0
[  +0.000003]  [<ffffffff811df06f>] SyS_write+0x7f/0xe0
[  +0.000006]  [<ffffffff81645909>] system_call_fastpath+0x16/0x1b
[  +0.000006] INFO: task bcache_allocato:2676 blocked for more than 120 seconds.
[  +0.000054] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[  +0.000057] bcache_allocato D 0000000000000001     0  2676      2 0x00000080
[  +0.000004]  ffff880220edfc78 0000000000000046 ffff8800b8ed0b80
ffff880220edffd8
[  +0.000005]  ffff880220edffd8 ffff880220edffd8 ffff8800b8ed0b80
ffff880220edfce0
[  +0.000004]  ffff8800b8ed0b80 ffff88022ddd8870 ffff8800b9cae320
0000000000000001
[  +0.000004] Call Trace:
[  +0.000005]  [<ffffffff8163a909>] schedule+0x29/0x70
[  +0.000010]  [<ffffffffa0457c05>] closure_sync+0x25/0x90 [bcache]
[  +0.000011]  [<ffffffffa045c4b8>] bch_journal+0x128/0x340 [bcache]
[  +0.000009]  [<ffffffffa045c717>] bch_journal_meta+0x47/0x70 [bcache]
[  +0.000005]  [<ffffffff8163a909>] ? schedule+0x29/0x70
[  +0.000009]  [<ffffffffa0457c05>] ? closure_sync+0x25/0x90 [bcache]
[  +0.000011]  [<ffffffffa0462f5b>] ? prio_io+0xfb/0x170 [bcache]
[  +0.000004]  [<ffffffff81638b42>] ? mutex_lock+0x12/0x2f
[  +0.000011]  [<ffffffffa0465407>] bch_prio_write+0x237/0x340 [bcache]
[  +0.000008]  [<ffffffffa044ccf0>] ? invalidate_buckets+0x9c0/0x9c0 [bcache]
[  +0.000009]  [<ffffffffa044d0b8>] bch_allocator_thread+0x3c8/0x3d0 [bcache]
[  +0.000008]  [<ffffffffa044ccf0>] ? invalidate_buckets+0x9c0/0x9c0 [bcache]
[  +0.000007]  [<ffffffff810a5aef>] kthread+0xcf/0xe0
[  +0.000005]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
[  +0.000005]  [<ffffffff81645858>] ret_from_fork+0x58/0x90
[  +0.000006]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140

The trace info as fallow:
[root@localhost ~]# perf record -e 'bcache:*' -o /tmp/perf.data -a
^C[ perf record: Woken up 1 times to write data ]
[bcache] with build id 4ce18096008788e940370b93dcb21b5711b7bed6 not
found, continuing without symbols
[ perf record: Captured and wrote 0.207 MB /tmp/perf.data (675 samples) ]
[root@localhost ~]#
[root@localhost ~]# cd /tmp/
[root@localhost tmp]# perf script
            bash  2615 [004]    97.000138: bcache:bcache_btree_read:
bucket 186634
            .......
            bash  2615 [004]    98.708044: bcache:bcache_btree_read:
bucket 122988
            bash  2615 [004]    98.711088: bcache:bcache_btree_read:
bucket 270990
            bash  2615 [004]    98.741592:
bcache:bcache_journal_replay_key: 3:978301904 len 16 dirty 1
        kthreadd  2676 [002]    98.756234: bcache:bcache_alloc:
allocated 8,98 sector=1024
        kthreadd  2676 [002]    98.760511: bcache:bcache_alloc:
allocated 8,98 sector=1024
        kthreadd  2676 [002]    98.762685: bcache:bcache_journal_full: (nil)U

====
Looks like deadlock between the SSD register command and
bcache_allocator thread? The trace info tell us the journal is full,
so the allocator thread waits on
bch_prio_write()->prio_io()->bch_journal_meta(), but there is no
RESERVE_BTREE buckets to use for journal replay at this time, so
register waits on bch_journal_replay()->bch_btree_insert()

2016-04-04 9:07 GMT+08:00 Eric Wheeler <bcache@xxxxxxxxxxxxxxxxxx>:
> Hello all,
>
> I have a test system that won't register.  I can register either the
> caching dev or the cached dev fine, but as soon as I register the
> sencond dev, bash hangs when echoing into /sys/fs/bcache/register .
> I can register in either order (cache first or cached dev first) and
> the deadlock still presents.
>
> I've narrowed down the problem to these two call stacks:
>
>
> == The allocator thread is one half of the deadlock:
>
> [  405.619895] INFO: task bcache_allocato:3494 blocked for more than 5 seconds.
> [  405.620897]       Tainted: G        W  O    4.1.20+ #5
> [  405.621732] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  405.623132] bcache_allocato D ffff88007b78fc98     0  3494      2 0x00000080
> [  405.624406]  ffff88007b78fc98 ffff88007b78fc68 0000000000000002 ffff88007c8ddb80
> [  405.626241]  ffff8800796e0000 ffff88007b78fc78 ffff88007b790000 0000000000000005
> [  405.627890]  ffff880079cf0028 0000000000000001 0000000000000001 ffff88007b78fcb8
> [  405.629730] Call Trace:
> [  405.630362]  [<ffffffff817a9f47>] schedule+0x37/0x90
> [  405.631230]  [<ffffffffa048e4f0>] bch_bucket_alloc+0x1b0/0x670 [bcache]
> [  405.632261]  [<ffffffff81103860>] ? prepare_to_wait_event+0x110/0x110
> [  405.633274]  [<ffffffffa04a90d5>] bch_prio_write+0x1b5/0x390 [bcache]
> [  405.634362]  [<ffffffffa048e19d>] bch_allocator_thread+0x31d/0x4c0 [bcache]
> [  405.635496]  [<ffffffffa048de80>] ? invalidate_buckets+0x980/0x980 [bcache]
> [  405.636655]  [<ffffffff810d734e>] kthread+0xfe/0x120
> [  405.637604]  [<ffffffff817b0440>] ? _raw_spin_unlock_irq+0x30/0x50
> [  405.638569]  [<ffffffff810d7250>] ? kthread_create_on_node+0x240/0x240
> [  405.639628]  [<ffffffff817b11a2>] ret_from_fork+0x42/0x70
> [  405.640570]  [<ffffffff810d7250>] ? kthread_create_on_node+0x240/0x240
> [  405.641568] no locks held by bcache_allocato/3494.
>
> There is a comment in the code inside of bch_allocator_thread() before it
> calls bch_prio_write(ca):
>         360  /*
>         361   * This could deadlock if an allocation with a btree
>         362   * node locked ever blocked - having the btree node
>         363   * locked would block garbage collection, but here we're
>         364   * waiting on garbage collection before we invalidate
>         365   * and free anything.
>         366   *
>         367   * But this should be safe since the btree code always
>         368   * uses btree_check_reserve() before allocating now, and
>         369   * if it fails it blocks without btree nodes locked.
>         370   */
>         371   if (!fifo_full(&ca->free_inc))
>         372           goto retry_invalidate;
>         373
>         374   bch_prio_write(ca);
>
> I think I'm hitting the deadlock which this comment speaks to.
>
>
> == This is the other side of the deadlock caused by registering the cache device:
>   echo /dev/sdb > /sys/fs/bcache/register
>
> [  405.578073] INFO: task bash:3490 blocked for more than 5 seconds:
> [  405.580255] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ ... ]
> [  405.588984]  [<ffffffff817a9f47>] schedule+0x37/0x90
> [  405.589986]  [<ffffffffa048e4f0>] bch_bucket_alloc+0x1b0/0x670 [bcache]
> [  405.591084]  [<ffffffff81103860>] ? prepare_to_wait_event+0x110/0x110
> [  405.592078]  [<ffffffffa048eb59>] __bch_bucket_alloc_set+0x109/0x1a0 [bcache]
> [  405.593113]  [<ffffffffa048ec40>] bch_bucket_alloc_set+0x50/0x70 [bcache]
> [  405.594153]  [<ffffffffa04a7927>] __uuid_write+0x67/0x160 [bcache]
> [  405.595223]  [<ffffffffa04a8a06>] bch_uuid_write+0x16/0x40 [bcache]
> [  405.596273]  [<ffffffffa04a9877>] bch_cached_dev_attach+0x157/0x490 [bcache]
> [  405.597384]  [<ffffffffa04a6b68>] ? __write_super+0x148/0x180 [bcache]
> [  405.598432]  [<ffffffffa04a8986>] ? bcache_write_super+0x1d6/0x240 [bcache]
> [  405.599464]  [<ffffffffa04aa761>] run_cache_set+0x601/0x910 [bcache]
> [  405.600548]  [<ffffffffa04ac0ce>] register_bcache+0xeae/0x1430 [bcache]
> [...]
> [  405.611623] 4 locks held by bash/3490:
> [  405.612342]  #0:  (sb_writers#3){.+.+.+}, at: [<ffffffff8126f773>] vfs_write+0x183/0x1b0
> [  405.614200]  #1:  (&of->mutex){+.+.+.}, at: [<ffffffff812faaa6>] kernfs_fop_write+0x66/0x1a0
> [  405.615994]  #2:  (s_active#194){.+.+.+}, at: [<ffffffff812faaae>] kernfs_fop_write+0x6e/0x1a0
> [  405.617925]  #3:  (&bch_register_lock){+.+.+.}, at: [<ffffffffa04abe70>] register_bcache+0xc50/0x1430 [bcache]
>
>
> I'm rather perplexed as to why this is deadlocking because
> bch_bucket_alloc_set() locks c->bucket_lock, and the allocator holds
> ca->set->bucket_lock before calling bch_prio_write() .  I checked,
> this is the same lock (same memory position).
>
> This implies that the allocator waits on bch_bucket_alloc_set()
> which was invoked by bash through register_bcache, and register_bcache
> waits on the allocator's call to bch_prio_write().
>
> Things that I've tried which don't work or make the problem worse:
>   * Adding a mutex inside bch_bucket_alloc so that only one may proceed at a time
>   * Holding bch_register_lock in the allocator thread before calling bch_prio_write.
>
> Does anyone else have insight here that might help solve the problem?
>
> -Eric
>
> --
> Eric Wheeler
> --
> To unsubscribe from this list: send the line "unsubscribe linux-bcache" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-bcache" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html