Re: bcache register deadlock in bcache_bucket_alloc() between bch_uuid_write and bch_prio_write

Hi Kent & Eric,
Unfortunately, I can't reproduce that problem in my environment. I
think the following could be a solution; is it safe?

--- ./linux-4.4.48/drivers/md/bcache/btree.c 2017-02-09 15:02:59.000000000 +0800
+++ btree.c 2017-02-14 10:07:56.000000000 +0800
@@ -1866,14 +1866,15 @@
     */
    for_each_cache(ca, c, i) {
        for_each_bucket(b, ca) {
-           if (fifo_full(&ca->free[RESERVE_PRIO]))
+           if (fifo_full(&ca->free[RESERVE_PRIO]) &&
+               fifo_full(&ca->free[RESERVE_BTREE]))
                break;

            if (bch_can_invalidate_bucket(ca, b) &&
                !GC_MARK(b)) {
                __bch_invalidate_one_bucket(ca, b);
-               fifo_push(&ca->free[RESERVE_PRIO],
-                     b - ca->buckets);
+               if (!fifo_push(&ca->free[RESERVE_PRIO], b - ca->buckets))
+                     fifo_push(&ca->free[RESERVE_BTREE], b - ca->buckets);
            }
        }
    }
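
To make the intent of the change explicit: the fill loop in the hunk above
currently stops as soon as RESERVE_PRIO is full, so RESERVE_BTREE can come up
empty and journal replay then has nothing to allocate btree nodes from (the
situation in the traces quoted below). With the change, the loop only stops
once both reserves are topped up, and a bucket that no longer fits into
RESERVE_PRIO is spilled into RESERVE_BTREE instead. Here is a minimal
userspace sketch of that fallback logic, using stand-in fifo helpers rather
than the kernel's DECLARE_FIFO()/fifo_push() macros; the only behaviour it
assumes (which, if I read util.h correctly, matches the kernel's fifo_push())
is that a push reports failure when the fifo is full:

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define FIFO_SLOTS 16   /* arbitrary capacity for the sketch */

/* Stand-in for bcache's fifo; one slot is kept empty to detect "full". */
struct fifo {
        size_t head, tail;
        long   data[FIFO_SLOTS];
};

static bool fifo_full(const struct fifo *f)
{
        return (f->head + 1) % FIFO_SLOTS == f->tail;
}

static bool fifo_push(struct fifo *f, long v)
{
        if (fifo_full(f))
                return false;   /* assumed: push reports failure when full */
        f->data[f->head] = v;
        f->head = (f->head + 1) % FIFO_SLOTS;
        return true;
}

/*
 * Sketch of the patched fill loop: keep invalidating clean buckets until
 * *both* reserves are topped up, preferring RESERVE_PRIO and spilling into
 * RESERVE_BTREE once the prio reserve is full.
 */
static void fill_reserves(struct fifo *prio, struct fifo *btree,
                          const long *buckets, size_t n)
{
        for (size_t i = 0; i < n; i++) {
                if (fifo_full(prio) && fifo_full(btree))
                        break;

                if (!fifo_push(prio, buckets[i]))
                        fifo_push(btree, buckets[i]);
        }
}

int main(void)
{
        struct fifo prio = { 0 }, btree = { 0 };
        long buckets[64];

        for (long i = 0; i < 64; i++)
                buckets[i] = i;

        fill_reserves(&prio, &btree, buckets, 64);

        /* With FIFO_SLOTS == 16, each reserve holds 15 buckets here. */
        printf("prio head=%zu, btree head=%zu\n", prio.head, btree.head);
        return 0;
}

Whether it is actually safe to hand these clean buckets to RESERVE_BTREE at
this point is exactly what I'm asking above; the sketch only illustrates the
intended control flow.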

2017-02-14 9:29 GMT+08:00 Eric Wheeler <bcache@xxxxxxxxxxxxxxxxxx>:
>
> On Mon, 13 Feb 2017, Rui Hua wrote:
>
>> Thanks for reply.
>>
>> Yes, the patch you mentioned is already in my code.
>> There is a difference between that patch and my problem: the problem
>> which that patch solves is a hang on RESERVE_PRIO bucket allocation,
>> but in my environment the allocator thread hangs on RESERVE_BTREE
>> bucket allocation (bch_journal()).
>
> Ahh, ok, interesting.  You could try adding _exact to the allocator like
> the other patch and see if that fixes it.
>
> Kent, does this sound safe?
>
>
> --
> Eric Wheeler
>
>>
>> The call trace that patch addresses:
>> [  405.629730] Call Trace:
>> [  405.630362]  [<ffffffff817a9f47>] schedule+0x37/0x90
>> [  405.631230]  [<ffffffffa048e4f0>] bch_bucket_alloc+0x1b0/0x670 [bcache]
>> [  405.632261]  [<ffffffff81103860>] ? prepare_to_wait_event+0x110/0x110
>> [  405.633274]  [<ffffffffa04a90d5>] bch_prio_write+0x1b5/0x390 [bcache]
>> [  405.634362]  [<ffffffffa048e19d>] bch_allocator_thread+0x31d/0x4c0 [bcache]
>>
>> My call trace:
>> [  +0.000002] Call Trace:
>> [  +0.000003]  [<ffffffff8163a909>] schedule+0x29/0x70
>> [  +0.000005]  [<ffffffffa0453c05>] closure_sync+0x25/0x90 [bcache]
>> [  +0.000004]  [<ffffffffa04584b8>] bch_journal+0x128/0x340 [bcache]
>> [  +0.000005]  [<ffffffffa0458717>] bch_journal_meta+0x47/0x70 [bcache]
>> [  +0.000002]  [<ffffffff8163a909>] ? schedule+0x29/0x70
>> [  +0.000004]  [<ffffffffa0453c05>] ? closure_sync+0x25/0x90 [bcache]
>> [  +0.000005]  [<ffffffffa045ef5b>] ? prio_io+0xfb/0x170 [bcache]
>> [  +0.000002]  [<ffffffff81638b42>] ? mutex_lock+0x12/0x2f
>> [  +0.000005]  [<ffffffffa0461407>] bch_prio_write+0x237/0x340 [bcache]
>> [  +0.000004]  [<ffffffffa0448cf0>] ? invalidate_buckets+0x9c0/0x9c0 [bcache]
>> [  +0.000004]  [<ffffffffa04490b8>] bch_allocator_thread+0x3c8/0x3d0 [bcache]
>>
>>
>>
>> 2017-02-12 7:49 GMT+08:00 Eric Wheeler <bcache@xxxxxxxxxxxxxxxxxx>:
>> > On Sat, 11 Feb 2017, Rui Hua wrote:
>> >
>> >> Hi Eric,
>> >> Has this problem been fixed?
>> >> I'm hitting a deadlock similar to yours: I backported the bcache
>> >> code from 4.4.24 to kernel 3.10.0-327.el7.x86_64 (CentOS 7.2), and it
>> >> had been working fine for many days (writeback mode, the caching device
>> >> is an Intel DC S3700 200G with 5 HDD disks attached).
>> >
>> > Maybe try this patch:
>> >
>> > https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=2d64cbc819b13ebd503780f39552827516f4ce4a
>> >
>> > It's the only one I can think of.  However, it might already be in 4.4.24.
>> > You would need to check and see if you have this or not.
>> >
>> >
>> > --
>> > Eric Wheeler
>> >
>> >
>> >> Today I rebooted the system while running some XFS benchmarks over all
>> >> the cache devices. After the system started up, the register
>> >> command (echo /dev/sdg > /sys/fs/bcache/register) hung, and the dmesg
>> >> output is as follows (/dev/sdg is my S3700 SSD):
>> >>
>> >> [Feb10 20:31] INFO: task bash:2615 blocked for more than 120 seconds.
>> >> [  +0.000059] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> >> [  +0.000059] bash            D ffff88022f86a060     0  2615   2611 0x00000080
>> >> [  +0.000007]  ffff88022e833b28 0000000000000082 ffff88022f985c00 ffff88022e833fd8
>> >> [  +0.000006]  ffff88022e833fd8 ffff88022e833fd8 ffff88022f985c00 ffff88022ddc0000
>> >> [  +0.000004]  00000000fffffffc ffff88022f86a000 ffff88022e833bb0 ffff88022f86a060
>> >> [  +0.000005] Call Trace:
>> >> [  +0.000013]  [<ffffffff8163a909>] schedule+0x29/0x70
>> >> [  +0.000016]  [<ffffffffa045701d>] __bch_btree_map_nodes+0x12d/0x150 [bcache]
>> >> [  +0.000010]  [<ffffffffa0456700>] ? bch_btree_insert_check_key+0x1d0/0x1d0 [bcache]
>> >> [  +0.000010]  [<ffffffffa0457131>] bch_btree_insert+0xf1/0x170 [bcache]
>> >> [  +0.000006]  [<ffffffff810a6ae0>] ? wake_up_atomic_t+0x30/0x30
>> >> [  +0.000011]  [<ffffffffa045bbda>] bch_journal_replay+0x14a/0x260 [bcache]
>> >> [  +0.000007]  [<ffffffff8101cd69>] ? sched_clock+0x9/0x10
>> >> [  +0.000008]  [<ffffffff810b8a66>] ? try_to_wake_up+0x1b6/0x300
>> >> [  +0.000013]  [<ffffffffa04673dd>] run_cache_set+0x56d/0x950 [bcache]
>> >> [  +0.000012]  [<ffffffffa0468ec3>] register_bcache+0xde3/0x1370 [bcache]
>> >> [  +0.000008]  [<ffffffff812f4d0f>] kobj_attr_store+0xf/0x20
>> >> [  +0.000005]  [<ffffffff81259176>] sysfs_write_file+0xc6/0x140
>> >> [  +0.000006]  [<ffffffff811de5cd>] vfs_write+0xbd/0x1e0
>> >> [  +0.000003]  [<ffffffff811df06f>] SyS_write+0x7f/0xe0
>> >> [  +0.000006]  [<ffffffff81645909>] system_call_fastpath+0x16/0x1b
>> >> [  +0.000006] INFO: task bcache_allocato:2676 blocked for more than 120 seconds.
>> >> [  +0.000054] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> >> [  +0.000057] bcache_allocato D 0000000000000001     0  2676      2 0x00000080
>> >> [  +0.000004]  ffff880220edfc78 0000000000000046 ffff8800b8ed0b80 ffff880220edffd8
>> >> [  +0.000005]  ffff880220edffd8 ffff880220edffd8 ffff8800b8ed0b80 ffff880220edfce0
>> >> [  +0.000004]  ffff8800b8ed0b80 ffff88022ddd8870 ffff8800b9cae320 0000000000000001
>> >> [  +0.000004] Call Trace:
>> >> [  +0.000005]  [<ffffffff8163a909>] schedule+0x29/0x70
>> >> [  +0.000010]  [<ffffffffa0457c05>] closure_sync+0x25/0x90 [bcache]
>> >> [  +0.000011]  [<ffffffffa045c4b8>] bch_journal+0x128/0x340 [bcache]
>> >> [  +0.000009]  [<ffffffffa045c717>] bch_journal_meta+0x47/0x70 [bcache]
>> >> [  +0.000005]  [<ffffffff8163a909>] ? schedule+0x29/0x70
>> >> [  +0.000009]  [<ffffffffa0457c05>] ? closure_sync+0x25/0x90 [bcache]
>> >> [  +0.000011]  [<ffffffffa0462f5b>] ? prio_io+0xfb/0x170 [bcache]
>> >> [  +0.000004]  [<ffffffff81638b42>] ? mutex_lock+0x12/0x2f
>> >> [  +0.000011]  [<ffffffffa0465407>] bch_prio_write+0x237/0x340 [bcache]
>> >> [  +0.000008]  [<ffffffffa044ccf0>] ? invalidate_buckets+0x9c0/0x9c0 [bcache]
>> >> [  +0.000009]  [<ffffffffa044d0b8>] bch_allocator_thread+0x3c8/0x3d0 [bcache]
>> >> [  +0.000008]  [<ffffffffa044ccf0>] ? invalidate_buckets+0x9c0/0x9c0 [bcache]
>> >> [  +0.000007]  [<ffffffff810a5aef>] kthread+0xcf/0xe0
>> >> [  +0.000005]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
>> >> [  +0.000005]  [<ffffffff81645858>] ret_from_fork+0x58/0x90
>> >> [  +0.000006]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
>> >>
>> >> The trace info is as follows:
>> >> [root@localhost ~]# perf record -e 'bcache:*' -o /tmp/perf.data -a
>> >> ^C[ perf record: Woken up 1 times to write data ]
>> >> [bcache] with build id 4ce18096008788e940370b93dcb21b5711b7bed6 not
>> >> found, continuing without symbols
>> >> [ perf record: Captured and wrote 0.207 MB /tmp/perf.data (675 samples) ]
>> >> [root@localhost ~]#
>> >> [root@localhost ~]# cd /tmp/
>> >> [root@localhost tmp]# perf script
>> >>             bash  2615 [004]    97.000138: bcache:bcache_btree_read: bucket 186634
>> >>             .......
>> >>             bash  2615 [004]    98.708044: bcache:bcache_btree_read: bucket 122988
>> >>             bash  2615 [004]    98.711088: bcache:bcache_btree_read: bucket 270990
>> >>             bash  2615 [004]    98.741592: bcache:bcache_journal_replay_key: 3:978301904 len 16 dirty 1
>> >>         kthreadd  2676 [002]    98.756234: bcache:bcache_alloc: allocated 8,98 sector=1024
>> >>         kthreadd  2676 [002]    98.760511: bcache:bcache_alloc: allocated 8,98 sector=1024
>> >>         kthreadd  2676 [002]    98.762685: bcache:bcache_journal_full: (nil)U
>> >>
>> >> ====
>> >> Looks like a deadlock between the SSD register command and the
>> >> bcache_allocator thread? The trace info tells us the journal is full,
>> >> so the allocator thread waits in
>> >> bch_prio_write()->prio_io()->bch_journal_meta(), but there are no
>> >> RESERVE_BTREE buckets available for journal replay at this time, so
>> >> register waits in bch_journal_replay()->bch_btree_insert().
>> >>
>> >> 2016-04-04 9:07 GMT+08:00 Eric Wheeler <bcache@xxxxxxxxxxxxxxxxxx>:
>> >> > Hello all,
>> >> >
>> >> > I have a test system that won't register.  I can register either the
>> >> > caching dev or the cached dev fine, but as soon as I register the
>> >> > second dev, bash hangs when echoing into /sys/fs/bcache/register.
>> >> > I can register in either order (cache first or cached dev first) and
>> >> > the deadlock still occurs.
>> >> >
>> >> > I've narrowed down the problem to these two call stacks:
>> >> >
>> >> >
>> >> > == The allocator thread is one half of the deadlock:
>> >> >
>> >> > [  405.619895] INFO: task bcache_allocato:3494 blocked for more than 5 seconds.
>> >> > [  405.620897]       Tainted: G        W  O    4.1.20+ #5
>> >> > [  405.621732] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> >> > [  405.623132] bcache_allocato D ffff88007b78fc98     0  3494      2 0x00000080
>> >> > [  405.624406]  ffff88007b78fc98 ffff88007b78fc68 0000000000000002 ffff88007c8ddb80
>> >> > [  405.626241]  ffff8800796e0000 ffff88007b78fc78 ffff88007b790000 0000000000000005
>> >> > [  405.627890]  ffff880079cf0028 0000000000000001 0000000000000001 ffff88007b78fcb8
>> >> > [  405.629730] Call Trace:
>> >> > [  405.630362]  [<ffffffff817a9f47>] schedule+0x37/0x90
>> >> > [  405.631230]  [<ffffffffa048e4f0>] bch_bucket_alloc+0x1b0/0x670 [bcache]
>> >> > [  405.632261]  [<ffffffff81103860>] ? prepare_to_wait_event+0x110/0x110
>> >> > [  405.633274]  [<ffffffffa04a90d5>] bch_prio_write+0x1b5/0x390 [bcache]
>> >> > [  405.634362]  [<ffffffffa048e19d>] bch_allocator_thread+0x31d/0x4c0 [bcache]
>> >> > [  405.635496]  [<ffffffffa048de80>] ? invalidate_buckets+0x980/0x980 [bcache]
>> >> > [  405.636655]  [<ffffffff810d734e>] kthread+0xfe/0x120
>> >> > [  405.637604]  [<ffffffff817b0440>] ? _raw_spin_unlock_irq+0x30/0x50
>> >> > [  405.638569]  [<ffffffff810d7250>] ? kthread_create_on_node+0x240/0x240
>> >> > [  405.639628]  [<ffffffff817b11a2>] ret_from_fork+0x42/0x70
>> >> > [  405.640570]  [<ffffffff810d7250>] ? kthread_create_on_node+0x240/0x240
>> >> > [  405.641568] no locks held by bcache_allocato/3494.
>> >> >
>> >> > There is a comment in the code inside of bch_allocator_thread() before it
>> >> > calls bch_prio_write(ca):
>> >> >         360  /*
>> >> >         361   * This could deadlock if an allocation with a btree
>> >> >         362   * node locked ever blocked - having the btree node
>> >> >         363   * locked would block garbage collection, but here we're
>> >> >         364   * waiting on garbage collection before we invalidate
>> >> >         365   * and free anything.
>> >> >         366   *
>> >> >         367   * But this should be safe since the btree code always
>> >> >         368   * uses btree_check_reserve() before allocating now, and
>> >> >         369   * if it fails it blocks without btree nodes locked.
>> >> >         370   */
>> >> >         371   if (!fifo_full(&ca->free_inc))
>> >> >         372           goto retry_invalidate;
>> >> >         373
>> >> >         374   bch_prio_write(ca);
>> >> >
>> >> > I think I'm hitting the deadlock which this comment speaks to.
>> >> >
>> >> >
>> >> > == This is the other side of the deadlock caused by registering the cache device:
>> >> >   echo /dev/sdb > /sys/fs/bcache/register
>> >> >
>> >> > [  405.578073] INFO: task bash:3490 blocked for more than 5 seconds:
>> >> > [  405.580255] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> >> > [ ... ]
>> >> > [  405.588984]  [<ffffffff817a9f47>] schedule+0x37/0x90
>> >> > [  405.589986]  [<ffffffffa048e4f0>] bch_bucket_alloc+0x1b0/0x670 [bcache]
>> >> > [  405.591084]  [<ffffffff81103860>] ? prepare_to_wait_event+0x110/0x110
>> >> > [  405.592078]  [<ffffffffa048eb59>] __bch_bucket_alloc_set+0x109/0x1a0 [bcache]
>> >> > [  405.593113]  [<ffffffffa048ec40>] bch_bucket_alloc_set+0x50/0x70 [bcache]
>> >> > [  405.594153]  [<ffffffffa04a7927>] __uuid_write+0x67/0x160 [bcache]
>> >> > [  405.595223]  [<ffffffffa04a8a06>] bch_uuid_write+0x16/0x40 [bcache]
>> >> > [  405.596273]  [<ffffffffa04a9877>] bch_cached_dev_attach+0x157/0x490 [bcache]
>> >> > [  405.597384]  [<ffffffffa04a6b68>] ? __write_super+0x148/0x180 [bcache]
>> >> > [  405.598432]  [<ffffffffa04a8986>] ? bcache_write_super+0x1d6/0x240 [bcache]
>> >> > [  405.599464]  [<ffffffffa04aa761>] run_cache_set+0x601/0x910 [bcache]
>> >> > [  405.600548]  [<ffffffffa04ac0ce>] register_bcache+0xeae/0x1430 [bcache]
>> >> > [...]
>> >> > [  405.611623] 4 locks held by bash/3490:
>> >> > [  405.612342]  #0:  (sb_writers#3){.+.+.+}, at: [<ffffffff8126f773>] vfs_write+0x183/0x1b0
>> >> > [  405.614200]  #1:  (&of->mutex){+.+.+.}, at: [<ffffffff812faaa6>] kernfs_fop_write+0x66/0x1a0
>> >> > [  405.615994]  #2:  (s_active#194){.+.+.+}, at: [<ffffffff812faaae>] kernfs_fop_write+0x6e/0x1a0
>> >> > [  405.617925]  #3:  (&bch_register_lock){+.+.+.}, at: [<ffffffffa04abe70>] register_bcache+0xc50/0x1430 [bcache]
>> >> >
>> >> >
>> >> > I'm rather perplexed as to why this is deadlocking because
>> >> > bch_bucket_alloc_set() locks c->bucket_lock, and the allocator holds
>> >> > ca->set->bucket_lock before calling bch_prio_write().  I checked,
>> >> > this is the same lock (same memory address).
>> >> >
>> >> > This implies that the allocator waits on bch_bucket_alloc_set()
>> >> > which was invoked by bash through register_bcache, and register_bcache
>> >> > waits on the allocator's call to bch_prio_write().
>> >> >
>> >> > Things that I've tried which don't work or make the problem worse:
>> >> >   * Adding a mutex inside bch_bucket_alloc so that only one may proceed at a time
>> >> >   * Holding bch_register_lock in the allocator thread before calling bch_prio_write.
>> >> >
>> >> > Does anyone else have insight here that might help solve the problem?
>> >> >
>> >> > -Eric
>> >> >
>> >> > --
>> >> > Eric Wheeler
>> >>
>>
--
To unsubscribe from this list: send the line "unsubscribe linux-bcache" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


