Re: bcache register deadlock in bcache_bucket_alloc() between bch_uuid_write and bch_prio_write

I'm not sure whether this problem can be fixed by this patch, because
my environment was damaged before I suggested the patch :-(, and I
couldn't reproduce the problem again after that.

But I dug into the code from the log and trace, and found the path
where my register command is probably blocked:
run_cache_set()
  -> bch_journal_replay()
      -> bch_btree_insert()
          -> btree_insert_fn()
              -> bch_btree_insert_node()
                  -> btree_split()
                      -> btree_check_reserve() ---- here we find the
RESERVE_BTREE fifo is empty, and then schedule out (see the sketch below)...
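
For context, btree_check_reserve() in the 4.4-era btree.c does roughly
the following (an abbreviated paraphrase from memory, not a verbatim
excerpt):

static struct btree *btree_check_reserve(struct btree *b, struct btree_op *op)
{
        struct cache_set *c = b->c;
        struct cache *ca;
        unsigned i, reserve = (c->root->level - b->level) * 2 + 1;

        mutex_lock(&c->bucket_lock);

        for_each_cache(ca, c, i)
                if (fifo_used(&ca->free[RESERVE_BTREE]) < reserve) {
                        if (op)
                                prepare_to_wait(&c->btree_cache_wait, &op->wait,
                                                TASK_UNINTERRUPTIBLE);
                        mutex_unlock(&c->bucket_lock);
                        /* the caller sees the error and schedules out; this
                         * is where register sleeps during journal replay */
                        return ERR_PTR(-EINTR);
                }

        mutex_unlock(&c->bucket_lock);

        /* remainder (btree cache cannibalize locking) elided */
        return NULL;
}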

And the path where the allocator thread is probably blocked:
bch_allocator_thread()
  -> bch_prio_write()
     -> bch_journal_meta()
        -> bch_journal()
            -> journal_wait_for_write() ---- waits here because the journal
is full; does the btree flush also need a RESERVE_BTREE bucket? (sketch below)
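
The journal-full branch of journal_wait_for_write() in journal.c looks
roughly like this (again an abbreviated paraphrase; the trace point is
the bcache_journal_full event seen in my perf output quoted below):

        while (1) {
                struct journal_write *w = c->journal.cur;

                /* ... return w if the current entry has room for nkeys ... */

                if (wait)
                        closure_wait(&c->journal.wait, &cl);

                if (journal_full(&c->journal)) {
                        trace_bcache_journal_full(c);

                        journal_reclaim(c);
                        spin_unlock(&c->journal.lock);

                        /* Progress needs a dirty btree node flushed out; if
                         * that flush itself needs a RESERVE_BTREE bucket,
                         * neither side can move. */
                        btree_flush_write(c);
                }

                closure_sync(&cl);      /* the closure_sync in the stacks */
                spin_lock(&c->journal.lock);
                wait = true;
        }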

Therefore, I think filling RESERVE_BTREE early is a possible solution,
although it has not been validated.
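
For reference, with my diff (quoted below) applied, the bucket-priming
loop in btree.c (bch_initial_gc_finish() in 4.4, if I read the hunk
offset right) would read as follows; this is only a sketch of the idea,
since I could not re-test it:

        for_each_cache(ca, c, i) {
                for_each_bucket(b, ca) {
                        /* stop once both reserves are primed */
                        if (fifo_full(&ca->free[RESERVE_PRIO]) &&
                            fifo_full(&ca->free[RESERVE_BTREE]))
                                break;

                        if (bch_can_invalidate_bucket(ca, b) &&
                            !GC_MARK(b)) {
                                __bch_invalidate_one_bucket(ca, b);
                                /* once RESERVE_PRIO is full, spill the
                                 * rest into RESERVE_BTREE */
                                if (!fifo_push(&ca->free[RESERVE_PRIO],
                                               b - ca->buckets))
                                        fifo_push(&ca->free[RESERVE_BTREE],
                                                  b - ca->buckets);
                        }
                }
        }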

2017-02-18 6:31 GMT+08:00 Eric Wheeler <bcache@xxxxxxxxxxxxxxxxxx>:
> On Tue, 14 Feb 2017, Rui Hua wrote:
>
>> Hi Kent & Eric,
>> Unfortunately I can't reproduce that problem in my environment. I
>> think this is a solution; is it safe?
>
> So do you mean that your problem is no longer presenting--or did this
> patch fix your problem?
>
> It looks like a neat solution by filling RESERVE_BTREE early.  I think we
> might want Kent's opinion here.
>
> --
> Eric Wheeler
>
>
>>
>> --- ./linux-4.4.48/drivers/md/bcache/btree.c 2017-02-09 15:02:59.000000000 +0800
>> +++ btree.c 2017-02-14 10:07:56.000000000 +0800
>> @@ -1866,14 +1866,15 @@
>>      */
>>     for_each_cache(ca, c, i) {
>>         for_each_bucket(b, ca) {
>> -           if (fifo_full(&ca->free[RESERVE_PRIO]))
>> +           if (fifo_full(&ca->free[RESERVE_PRIO]) &&
>> +               fifo_full(&ca->free[RESERVE_BTREE]))
>>                 break;
>>
>>             if (bch_can_invalidate_bucket(ca, b) &&
>>                 !GC_MARK(b)) {
>>                 __bch_invalidate_one_bucket(ca, b);
>> -               fifo_push(&ca->free[RESERVE_PRIO],
>> -                     b - ca->buckets);
>> +               if (!fifo_push(&ca->free[RESERVE_PRIO], b - ca->buckets))
>> +                       fifo_push(&ca->free[RESERVE_BTREE], b - ca->buckets);
>>             }
>>         }
>>     }
>>
>> 2017-02-14 9:29 GMT+08:00 Eric Wheeler <bcache@xxxxxxxxxxxxxxxxxx>:
>> >
>> > On Mon, 13 Feb 2017, Rui Hua wrote:
>> >
>> >> Thanks for reply.
>> >>
>> >> Yes, the patch you mentioned is already in my code.
>> >> There is a difference between that patch and my problem: the problem
>> >> which that patch solves is a hang on RESERVE_PRIO bucket allocation,
>> >> but in my environment the allocator thread hangs on RESERVE_BTREE
>> >> bucket allocation (bch_journal())
>> >
>> > Ahh, ok, interesting.  You could try adding _exact to the allocator like
>> > the other patch and see if that fixes it.
>> >
>> > Kent, does this sound safe?
>> >
>> >
>> > --
>> > Eric Wheeler
>> >
>> >>
>> >> The call trace from that patch:
>> >> [  405.629730] Call Trace:
>> >> [  405.630362]  [<ffffffff817a9f47>] schedule+0x37/0x90
>> >> [  405.631230]  [<ffffffffa048e4f0>] bch_bucket_alloc+0x1b0/0x670 [bcache]
>> >> [  405.632261]  [<ffffffff81103860>] ? prepare_to_wait_event+0x110/0x110
>> >> [  405.633274]  [<ffffffffa04a90d5>] bch_prio_write+0x1b5/0x390 [bcache]
>> >> [  405.634362]  [<ffffffffa048e19d>] bch_allocator_thread+0x31d/0x4c0 [bcache]
>> >>
>> >> My call trace:
>> >> [  +0.000002] Call Trace:
>> >> [  +0.000003]  [<ffffffff8163a909>] schedule+0x29/0x70
>> >> [  +0.000005]  [<ffffffffa0453c05>] closure_sync+0x25/0x90 [bcache]
>> >> [  +0.000004]  [<ffffffffa04584b8>] bch_journal+0x128/0x340 [bcache]
>> >> [  +0.000005]  [<ffffffffa0458717>] bch_journal_meta+0x47/0x70 [bcache]
>> >> [  +0.000002]  [<ffffffff8163a909>] ? schedule+0x29/0x70
>> >> [  +0.000004]  [<ffffffffa0453c05>] ? closure_sync+0x25/0x90 [bcache]
>> >> [  +0.000005]  [<ffffffffa045ef5b>] ? prio_io+0xfb/0x170 [bcache]
>> >> [  +0.000002]  [<ffffffff81638b42>] ? mutex_lock+0x12/0x2f
>> >> [  +0.000005]  [<ffffffffa0461407>] bch_prio_write+0x237/0x340 [bcache]
>> >> [  +0.000004]  [<ffffffffa0448cf0>] ? invalidate_buckets+0x9c0/0x9c0 [bcache]
>> >> [  +0.000004]  [<ffffffffa04490b8>] bch_allocator_thread+0x3c8/0x3d0 [bcache]
>> >>
>> >>
>> >>
>> >> 2017-02-12 7:49 GMT+08:00 Eric Wheeler <bcache@xxxxxxxxxxxxxxxxxx>:
>> >> > On Sat, 11 Feb 2017, Rui Hua wrote:
>> >> >
>> >> >> Hi Eric,
>> >> >> Has this problem been fixed?
>> >> >> I'm hitting a deadlock similar to yours: I backported the bcache
>> >> >> code from 4.4.24 to kernel 3.10.0-327.el7.x86_64 (CentOS 7.2). It had
>> >> >> been working fine for many days (writeback mode; the caching device is
>> >> >> an Intel DC S3700 200G, with 5 HDD disks attached).
>> >> >
>> >> > Maybe try this patch:
>> >> >
>> >> > https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=2d64cbc819b13ebd503780f39552827516f4ce4a
>> >> >
>> >> > It's the only one I can think of.  However, it might already be in 4.4.24.
>> >> > You would need to check and see if you have this or not.
>> >> >
>> >> >
>> >> > --
>> >> > Eric Wheeler
>> >> >
>> >> >
>> >> >> Today I rebooted the system while running some XFS benchmarks over
>> >> >> all cachedevs. After the system started up, the register command
>> >> >> (echo /dev/sdg > /sys/fs/bcache/register) hung, and the dmesg info
>> >> >> is as follows (/dev/sdg is my S3700 SSD):
>> >> >>
>> >> >> [Feb10 20:31] INFO: task bash:2615 blocked for more than 120 seconds.
>> >> >> [  +0.000059] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>> >> >> disables this message.
>> >> >> [  +0.000059] bash            D ffff88022f86a060     0  2615   2611 0x00000080
>> >> >> [  +0.000007]  ffff88022e833b28 0000000000000082 ffff88022f985c00
>> >> >> ffff88022e833fd8
>> >> >> [  +0.000006]  ffff88022e833fd8 ffff88022e833fd8 ffff88022f985c00
>> >> >> ffff88022ddc0000
>> >> >> [  +0.000004]  00000000fffffffc ffff88022f86a000 ffff88022e833bb0
>> >> >> ffff88022f86a060
>> >> >> [  +0.000005] Call Trace:
>> >> >> [  +0.000013]  [<ffffffff8163a909>] schedule+0x29/0x70
>> >> >> [  +0.000016]  [<ffffffffa045701d>] __bch_btree_map_nodes+0x12d/0x150 [bcache]
>> >> >> [  +0.000010]  [<ffffffffa0456700>] ?
>> >> >> bch_btree_insert_check_key+0x1d0/0x1d0 [bcache]
>> >> >> [  +0.000010]  [<ffffffffa0457131>] bch_btree_insert+0xf1/0x170 [bcache]
>> >> >> [  +0.000006]  [<ffffffff810a6ae0>] ? wake_up_atomic_t+0x30/0x30
>> >> >> [  +0.000011]  [<ffffffffa045bbda>] bch_journal_replay+0x14a/0x260 [bcache]
>> >> >> [  +0.000007]  [<ffffffff8101cd69>] ? sched_clock+0x9/0x10
>> >> >> [  +0.000008]  [<ffffffff810b8a66>] ? try_to_wake_up+0x1b6/0x300
>> >> >> [  +0.000013]  [<ffffffffa04673dd>] run_cache_set+0x56d/0x950 [bcache]
>> >> >> [  +0.000012]  [<ffffffffa0468ec3>] register_bcache+0xde3/0x1370 [bcache]
>> >> >> [  +0.000008]  [<ffffffff812f4d0f>] kobj_attr_store+0xf/0x20
>> >> >> [  +0.000005]  [<ffffffff81259176>] sysfs_write_file+0xc6/0x140
>> >> >> [  +0.000006]  [<ffffffff811de5cd>] vfs_write+0xbd/0x1e0
>> >> >> [  +0.000003]  [<ffffffff811df06f>] SyS_write+0x7f/0xe0
>> >> >> [  +0.000006]  [<ffffffff81645909>] system_call_fastpath+0x16/0x1b
>> >> >> [  +0.000006] INFO: task bcache_allocato:2676 blocked for more than 120 seconds.
>> >> >> [  +0.000054] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>> >> >> disables this message.
>> >> >> [  +0.000057] bcache_allocato D 0000000000000001     0  2676      2 0x00000080
>> >> >> [  +0.000004]  ffff880220edfc78 0000000000000046 ffff8800b8ed0b80
>> >> >> ffff880220edffd8
>> >> >> [  +0.000005]  ffff880220edffd8 ffff880220edffd8 ffff8800b8ed0b80
>> >> >> ffff880220edfce0
>> >> >> [  +0.000004]  ffff8800b8ed0b80 ffff88022ddd8870 ffff8800b9cae320
>> >> >> 0000000000000001
>> >> >> [  +0.000004] Call Trace:
>> >> >> [  +0.000005]  [<ffffffff8163a909>] schedule+0x29/0x70
>> >> >> [  +0.000010]  [<ffffffffa0457c05>] closure_sync+0x25/0x90 [bcache]
>> >> >> [  +0.000011]  [<ffffffffa045c4b8>] bch_journal+0x128/0x340 [bcache]
>> >> >> [  +0.000009]  [<ffffffffa045c717>] bch_journal_meta+0x47/0x70 [bcache]
>> >> >> [  +0.000005]  [<ffffffff8163a909>] ? schedule+0x29/0x70
>> >> >> [  +0.000009]  [<ffffffffa0457c05>] ? closure_sync+0x25/0x90 [bcache]
>> >> >> [  +0.000011]  [<ffffffffa0462f5b>] ? prio_io+0xfb/0x170 [bcache]
>> >> >> [  +0.000004]  [<ffffffff81638b42>] ? mutex_lock+0x12/0x2f
>> >> >> [  +0.000011]  [<ffffffffa0465407>] bch_prio_write+0x237/0x340 [bcache]
>> >> >> [  +0.000008]  [<ffffffffa044ccf0>] ? invalidate_buckets+0x9c0/0x9c0 [bcache]
>> >> >> [  +0.000009]  [<ffffffffa044d0b8>] bch_allocator_thread+0x3c8/0x3d0 [bcache]
>> >> >> [  +0.000008]  [<ffffffffa044ccf0>] ? invalidate_buckets+0x9c0/0x9c0 [bcache]
>> >> >> [  +0.000007]  [<ffffffff810a5aef>] kthread+0xcf/0xe0
>> >> >> [  +0.000005]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
>> >> >> [  +0.000005]  [<ffffffff81645858>] ret_from_fork+0x58/0x90
>> >> >> [  +0.000006]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
>> >> >>
>> >> >> The trace info is as follows:
>> >> >> [root@localhost ~]# perf record -e 'bcache:*' -o /tmp/perf.data -a
>> >> >> ^C[ perf record: Woken up 1 times to write data ]
>> >> >> [bcache] with build id 4ce18096008788e940370b93dcb21b5711b7bed6 not
>> >> >> found, continuing without symbols
>> >> >> [ perf record: Captured and wrote 0.207 MB /tmp/perf.data (675 samples) ]
>> >> >> [root@localhost ~]#
>> >> >> [root@localhost ~]# cd /tmp/
>> >> >> [root@localhost tmp]# perf script
>> >> >>             bash  2615 [004]    97.000138: bcache:bcache_btree_read:
>> >> >> bucket 186634
>> >> >>             .......
>> >> >>             bash  2615 [004]    98.708044: bcache:bcache_btree_read:
>> >> >> bucket 122988
>> >> >>             bash  2615 [004]    98.711088: bcache:bcache_btree_read:
>> >> >> bucket 270990
>> >> >>             bash  2615 [004]    98.741592:
>> >> >> bcache:bcache_journal_replay_key: 3:978301904 len 16 dirty 1
>> >> >>         kthreadd  2676 [002]    98.756234: bcache:bcache_alloc:
>> >> >> allocated 8,98 sector=1024
>> >> >>         kthreadd  2676 [002]    98.760511: bcache:bcache_alloc:
>> >> >> allocated 8,98 sector=1024
>> >> >>         kthreadd  2676 [002]    98.762685: bcache:bcache_journal_full: (nil)U
>> >> >>
>> >> >> ====
>> >> >> Looks like a deadlock between the SSD register command and the
>> >> >> bcache_allocator thread? The trace info tells us the journal is full,
>> >> >> so the allocator thread waits in
>> >> >> bch_prio_write()->prio_io()->bch_journal_meta(), but there are no
>> >> >> RESERVE_BTREE buckets available for journal replay at this time, so
>> >> >> register waits in bch_journal_replay()->bch_btree_insert()
>> >> >>
>> >> >> 2016-04-04 9:07 GMT+08:00 Eric Wheeler <bcache@xxxxxxxxxxxxxxxxxx>:
>> >> >> > Hello all,
>> >> >> >
>> >> >> > I have a test system that won't register.  I can register either the
>> >> >> > caching dev or the cached dev fine, but as soon as I register the
>> >> >> > second dev, bash hangs when echoing into /sys/fs/bcache/register .
>> >> >> > I can register in either order (cache first or cached dev first) and
>> >> >> > the deadlock still presents.
>> >> >> >
>> >> >> > I've narrowed down the problem to these two call stacks:
>> >> >> >
>> >> >> >
>> >> >> > == The allocator thread is one half of the deadlock:
>> >> >> >
>> >> >> > [  405.619895] INFO: task bcache_allocato:3494 blocked for more than 5 seconds.
>> >> >> > [  405.620897]       Tainted: G        W  O    4.1.20+ #5
>> >> >> > [  405.621732] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> >> >> > [  405.623132] bcache_allocato D ffff88007b78fc98     0  3494      2 0x00000080
>> >> >> > [  405.624406]  ffff88007b78fc98 ffff88007b78fc68 0000000000000002 ffff88007c8ddb80
>> >> >> > [  405.626241]  ffff8800796e0000 ffff88007b78fc78 ffff88007b790000 0000000000000005
>> >> >> > [  405.627890]  ffff880079cf0028 0000000000000001 0000000000000001 ffff88007b78fcb8
>> >> >> > [  405.629730] Call Trace:
>> >> >> > [  405.630362]  [<ffffffff817a9f47>] schedule+0x37/0x90
>> >> >> > [  405.631230]  [<ffffffffa048e4f0>] bch_bucket_alloc+0x1b0/0x670 [bcache]
>> >> >> > [  405.632261]  [<ffffffff81103860>] ? prepare_to_wait_event+0x110/0x110
>> >> >> > [  405.633274]  [<ffffffffa04a90d5>] bch_prio_write+0x1b5/0x390 [bcache]
>> >> >> > [  405.634362]  [<ffffffffa048e19d>] bch_allocator_thread+0x31d/0x4c0 [bcache]
>> >> >> > [  405.635496]  [<ffffffffa048de80>] ? invalidate_buckets+0x980/0x980 [bcache]
>> >> >> > [  405.636655]  [<ffffffff810d734e>] kthread+0xfe/0x120
>> >> >> > [  405.637604]  [<ffffffff817b0440>] ? _raw_spin_unlock_irq+0x30/0x50
>> >> >> > [  405.638569]  [<ffffffff810d7250>] ? kthread_create_on_node+0x240/0x240
>> >> >> > [  405.639628]  [<ffffffff817b11a2>] ret_from_fork+0x42/0x70
>> >> >> > [  405.640570]  [<ffffffff810d7250>] ? kthread_create_on_node+0x240/0x240
>> >> >> > [  405.641568] no locks held by bcache_allocato/3494.
>> >> >> >
>> >> >> > There is a comment in the code inside bch_allocator_thread(), just
>> >> >> > before it calls bch_prio_write(ca):
>> >> >> >         /*
>> >> >> >          * This could deadlock if an allocation with a btree
>> >> >> >          * node locked ever blocked - having the btree node
>> >> >> >          * locked would block garbage collection, but here we're
>> >> >> >          * waiting on garbage collection before we invalidate
>> >> >> >          * and free anything.
>> >> >> >          *
>> >> >> >          * But this should be safe since the btree code always
>> >> >> >          * uses btree_check_reserve() before allocating now, and
>> >> >> >          * if it fails it blocks without btree nodes locked.
>> >> >> >          */
>> >> >> >         if (!fifo_full(&ca->free_inc))
>> >> >> >                 goto retry_invalidate;
>> >> >> >
>> >> >> >         bch_prio_write(ca);
>> >> >> >
>> >> >> > I think I'm hitting the deadlock which this comment speaks to.
>> >> >> >
>> >> >> >
>> >> >> > == This is the other side of the deadlock caused by registering the cache device:
>> >> >> >   echo /dev/sdb > /sys/fs/bcache/register
>> >> >> >
>> >> >> > [  405.578073] INFO: task bash:3490 blocked for more than 5 seconds:
>> >> >> > [  405.580255] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> >> >> > [ ... ]
>> >> >> > [  405.588984]  [<ffffffff817a9f47>] schedule+0x37/0x90
>> >> >> > [  405.589986]  [<ffffffffa048e4f0>] bch_bucket_alloc+0x1b0/0x670 [bcache]
>> >> >> > [  405.591084]  [<ffffffff81103860>] ? prepare_to_wait_event+0x110/0x110
>> >> >> > [  405.592078]  [<ffffffffa048eb59>] __bch_bucket_alloc_set+0x109/0x1a0 [bcache]
>> >> >> > [  405.593113]  [<ffffffffa048ec40>] bch_bucket_alloc_set+0x50/0x70 [bcache]
>> >> >> > [  405.594153]  [<ffffffffa04a7927>] __uuid_write+0x67/0x160 [bcache]
>> >> >> > [  405.595223]  [<ffffffffa04a8a06>] bch_uuid_write+0x16/0x40 [bcache]
>> >> >> > [  405.596273]  [<ffffffffa04a9877>] bch_cached_dev_attach+0x157/0x490 [bcache]
>> >> >> > [  405.597384]  [<ffffffffa04a6b68>] ? __write_super+0x148/0x180 [bcache]
>> >> >> > [  405.598432]  [<ffffffffa04a8986>] ? bcache_write_super+0x1d6/0x240 [bcache]
>> >> >> > [  405.599464]  [<ffffffffa04aa761>] run_cache_set+0x601/0x910 [bcache]
>> >> >> > [  405.600548]  [<ffffffffa04ac0ce>] register_bcache+0xeae/0x1430 [bcache]
>> >> >> > [...]
>> >> >> > [  405.611623] 4 locks held by bash/3490:
>> >> >> > [  405.612342]  #0:  (sb_writers#3){.+.+.+}, at: [<ffffffff8126f773>] vfs_write+0x183/0x1b0
>> >> >> > [  405.614200]  #1:  (&of->mutex){+.+.+.}, at: [<ffffffff812faaa6>] kernfs_fop_write+0x66/0x1a0
>> >> >> > [  405.615994]  #2:  (s_active#194){.+.+.+}, at: [<ffffffff812faaae>] kernfs_fop_write+0x6e/0x1a0
>> >> >> > [  405.617925]  #3:  (&bch_register_lock){+.+.+.}, at: [<ffffffffa04abe70>] register_bcache+0xc50/0x1430 [bcache]
>> >> >> >
>> >> >> >
>> >> >> > I'm rather perplexed as to why this is deadlocking because
>> >> >> > bch_bucket_alloc_set() locks c->bucket_lock, and the allocator holds
>> >> >> > ca->set->bucket_lock before calling bch_prio_write().  I checked;
>> >> >> > this is the same lock (same memory position).
>> >> >> >
>> >> >> > This implies that the allocator waits on bch_bucket_alloc_set()
>> >> >> > which was invoked by bash through register_bcache, and register_bcache
>> >> >> > waits on the allocator's call to bch_prio_write().
>> >> >> >
>> >> >> > Things that I've tried which don't work or make the problem worse:
>> >> >> >   * Adding a mutex inside bch_bucket_alloc so that only one may proceed at a time
>> >> >> >   * Holding bch_register_lock in the allocator thread before calling bch_prio_write.
>> >> >> >
>> >> >> > Does anyone else have insight here that might help solve the problem?
>> >> >> >
>> >> >> > -Eric
>> >> >> >
>> >> >> > --
>> >> >> > Eric Wheeler
>> >> >>
>> >>


