Re: bcache register deadlock in bcache_bucket_alloc() between bch_uuid_write and bch_prio_write

Eric Wheeler <bcache@xxxxxxxxxxxxxxxxxx> · Fri, 17 Feb 2017 14:31:13 -0800 (PST)

On Tue, 14 Feb 2017, Rui Hua wrote:

> Hi Kent & Eric,
> Unfortunately I can't reproduce that problem in my environment. I
> think this is a solution. Is it safe?

Do you mean that the problem has stopped appearing on its own, or that
this patch fixed it?

Filling RESERVE_BTREE early looks like a neat solution.  I think we
might want Kent's opinion here.
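
To make the effect of the quoted patch easier to reason about, here is a
minimal userspace sketch of the patched fill loop. It is a toy model, not
bcache code: toy_fifo, toy_fifo_push, toy_fill_reserves and the FIFO sizes
are hypothetical names and values made up for illustration, and it assumes
every bucket can be invalidated, which the real loop checks per bucket. The
one behavior taken from the patch is that a push to a full FIFO fails and
the bucket then spills into RESERVE_BTREE.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for the two reserves touched by the patch. */
enum reserve { RESERVE_BTREE, RESERVE_PRIO, RESERVE_NR };

#define FIFO_CAP 8

struct toy_fifo {
	size_t used;
	size_t size;            /* capacity in use, must be <= FIFO_CAP */
	long   data[FIFO_CAP];  /* bucket indexes */
};

static bool toy_fifo_full(const struct toy_fifo *f)
{
	return f->used == f->size;
}

/* Returns false when the FIFO is full, which the patch relies on. */
static bool toy_fifo_push(struct toy_fifo *f, long bucket)
{
	assert(f->size <= FIFO_CAP);
	if (toy_fifo_full(f))
		return false;
	f->data[f->used++] = bucket;
	return true;
}

/*
 * Model of the patched loop: stop only once BOTH reserves are full,
 * and fall back to RESERVE_BTREE when RESERVE_PRIO rejects a push.
 * Returns how many buckets were placed into a reserve.
 */
static size_t toy_fill_reserves(struct toy_fifo free_lists[RESERVE_NR],
				long nbuckets)
{
	size_t pushed = 0;

	for (long b = 0; b < nbuckets; b++) {
		if (toy_fifo_full(&free_lists[RESERVE_PRIO]) &&
		    toy_fifo_full(&free_lists[RESERVE_BTREE]))
			break;

		if (!toy_fifo_push(&free_lists[RESERVE_PRIO], b))
			toy_fifo_push(&free_lists[RESERVE_BTREE], b);
		pushed++;
	}
	return pushed;
}
```

With a RESERVE_PRIO of size 2 and a RESERVE_BTREE of size 3, the model puts
buckets 0 and 1 into RESERVE_PRIO and buckets 2 to 4 into RESERVE_BTREE,
then stops. The unpatched loop would have stopped after bucket 1 and left
RESERVE_BTREE empty.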

--
Eric Wheeler

> 
> --- ./linux-4.4.48/drivers/md/bcache/btree.c 2017-02-09 15:02:59.000000000 +0800
> +++ btree.c 2017-02-14 10:07:56.000000000 +0800
> @@ -1866,14 +1866,15 @@
>      */
>     for_each_cache(ca, c, i) {
>         for_each_bucket(b, ca) {
> -           if (fifo_full(&ca->free[RESERVE_PRIO]))
> +          if (fifo_full(&ca->free[RESERVE_PRIO]) &&
> +                fifo_full(&ca->free[RESERVE_BTREE]))
>                 break;
> 
>             if (bch_can_invalidate_bucket(ca, b) &&
>                 !GC_MARK(b)) {
>                 __bch_invalidate_one_bucket(ca, b);
> -               fifo_push(&ca->free[RESERVE_PRIO],
> -                     b - ca->buckets);
> +               if (!fifo_push(&ca->free[RESERVE_PRIO], b - ca->buckets))
> +                     fifo_push(&ca->free[RESERVE_BTREE], b - ca->buckets);
>             }
>         }
>     }
> 
> 2017-02-14 9:29 GMT+08:00 Eric Wheeler <bcache@xxxxxxxxxxxxxxxxxx>:
> >
> > On Mon, 13 Feb 2017, Rui Hua wrote:
> >
> >> Thanks for the reply.
> >>
> >> Yes, the patch you mentioned is already in my code.
> >> That patch addresses a different problem from mine: it fixes a hang
> >> on RESERVE_PRIO bucket allocation, but in my environment the
> >> allocator thread hangs on RESERVE_BTREE bucket allocation
> >> (in bch_journal()).
> >
> > Ahh, ok, interesting.  You could try adding _exact to the allocator like
> > the other patch and see if that fixes it.
> >
> > Kent, does this sound safe?
> >
> >
> > --
> > Eric Wheeler
> >
> >>
> >> The call trace from that patch:
> >> [  405.629730] Call Trace:
> >> [  405.630362]  [<ffffffff817a9f47>] schedule+0x37/0x90
> >> [  405.631230]  [<ffffffffa048e4f0>] bch_bucket_alloc+0x1b0/0x670 [bcache]
> >> [  405.632261]  [<ffffffff81103860>] ? prepare_to_wait_event+0x110/0x110
> >> [  405.633274]  [<ffffffffa04a90d5>] bch_prio_write+0x1b5/0x390 [bcache]
> >> [  405.634362]  [<ffffffffa048e19d>] bch_allocator_thread+0x31d/0x4c0 [bcache]
> >>
> >> My calltrace:
> >> [  +0.000002] Call Trace:
> >> [  +0.000003]  [<ffffffff8163a909>] schedule+0x29/0x70
> >> [  +0.000005]  [<ffffffffa0453c05>] closure_sync+0x25/0x90 [bcache]
> >> [  +0.000004]  [<ffffffffa04584b8>] bch_journal+0x128/0x340 [bcache]
> >> [  +0.000005]  [<ffffffffa0458717>] bch_journal_meta+0x47/0x70 [bcache]
> >> [  +0.000002]  [<ffffffff8163a909>] ? schedule+0x29/0x70
> >> [  +0.000004]  [<ffffffffa0453c05>] ? closure_sync+0x25/0x90 [bcache]
> >> [  +0.000005]  [<ffffffffa045ef5b>] ? prio_io+0xfb/0x170 [bcache]
> >> [  +0.000002]  [<ffffffff81638b42>] ? mutex_lock+0x12/0x2f
> >> [  +0.000005]  [<ffffffffa0461407>] bch_prio_write+0x237/0x340 [bcache]
> >> [  +0.000004]  [<ffffffffa0448cf0>] ? invalidate_buckets+0x9c0/0x9c0 [bcache]
> >> [  +0.000004]  [<ffffffffa04490b8>] bch_allocator_thread+0x3c8/0x3d0 [bcache]
> >>
> >>
> >>
> >> 2017-02-12 7:49 GMT+08:00 Eric Wheeler <bcache@xxxxxxxxxxxxxxxxxx>:
> >> > On Sat, 11 Feb 2017, Rui Hua wrote:
> >> >
> >> >> Hi Eric,
> >> >> Has this problem been fixed?
> >> >> I'm hitting a deadlock similar to yours. I backported the bcache
> >> >> code from 4.4.24 to kernel 3.10.0-327.el7.x86_64 (CentOS 7.2). It had
> >> >> been working fine for many days (writeback mode; the caching device is
> >> >> an Intel DC S3700 200G, with 5 HDDs attached).
> >> >
> >> > Maybe try this patch:
> >> >
> >> > https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=2d64cbc819b13ebd503780f39552827516f4ce4a
> >> >
> >> > It's the only one I can think of.  However, it might already be in 4.4.24.
> >> > You would need to check and see if you have this or not.
> >> >
> >> >
> >> > --
> >> > Eric Wheeler
> >> >
> >> >
> >> >> Today I rebooted the system while running some XFS benchmarks over all
> >> >> cachedevs. After the system started up, the register
> >> >> command (echo /dev/sdg > /sys/fs/bcache/register) hung, with the dmesg
> >> >> output below (/dev/sdg is my S3700 SSD):
> >> >>
> >> >> [Feb10 20:31] INFO: task bash:2615 blocked for more than 120 seconds.
> >> >> [  +0.000059] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >> >> [  +0.000059] bash            D ffff88022f86a060     0  2615   2611 0x00000080
> >> >> [  +0.000007]  ffff88022e833b28 0000000000000082 ffff88022f985c00 ffff88022e833fd8
> >> >> [  +0.000006]  ffff88022e833fd8 ffff88022e833fd8 ffff88022f985c00 ffff88022ddc0000
> >> >> [  +0.000004]  00000000fffffffc ffff88022f86a000 ffff88022e833bb0 ffff88022f86a060
> >> >> [  +0.000005] Call Trace:
> >> >> [  +0.000013]  [<ffffffff8163a909>] schedule+0x29/0x70
> >> >> [  +0.000016]  [<ffffffffa045701d>] __bch_btree_map_nodes+0x12d/0x150 [bcache]
> >> >> [  +0.000010]  [<ffffffffa0456700>] ? bch_btree_insert_check_key+0x1d0/0x1d0 [bcache]
> >> >> [  +0.000010]  [<ffffffffa0457131>] bch_btree_insert+0xf1/0x170 [bcache]
> >> >> [  +0.000006]  [<ffffffff810a6ae0>] ? wake_up_atomic_t+0x30/0x30
> >> >> [  +0.000011]  [<ffffffffa045bbda>] bch_journal_replay+0x14a/0x260 [bcache]
> >> >> [  +0.000007]  [<ffffffff8101cd69>] ? sched_clock+0x9/0x10
> >> >> [  +0.000008]  [<ffffffff810b8a66>] ? try_to_wake_up+0x1b6/0x300
> >> >> [  +0.000013]  [<ffffffffa04673dd>] run_cache_set+0x56d/0x950 [bcache]
> >> >> [  +0.000012]  [<ffffffffa0468ec3>] register_bcache+0xde3/0x1370 [bcache]
> >> >> [  +0.000008]  [<ffffffff812f4d0f>] kobj_attr_store+0xf/0x20
> >> >> [  +0.000005]  [<ffffffff81259176>] sysfs_write_file+0xc6/0x140
> >> >> [  +0.000006]  [<ffffffff811de5cd>] vfs_write+0xbd/0x1e0
> >> >> [  +0.000003]  [<ffffffff811df06f>] SyS_write+0x7f/0xe0
> >> >> [  +0.000006]  [<ffffffff81645909>] system_call_fastpath+0x16/0x1b
> >> >> [  +0.000006] INFO: task bcache_allocato:2676 blocked for more than 120 seconds.
> >> >> [  +0.000054] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >> >> [  +0.000057] bcache_allocato D 0000000000000001     0  2676      2 0x00000080
> >> >> [  +0.000004]  ffff880220edfc78 0000000000000046 ffff8800b8ed0b80 ffff880220edffd8
> >> >> [  +0.000005]  ffff880220edffd8 ffff880220edffd8 ffff8800b8ed0b80 ffff880220edfce0
> >> >> [  +0.000004]  ffff8800b8ed0b80 ffff88022ddd8870 ffff8800b9cae320 0000000000000001
> >> >> [  +0.000004] Call Trace:
> >> >> [  +0.000005]  [<ffffffff8163a909>] schedule+0x29/0x70
> >> >> [  +0.000010]  [<ffffffffa0457c05>] closure_sync+0x25/0x90 [bcache]
> >> >> [  +0.000011]  [<ffffffffa045c4b8>] bch_journal+0x128/0x340 [bcache]
> >> >> [  +0.000009]  [<ffffffffa045c717>] bch_journal_meta+0x47/0x70 [bcache]
> >> >> [  +0.000005]  [<ffffffff8163a909>] ? schedule+0x29/0x70
> >> >> [  +0.000009]  [<ffffffffa0457c05>] ? closure_sync+0x25/0x90 [bcache]
> >> >> [  +0.000011]  [<ffffffffa0462f5b>] ? prio_io+0xfb/0x170 [bcache]
> >> >> [  +0.000004]  [<ffffffff81638b42>] ? mutex_lock+0x12/0x2f
> >> >> [  +0.000011]  [<ffffffffa0465407>] bch_prio_write+0x237/0x340 [bcache]
> >> >> [  +0.000008]  [<ffffffffa044ccf0>] ? invalidate_buckets+0x9c0/0x9c0 [bcache]
> >> >> [  +0.000009]  [<ffffffffa044d0b8>] bch_allocator_thread+0x3c8/0x3d0 [bcache]
> >> >> [  +0.000008]  [<ffffffffa044ccf0>] ? invalidate_buckets+0x9c0/0x9c0 [bcache]
> >> >> [  +0.000007]  [<ffffffff810a5aef>] kthread+0xcf/0xe0
> >> >> [  +0.000005]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
> >> >> [  +0.000005]  [<ffffffff81645858>] ret_from_fork+0x58/0x90
> >> >> [  +0.000006]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
> >> >>
> >> >> The trace info is as follows:
> >> >> [root@localhost ~]# perf record -e 'bcache:*' -o /tmp/perf.data -a
> >> >> ^C[ perf record: Woken up 1 times to write data ]
> >> >> [bcache] with build id 4ce18096008788e940370b93dcb21b5711b7bed6 not found, continuing without symbols
> >> >> [ perf record: Captured and wrote 0.207 MB /tmp/perf.data (675 samples) ]
> >> >> [root@localhost ~]#
> >> >> [root@localhost ~]# cd /tmp/
> >> >> [root@localhost tmp]# perf script
> >> >>             bash  2615 [004]    97.000138: bcache:bcache_btree_read: bucket 186634
> >> >>             .......
> >> >>             bash  2615 [004]    98.708044: bcache:bcache_btree_read: bucket 122988
> >> >>             bash  2615 [004]    98.711088: bcache:bcache_btree_read: bucket 270990
> >> >>             bash  2615 [004]    98.741592: bcache:bcache_journal_replay_key: 3:978301904 len 16 dirty 1
> >> >>         kthreadd  2676 [002]    98.756234: bcache:bcache_alloc: allocated 8,98 sector=1024
> >> >>         kthreadd  2676 [002]    98.760511: bcache:bcache_alloc: allocated 8,98 sector=1024
> >> >>         kthreadd  2676 [002]    98.762685: bcache:bcache_journal_full: (nil)U
> >> >>
> >> >> ====
> >> >> Is this a deadlock between the SSD register command and the
> >> >> bcache_allocator thread? The trace info tells us the journal is full,
> >> >> so the allocator thread waits in
> >> >> bch_prio_write()->prio_io()->bch_journal_meta(), but there are no
> >> >> RESERVE_BTREE buckets available for journal replay at this time, so
> >> >> register waits in bch_journal_replay()->bch_btree_insert().
> >> >>
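
The circular wait described above can be written down as a small wait-for
graph. This is only a sketch of the reasoning, not bcache code: the node
names, toy_build_graph and toy_has_cycle are hypothetical. It also assumes
that journal space is only reclaimed once replay in the register task makes
progress, which the report implies but does not state outright.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the two blocked tasks and what they wait on. */
enum node {
	TASK_REGISTER,      /* bash: register -> bch_journal_replay() */
	TASK_ALLOCATOR,     /* bcache_allocator thread in bch_prio_write() */
	RES_BTREE_BUCKET,   /* a free RESERVE_BTREE bucket */
	RES_JOURNAL_SPACE,  /* room in the (full) journal */
	NODE_NR
};

/* waits_on[a] == b: "a cannot progress until b does"; -1: not waiting. */
static void toy_build_graph(int waits_on[NODE_NR], bool btree_reserve_prefilled)
{
	for (int i = 0; i < NODE_NR; i++)
		waits_on[i] = -1;

	/* Journal is full, so the allocator blocks in bch_journal_meta(). */
	waits_on[TASK_ALLOCATOR] = RES_JOURNAL_SPACE;
	/* Assumption: journal space frees up only as replay progresses. */
	waits_on[RES_JOURNAL_SPACE] = TASK_REGISTER;

	if (!btree_reserve_prefilled) {
		/* Replay needs a btree bucket that only the allocator supplies. */
		waits_on[TASK_REGISTER] = RES_BTREE_BUCKET;
		waits_on[RES_BTREE_BUCKET] = TASK_ALLOCATOR;
	}
}

/* True if following the wait edges from 'start' leads back to 'start'. */
static bool toy_has_cycle(const int waits_on[NODE_NR], int start)
{
	int cur = start;

	assert(start >= 0 && start < NODE_NR);
	for (int steps = 0; steps < NODE_NR; steps++) {
		cur = waits_on[cur];
		if (cur < 0)
			return false;
		if (cur == start)
			return true;
	}
	return false;
}
```

With an empty RESERVE_BTREE the graph contains the cycle register ->
btree bucket -> allocator -> journal space -> register. If the reserve
already holds buckets when replay starts, the register task has no
outgoing edge and the cycle disappears in this model.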
> >> >> 2016-04-04 9:07 GMT+08:00 Eric Wheeler <bcache@xxxxxxxxxxxxxxxxxx>:
> >> >> > Hello all,
> >> >> >
> >> >> > I have a test system that won't register.  I can register either the
> >> >> > caching dev or the cached dev fine, but as soon as I register the
> >> >> > second dev, bash hangs when echoing into /sys/fs/bcache/register.
> >> >> > I can register in either order (cache first or cached dev first) and
> >> >> > the deadlock still presents.
> >> >> >
> >> >> > I've narrowed down the problem to these two call stacks:
> >> >> >
> >> >> >
> >> >> > == The allocator thread is one half of the deadlock:
> >> >> >
> >> >> > [  405.619895] INFO: task bcache_allocato:3494 blocked for more than 5 seconds.
> >> >> > [  405.620897]       Tainted: G        W  O    4.1.20+ #5
> >> >> > [  405.621732] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >> >> > [  405.623132] bcache_allocato D ffff88007b78fc98     0  3494      2 0x00000080
> >> >> > [  405.624406]  ffff88007b78fc98 ffff88007b78fc68 0000000000000002 ffff88007c8ddb80
> >> >> > [  405.626241]  ffff8800796e0000 ffff88007b78fc78 ffff88007b790000 0000000000000005
> >> >> > [  405.627890]  ffff880079cf0028 0000000000000001 0000000000000001 ffff88007b78fcb8
> >> >> > [  405.629730] Call Trace:
> >> >> > [  405.630362]  [<ffffffff817a9f47>] schedule+0x37/0x90
> >> >> > [  405.631230]  [<ffffffffa048e4f0>] bch_bucket_alloc+0x1b0/0x670 [bcache]
> >> >> > [  405.632261]  [<ffffffff81103860>] ? prepare_to_wait_event+0x110/0x110
> >> >> > [  405.633274]  [<ffffffffa04a90d5>] bch_prio_write+0x1b5/0x390 [bcache]
> >> >> > [  405.634362]  [<ffffffffa048e19d>] bch_allocator_thread+0x31d/0x4c0 [bcache]
> >> >> > [  405.635496]  [<ffffffffa048de80>] ? invalidate_buckets+0x980/0x980 [bcache]
> >> >> > [  405.636655]  [<ffffffff810d734e>] kthread+0xfe/0x120
> >> >> > [  405.637604]  [<ffffffff817b0440>] ? _raw_spin_unlock_irq+0x30/0x50
> >> >> > [  405.638569]  [<ffffffff810d7250>] ? kthread_create_on_node+0x240/0x240
> >> >> > [  405.639628]  [<ffffffff817b11a2>] ret_from_fork+0x42/0x70
> >> >> > [  405.640570]  [<ffffffff810d7250>] ? kthread_create_on_node+0x240/0x240
> >> >> > [  405.641568] no locks held by bcache_allocato/3494.
> >> >> >
> >> >> > There is a comment in the code inside of bch_allocator_thread() before it
> >> >> > calls bch_prio_write(ca):
> >> >> >         /*
> >> >> >          * This could deadlock if an allocation with a btree
> >> >> >          * node locked ever blocked - having the btree node
> >> >> >          * locked would block garbage collection, but here we're
> >> >> >          * waiting on garbage collection before we invalidate
> >> >> >          * and free anything.
> >> >> >          *
> >> >> >          * But this should be safe since the btree code always
> >> >> >          * uses btree_check_reserve() before allocating now, and
> >> >> >          * if it fails it blocks without btree nodes locked.
> >> >> >          */
> >> >> >         if (!fifo_full(&ca->free_inc))
> >> >> >                 goto retry_invalidate;
> >> >> >
> >> >> >         bch_prio_write(ca);
> >> >> >
> >> >> > I think I'm hitting the deadlock which this comment speaks to.
> >> >> >
> >> >> >
> >> >> > == This is the other side of the deadlock caused by registering the cache device:
> >> >> >   echo /dev/sdb > /sys/fs/bcache/register
> >> >> >
> >> >> > [  405.578073] INFO: task bash:3490 blocked for more than 5 seconds:
> >> >> > [  405.580255] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >> >> > [ ... ]
> >> >> > [  405.588984]  [<ffffffff817a9f47>] schedule+0x37/0x90
> >> >> > [  405.589986]  [<ffffffffa048e4f0>] bch_bucket_alloc+0x1b0/0x670 [bcache]
> >> >> > [  405.591084]  [<ffffffff81103860>] ? prepare_to_wait_event+0x110/0x110
> >> >> > [  405.592078]  [<ffffffffa048eb59>] __bch_bucket_alloc_set+0x109/0x1a0 [bcache]
> >> >> > [  405.593113]  [<ffffffffa048ec40>] bch_bucket_alloc_set+0x50/0x70 [bcache]
> >> >> > [  405.594153]  [<ffffffffa04a7927>] __uuid_write+0x67/0x160 [bcache]
> >> >> > [  405.595223]  [<ffffffffa04a8a06>] bch_uuid_write+0x16/0x40 [bcache]
> >> >> > [  405.596273]  [<ffffffffa04a9877>] bch_cached_dev_attach+0x157/0x490 [bcache]
> >> >> > [  405.597384]  [<ffffffffa04a6b68>] ? __write_super+0x148/0x180 [bcache]
> >> >> > [  405.598432]  [<ffffffffa04a8986>] ? bcache_write_super+0x1d6/0x240 [bcache]
> >> >> > [  405.599464]  [<ffffffffa04aa761>] run_cache_set+0x601/0x910 [bcache]
> >> >> > [  405.600548]  [<ffffffffa04ac0ce>] register_bcache+0xeae/0x1430 [bcache]
> >> >> > [...]
> >> >> > [  405.611623] 4 locks held by bash/3490:
> >> >> > [  405.612342]  #0:  (sb_writers#3){.+.+.+}, at: [<ffffffff8126f773>] vfs_write+0x183/0x1b0
> >> >> > [  405.614200]  #1:  (&of->mutex){+.+.+.}, at: [<ffffffff812faaa6>] kernfs_fop_write+0x66/0x1a0
> >> >> > [  405.615994]  #2:  (s_active#194){.+.+.+}, at: [<ffffffff812faaae>] kernfs_fop_write+0x6e/0x1a0
> >> >> > [  405.617925]  #3:  (&bch_register_lock){+.+.+.}, at: [<ffffffffa04abe70>] register_bcache+0xc50/0x1430 [bcache]
> >> >> >
> >> >> >
> >> >> > I'm rather perplexed as to why this is deadlocking because
> >> >> > bch_bucket_alloc_set() locks c->bucket_lock, and the allocator holds
> >> >> > ca->set->bucket_lock before calling bch_prio_write() .  I checked,
> >> >> > this is the same lock (same memory position).
> >> >> >
> >> >> > This implies that the allocator waits on bch_bucket_alloc_set()
> >> >> > which was invoked by bash through register_bcache, and register_bcache
> >> >> > waits on the allocator's call to bch_prio_write().
> >> >> >
> >> >> > Things that I've tried which don't work or make the problem worse:
> >> >> >   * Adding a mutex inside bch_bucket_alloc so that only one may proceed at a time
> >> >> >   * Holding bch_register_lock in the allocator thread before calling bch_prio_write.
> >> >> >
> >> >> > Does anyone else have insight here that might help solve the problem?
> >> >> >
> >> >> > -Eric
> >> >> >
> >> >> > --
> >> >> > Eric Wheeler
> >> >> > --
> >> >> > To unsubscribe from this list: send the line "unsubscribe linux-bcache" in
> >> >> > the body of a message to majordomo@xxxxxxxxxxxxxxx
> >> >> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> >>
> >>