Hi Eric, Does this problem fixed? I'm hitting a deadlock which is similar to yours: I backport bcache code from 4.4.24 to kernel 3.10.0-327.el7.x86_64(CentOS7.2), It has been working fine many days (writeback mode, the caching device is a Intel DC S3700 200G, attached 5 HDD disks). Today, I reboot the system when running some XFS benchmarks over all cachedevs, After the system is startup, the task of register command(echo /dev/sdg > /sys/fs/bcache/register) hang, and the dmesg info as follow (/dev/sdg is my S3700 SSD): [Feb10 20:31] INFO: task bash:2615 blocked for more than 120 seconds. [ +0.000059] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ +0.000059] bash D ffff88022f86a060 0 2615 2611 0x00000080 [ +0.000007] ffff88022e833b28 0000000000000082 ffff88022f985c00 ffff88022e833fd8 [ +0.000006] ffff88022e833fd8 ffff88022e833fd8 ffff88022f985c00 ffff88022ddc0000 [ +0.000004] 00000000fffffffc ffff88022f86a000 ffff88022e833bb0 ffff88022f86a060 [ +0.000005] Call Trace: [ +0.000013] [<ffffffff8163a909>] schedule+0x29/0x70 [ +0.000016] [<ffffffffa045701d>] __bch_btree_map_nodes+0x12d/0x150 [bcache] [ +0.000010] [<ffffffffa0456700>] ? bch_btree_insert_check_key+0x1d0/0x1d0 [bcache] [ +0.000010] [<ffffffffa0457131>] bch_btree_insert+0xf1/0x170 [bcache] [ +0.000006] [<ffffffff810a6ae0>] ? wake_up_atomic_t+0x30/0x30 [ +0.000011] [<ffffffffa045bbda>] bch_journal_replay+0x14a/0x260 [bcache] [ +0.000007] [<ffffffff8101cd69>] ? sched_clock+0x9/0x10 [ +0.000008] [<ffffffff810b8a66>] ? try_to_wake_up+0x1b6/0x300 [ +0.000013] [<ffffffffa04673dd>] run_cache_set+0x56d/0x950 [bcache] [ +0.000012] [<ffffffffa0468ec3>] register_bcache+0xde3/0x1370 [bcache] [ +0.000008] [<ffffffff812f4d0f>] kobj_attr_store+0xf/0x20 [ +0.000005] [<ffffffff81259176>] sysfs_write_file+0xc6/0x140 [ +0.000006] [<ffffffff811de5cd>] vfs_write+0xbd/0x1e0 [ +0.000003] [<ffffffff811df06f>] SyS_write+0x7f/0xe0 [ +0.000006] [<ffffffff81645909>] system_call_fastpath+0x16/0x1b [ +0.000006] INFO: task bcache_allocato:2676 blocked for more than 120 seconds. [ +0.000054] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ +0.000057] bcache_allocato D 0000000000000001 0 2676 2 0x00000080 [ +0.000004] ffff880220edfc78 0000000000000046 ffff8800b8ed0b80 ffff880220edffd8 [ +0.000005] ffff880220edffd8 ffff880220edffd8 ffff8800b8ed0b80 ffff880220edfce0 [ +0.000004] ffff8800b8ed0b80 ffff88022ddd8870 ffff8800b9cae320 0000000000000001 [ +0.000004] Call Trace: [ +0.000005] [<ffffffff8163a909>] schedule+0x29/0x70 [ +0.000010] [<ffffffffa0457c05>] closure_sync+0x25/0x90 [bcache] [ +0.000011] [<ffffffffa045c4b8>] bch_journal+0x128/0x340 [bcache] [ +0.000009] [<ffffffffa045c717>] bch_journal_meta+0x47/0x70 [bcache] [ +0.000005] [<ffffffff8163a909>] ? schedule+0x29/0x70 [ +0.000009] [<ffffffffa0457c05>] ? closure_sync+0x25/0x90 [bcache] [ +0.000011] [<ffffffffa0462f5b>] ? prio_io+0xfb/0x170 [bcache] [ +0.000004] [<ffffffff81638b42>] ? mutex_lock+0x12/0x2f [ +0.000011] [<ffffffffa0465407>] bch_prio_write+0x237/0x340 [bcache] [ +0.000008] [<ffffffffa044ccf0>] ? invalidate_buckets+0x9c0/0x9c0 [bcache] [ +0.000009] [<ffffffffa044d0b8>] bch_allocator_thread+0x3c8/0x3d0 [bcache] [ +0.000008] [<ffffffffa044ccf0>] ? invalidate_buckets+0x9c0/0x9c0 [bcache] [ +0.000007] [<ffffffff810a5aef>] kthread+0xcf/0xe0 [ +0.000005] [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140 [ +0.000005] [<ffffffff81645858>] ret_from_fork+0x58/0x90 [ +0.000006] [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140 The trace info as fallow: [root@localhost ~]# perf record -e 'bcache:*' -o /tmp/perf.data -a ^C[ perf record: Woken up 1 times to write data ] [bcache] with build id 4ce18096008788e940370b93dcb21b5711b7bed6 not found, continuing without symbols [ perf record: Captured and wrote 0.207 MB /tmp/perf.data (675 samples) ] [root@localhost ~]# [root@localhost ~]# cd /tmp/ [root@localhost tmp]# perf script bash 2615 [004] 97.000138: bcache:bcache_btree_read: bucket 186634 ....... bash 2615 [004] 98.708044: bcache:bcache_btree_read: bucket 122988 bash 2615 [004] 98.711088: bcache:bcache_btree_read: bucket 270990 bash 2615 [004] 98.741592: bcache:bcache_journal_replay_key: 3:978301904 len 16 dirty 1 kthreadd 2676 [002] 98.756234: bcache:bcache_alloc: allocated 8,98 sector=1024 kthreadd 2676 [002] 98.760511: bcache:bcache_alloc: allocated 8,98 sector=1024 kthreadd 2676 [002] 98.762685: bcache:bcache_journal_full: (nil)U ==== Looks like deadlock between the SSD register command and bcache_allocator thread? The trace info tell us the journal is full, so the allocator thread waits on bch_prio_write()->prio_io()->bch_journal_meta(), but there is no RESERVE_BTREE buckets to use for journal replay at this time, so register waits on bch_journal_replay()->bch_btree_insert() 2016-04-04 9:07 GMT+08:00 Eric Wheeler <bcache@xxxxxxxxxxxxxxxxxx>: > Hello all, > > I have a test system that won't register. I can register either the > caching dev or the cached dev fine, but as soon as I register the > sencond dev, bash hangs when echoing into /sys/fs/bcache/register . > I can register in either order (cache first or cached dev first) and > the deadlock still presents. > > I've narrowed down the problem to these two call stacks: > > > == The allocator thread is one half of the deadlock: > > [ 405.619895] INFO: task bcache_allocato:3494 blocked for more than 5 seconds. > [ 405.620897] Tainted: G W O 4.1.20+ #5 > [ 405.621732] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > [ 405.623132] bcache_allocato D ffff88007b78fc98 0 3494 2 0x00000080 > [ 405.624406] ffff88007b78fc98 ffff88007b78fc68 0000000000000002 ffff88007c8ddb80 > [ 405.626241] ffff8800796e0000 ffff88007b78fc78 ffff88007b790000 0000000000000005 > [ 405.627890] ffff880079cf0028 0000000000000001 0000000000000001 ffff88007b78fcb8 > [ 405.629730] Call Trace: > [ 405.630362] [<ffffffff817a9f47>] schedule+0x37/0x90 > [ 405.631230] [<ffffffffa048e4f0>] bch_bucket_alloc+0x1b0/0x670 [bcache] > [ 405.632261] [<ffffffff81103860>] ? prepare_to_wait_event+0x110/0x110 > [ 405.633274] [<ffffffffa04a90d5>] bch_prio_write+0x1b5/0x390 [bcache] > [ 405.634362] [<ffffffffa048e19d>] bch_allocator_thread+0x31d/0x4c0 [bcache] > [ 405.635496] [<ffffffffa048de80>] ? invalidate_buckets+0x980/0x980 [bcache] > [ 405.636655] [<ffffffff810d734e>] kthread+0xfe/0x120 > [ 405.637604] [<ffffffff817b0440>] ? _raw_spin_unlock_irq+0x30/0x50 > [ 405.638569] [<ffffffff810d7250>] ? kthread_create_on_node+0x240/0x240 > [ 405.639628] [<ffffffff817b11a2>] ret_from_fork+0x42/0x70 > [ 405.640570] [<ffffffff810d7250>] ? kthread_create_on_node+0x240/0x240 > [ 405.641568] no locks held by bcache_allocato/3494. > > There is a comment in the code inside of bch_allocator_thread() before it > calls bch_prio_write(ca): > 360 /* > 361 * This could deadlock if an allocation with a btree > 362 * node locked ever blocked - having the btree node > 363 * locked would block garbage collection, but here we're > 364 * waiting on garbage collection before we invalidate > 365 * and free anything. > 366 * > 367 * But this should be safe since the btree code always > 368 * uses btree_check_reserve() before allocating now, and > 369 * if it fails it blocks without btree nodes locked. > 370 */ > 371 if (!fifo_full(&ca->free_inc)) > 372 goto retry_invalidate; > 373 > 374 bch_prio_write(ca); > > I think I'm hitting the deadlock which this comment speaks to. > > > == This is the other side of the deadlock caused by registering the cache device: > echo /dev/sdb > /sys/fs/bcache/register > > [ 405.578073] INFO: task bash:3490 blocked for more than 5 seconds: > [ 405.580255] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > [ ... ] > [ 405.588984] [<ffffffff817a9f47>] schedule+0x37/0x90 > [ 405.589986] [<ffffffffa048e4f0>] bch_bucket_alloc+0x1b0/0x670 [bcache] > [ 405.591084] [<ffffffff81103860>] ? prepare_to_wait_event+0x110/0x110 > [ 405.592078] [<ffffffffa048eb59>] __bch_bucket_alloc_set+0x109/0x1a0 [bcache] > [ 405.593113] [<ffffffffa048ec40>] bch_bucket_alloc_set+0x50/0x70 [bcache] > [ 405.594153] [<ffffffffa04a7927>] __uuid_write+0x67/0x160 [bcache] > [ 405.595223] [<ffffffffa04a8a06>] bch_uuid_write+0x16/0x40 [bcache] > [ 405.596273] [<ffffffffa04a9877>] bch_cached_dev_attach+0x157/0x490 [bcache] > [ 405.597384] [<ffffffffa04a6b68>] ? __write_super+0x148/0x180 [bcache] > [ 405.598432] [<ffffffffa04a8986>] ? bcache_write_super+0x1d6/0x240 [bcache] > [ 405.599464] [<ffffffffa04aa761>] run_cache_set+0x601/0x910 [bcache] > [ 405.600548] [<ffffffffa04ac0ce>] register_bcache+0xeae/0x1430 [bcache] > [...] > [ 405.611623] 4 locks held by bash/3490: > [ 405.612342] #0: (sb_writers#3){.+.+.+}, at: [<ffffffff8126f773>] vfs_write+0x183/0x1b0 > [ 405.614200] #1: (&of->mutex){+.+.+.}, at: [<ffffffff812faaa6>] kernfs_fop_write+0x66/0x1a0 > [ 405.615994] #2: (s_active#194){.+.+.+}, at: [<ffffffff812faaae>] kernfs_fop_write+0x6e/0x1a0 > [ 405.617925] #3: (&bch_register_lock){+.+.+.}, at: [<ffffffffa04abe70>] register_bcache+0xc50/0x1430 [bcache] > > > I'm rather perplexed as to why this is deadlocking because > bch_bucket_alloc_set() locks c->bucket_lock, and the allocator holds > ca->set->bucket_lock before calling bch_prio_write() . I checked, > this is the same lock (same memory position). > > This implies that the allocator waits on bch_bucket_alloc_set() > which was invoked by bash through register_bcache, and register_bcache > waits on the allocator's call to bch_prio_write(). > > Things that I've tried which don't work or make the problem worse: > * Adding a mutex inside bch_bucket_alloc so that only one may proceed at a time > * Holding bch_register_lock in the allocator thread before calling bch_prio_write. > > Does anyone else have insight here that might help solve the problem? > > -Eric > > -- > Eric Wheeler > -- > To unsubscribe from this list: send the line "unsubscribe linux-bcache" in > the body of a message to majordomo@xxxxxxxxxxxxxxx > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-bcache" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html