Re: bcache causes RCU stalls/bcache_gc hogs CPU

Eric Wheeler <bcache@xxxxxxxxxxxxxxxxxx> · Tue, 14 Apr 2015 13:03:46 -0700 (PDT)

Apply all of the attached patches to your kernel and try again.

I wish somebody would apply these upstream and get it into the official 
kernel. I have been carrying all of these patches with me for some time 
and they definitely make bcache more stable.

-Eric


--
Eric Wheeler, President           eWheeler, Inc. dba Global Linux Security
888-LINUX26 (888-546-8926)        Fax: 503-716-3878           PO Box 25107
www.GlobalLinuxSecurity.pro       Linux since 1996!     Portland, OR 97298

On Tue, 14 Apr 2015, Nikolay Borisov wrote:

> Hello list,
> 
> 
> I'm currently testing bcache with the following setup:
> 
> NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
> 
> sda                   8:0    0   1.8T  0 disk
> ??sda1                8:1    0     2M  0 part
> ??sda2                8:2    0   191M  0 part
> ??sda3                8:3    0   1.8T  0 part
>   ??main-os (dm-0)  254:0    0   1.8T  0 lvm  /
> sdb                   8:16   0 223.1G  0 disk
> ??sdb1                8:17   0 188.2M  0 part /boot
> ??sdb2                8:18   0 222.9G  0 part
>   ??main-ssd (dm-1) 254:1    0    40G  0 lvm
>   ? ??bcache0       253:0    0   182G  0 disk /sdb
>   ??main-db (dm-2)  254:2    0   182G  0 lvm
>     ??bcache0       253:0    0   182G  0 disk /sdb
> 
> So a 40gig ssd (main-ssd, lvm2 volume) backed by a 180gig hdd (main-db, lvm 2
> volume), using the writeback cache policy. Every other setting is at its
> default. I'm running the 4.0-rc6 (!CONFIG_PREEMPT). After running fio (using a
> 30gb file) with a mix of sequential and random i/o and I'm getting the
> following RCU warn:
> 
> ======================================================
> INFO: rcu_sched self-detected stall on CPU
>         4: (2099 ticks this GP) idle=fcf/140000000000001/0
> softirq=1031582/1031582 fqs=2100
> INFO: rcu_sched detected stalls on CPUs/tasks:
>         4: (2099 ticks this GP) idle=fcf/140000000000001/0
> softirq=1031582/1031582 fqs=2100
>         (detected by 16, t=2104 jiffies, g=2176431, c=2176430, q=3098)
> Task dump for CPU 4:
> bcache_gc	R  running task    12728 18115      2 0x00000008
>  ffff880079e85720 fffffffffffffffc ffff88046c180e20 fffffffffffffffc
>  ffffffff81091693 fffffffffffffffc ffff88086aa3d000 ffff88046c180000
>  ffff88086aa3d000 ffff88046c180000 ffff88086aa3d060 ffff88086aa3d000
> Call Trace:
>  [<ffffffff81091693>] ? __wake_up+0x53/0x70
>  [<ffffffffa01103d4>] ? bch_btree_gc+0x2f4/0x560 [bcache]
>  [<ffffffff8100180b>] ? __switch_to+0xbb/0x5f0
>  [<ffffffff810911f0>] ? woken_wake_function+0x20/0x20
>  [<ffffffffa0110678>] ? bch_gc_thread+0x38/0x120 [bcache]
>  [<ffffffffa0110640>] ? bch_btree_gc+0x560/0x560 [bcache]
>  [<ffffffffa0110640>] ? bch_btree_gc+0x560/0x560 [bcache]
>  [<ffffffff81070a9e>] ? kthread+0xce/0xf0
>  [<ffffffff810709d0>] ? kthread_freezable_should_stop+0x70/0x70
>  [<ffffffff815b8818>] ? ret_from_fork+0x58/0x90
>  [<ffffffff810709d0>] ? kthread_freezable_should_stop+0x70/0x70
>          (t=2228 jiffies g=2176431 c=2176430 q=3161)
> Task dump for CPU 4:
> bcache_gc	R  running task    12728 18115      2 0x00000008
>  0000000000000005 ffff88046fc83ca8 ffffffff8107720b 0000000000000004
>  ffffffff8183d040 ffff88046fc83cc8 ffffffff810772af ffff88046fc83cc8
>  ffffffff8183d100 ffff88046fc83cf8 ffffffff810a5101 ffff88046fc94500
> Call Trace:
>  <IRQ>  [<ffffffff8107720b>] sched_show_task+0xcb/0x130
>  [<ffffffff810772af>] dump_cpu_task+0x3f/0x50
>  [<ffffffff810a5101>] rcu_dump_cpu_stacks+0x91/0xd0
>  [<ffffffff810a68cf>] rcu_check_callbacks+0x65f/0xc30
>  [<ffffffff81080ecc>] ? account_process_tick+0x6c/0x170
>  [<ffffffff810acf29>] update_process_times+0x39/0x70
>  [<ffffffff810beba0>] tick_sched_handle+0x40/0x50
>  [<ffffffff810bedb2>] tick_sched_timer+0x52/0xa0
>  [<ffffffff810afa16>] __run_hrtimer+0x86/0x1d0
>  [<ffffffff810bed60>] ? tick_nohz_handler+0xc0/0xc0
>  [<ffffffff810afd92>] hrtimer_interrupt+0x102/0x240
>  [<ffffffffa0109920>] ? bch_ptr_invalid+0x10/0x10 [bcache]
>  [<ffffffff81032e79>] local_apic_timer_interrupt+0x39/0x60
>  [<ffffffff815bb355>] smp_apic_timer_interrupt+0x45/0x59
>  [<ffffffffa0109920>] ? bch_ptr_invalid+0x10/0x10 [bcache]
>  [<ffffffff815b972d>] apic_timer_interrupt+0x6d/0x80
>  <EOI>  [<ffffffffa01117c5>] ? __bch_extent_invalid+0xa5/0xd0 [bcache]
>  [<ffffffffa0111721>] ? __bch_extent_invalid+0x1/0xd0 [bcache]
>  [<ffffffffa0111802>] ? bch_extent_invalid+0x12/0x20 [bcache]
>  [<ffffffffa011183d>] bch_extent_bad+0x2d/0x1c0 [bcache]
>  [<ffffffffa010992a>] bch_ptr_bad+0xa/0x10 [bcache]
>  [<ffffffffa01098f9>] bch_btree_iter_next_filter+0x39/0x50 [bcache]
>  [<ffffffffa0109c80>] btree_gc_count_keys+0x50/0x70 [bcache]
>  [<ffffffffa010ffbf>] btree_gc_recurse+0x1bf/0x2e0 [bcache]
>  [<ffffffffa010c4ac>] ? btree_gc_mark_node+0xdc/0x210 [bcache]
>  [<ffffffff81091693>] ? __wake_up+0x53/0x70
>  [<ffffffffa01103d4>] bch_btree_gc+0x2f4/0x560 [bcache]
>  [<ffffffff8100180b>] ? __switch_to+0xbb/0x5f0
>  [<ffffffff810911f0>] ? woken_wake_function+0x20/0x20
>  [<ffffffffa0110678>] bch_gc_thread+0x38/0x120 [bcache]
>  [<ffffffffa0110640>] ? bch_btree_gc+0x560/0x560 [bcache]
>  [<ffffffffa0110640>] ? bch_btree_gc+0x560/0x560 [bcache]
>  [<ffffffff81070a9e>] kthread+0xce/0xf0
>  [<ffffffff810709d0>] ? kthread_freezable_should_stop+0x70/0x70
>  [<ffffffff815b8818>] ret_from_fork+0x58/0x90
> 
> Naturally, checking
> /sys/fs/bcache/b9bcddd1-7a9a-4f2f-88e6-cb5bef6abcf2/internal/btree_gc_max_duration_ms
> shows: 31593  Clearly at some point the GC overhead becomes so large that it
> causes RCU grace period stalls. I'm puzzled since bch_btree_gc_finish(...) is
> not listed and this is the only function that pertains to bcache gc AND
> executes code in RCU critical read section.
> 
> In addition to that I also observed that the after this RCU stall warn occurs
> the bcache_gc thread hogs the machine at 100% rendering it unusable. I managed
> to get 2 call stack dumps via magic sysrq as follows:
> 
> =============
> NMI backtrace for cpu 4
> CPU: 4 PID: 18115 Comm: bcache_gc Not tainted 4.0.0-rc6bcache1-nikbor #4
> Hardware name: Supermicro X9DRD-iF/LF/X9DRD-iF, BIOS 3.0b 12/05/2013
> task: ffff88086ab093e0 ti: ffff880868024000 task.ti: ffff880868024000
> RIP: 0010:[<ffffffffa01098cb>]  [<ffffffffa01098cb>]
> bch_btree_iter_next_filter+0xb/0x50 [bcache]
> RSP: 0018:ffff880868027bd8  EFLAGS: 00000202
> RAX: 0000000000000001 RBX: 0000000000002034 RCX: 000000000000000a
> RDX: ffffffffa0109920 RSI: ffff88086aa3dcd0 RDI: ffff880868027c08
> RBP: ffff880868027bf8 R08: 0000000000000001 R09: 0000000000000001
> R10: 000007ffffffffff R11: 0000000000000008 R12: ffff88086aa3dcd0
> R13: ffff880868027c08 R14: ffff880868027cf8 R15: ffff880868027dd8
> FS:  0000000000000000(0000) GS:ffff88046fc80000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f4f6b410008 CR3: 000000000180e000 CR4: 00000000001406e0
> Stack:
>  0000000000002034 ffff88086aa3dcd0 ffff880868027c08 ffff880868027cf8
>  ffff880868027c78 ffffffffa0109c80 0000000000000004 0000000000000001
>  ffff8807740101c0 ffff88077402a9f8 ffff88086aa3dc00 ffff88086aa3d000
> Call Trace:
>  [<ffffffffa0109c80>] btree_gc_count_keys+0x50/0x70 [bcache]
>  [<ffffffffa010ffbf>] btree_gc_recurse+0x1bf/0x2e0 [bcache]
>  [<ffffffffa010c4ac>] ? btree_gc_mark_node+0xdc/0x210 [bcache]
>  [<ffffffff81091693>] ? __wake_up+0x53/0x70
>  [<ffffffffa01103d4>] bch_btree_gc+0x2f4/0x560 [bcache]
>  [<ffffffff8100180b>] ? __switch_to+0xbb/0x5f0
>  [<ffffffff810911f0>] ? woken_wake_function+0x20/0x20
>  [<ffffffffa0110678>] bch_gc_thread+0x38/0x120 [bcache]
>  [<ffffffffa0110640>] ? bch_btree_gc+0x560/0x560 [bcache]
>  [<ffffffffa0110640>] ? bch_btree_gc+0x560/0x560 [bcache]
>  [<ffffffff81070a9e>] kthread+0xce/0xf0
>  [<ffffffff810709d0>] ? kthread_freezable_should_stop+0x70/0x70
>  [<ffffffff815b8818>] ret_from_fork+0x58/0x90
>  [<ffffffff810709d0>] ? kthread_freezable_should_stop+0x70/0x70
> Code: ff 48 89 d7 4c 29 cf eb a7 48 29 f2 48 89 d6 e9 18 ff ff ff 66 66 66 2e
> 0f 1f 84 00 00 00 00 00 55 48 89 e5 41 56 41 55 41 54 53 <0f> 1f 44 00 00 48
> 89 fb 49 89 f4 49 89 d6 0f 1f 80 00 00 00 00
> 
> 
> ===========================================
> 
> NMI backtrace for cpu 4
> CPU: 4 PID: 18115 Comm: bcache_gc Not tainted 4.0.0-rc6bcache1-nikbor #4
> Hardware name: Supermicro X9DRD-iF/LF/X9DRD-iF, BIOS 3.0b 12/05/2013
> task: ffff88086ab093e0 ti: ffff880868024000 task.ti: ffff880868024000
> RIP: 0010:[<ffffffffa0111916>]  [<ffffffffa0111916>]
> bch_extent_bad+0x106/0x1c0 [bcache]
> RSP: 0018:ffff880868027ba8  EFLAGS: 00000202
> RAX: 000000000000bd2a RBX: ffff88077400a550 RCX: 000000000000000a
> RDX: 0000000000000004 RSI: 00000000fc390004 RDI: ffff88046c180000
> RBP: ffff880868027bb8 R08: 0000000000000001 R09: 0000000000000000
> R10: 000007ffffffffff R11: 0000000000000008 R12: ffff88086aa3dcd0
> R13: ffff88077400a550 R14: ffffffffa0109920 R15: ffff880868027dd8
> FS:  0000000000000000(0000) GS:ffff88046fc80000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f4f6b410008 CR3: 000000000180e000 CR4: 00000000001406e0
> Stack:
>  ffff880868027c08 ffff88086aa3dcd0 ffff880868027bc8 ffffffffa010992a
>  ffff880868027bf8 ffffffffa01098f9 00000000000014a6 ffff88086aa3dcd0
>  ffff880868027c08 ffff880868027cf8 ffff880868027c78 ffffffffa0109c80
> Call Trace:
>  [<ffffffffa010992a>] bch_ptr_bad+0xa/0x10 [bcache]
>  [<ffffffffa01098f9>] bch_btree_iter_next_filter+0x39/0x50 [bcache]
>  [<ffffffffa0109c80>] btree_gc_count_keys+0x50/0x70 [bcache]
>  [<ffffffffa010ffbf>] btree_gc_recurse+0x1bf/0x2e0 [bcache]
>  [<ffffffffa010c4ac>] ? btree_gc_mark_node+0xdc/0x210 [bcache]
>  [<ffffffff81091693>] ? __wake_up+0x53/0x70
>  [<ffffffffa01103d4>] bch_btree_gc+0x2f4/0x560 [bcache]
>  [<ffffffff8100180b>] ? __switch_to+0xbb/0x5f0
>  [<ffffffff810911f0>] ? woken_wake_function+0x20/0x20
>  [<ffffffffa0110678>] bch_gc_thread+0x38/0x120 [bcache]
>  [<ffffffffa0110640>] ? bch_btree_gc+0x560/0x560 [bcache]
>  [<ffffffffa0110640>] ? bch_btree_gc+0x560/0x560 [bcache]
>  [<ffffffff81070a9e>] kthread+0xce/0xf0
>  [<ffffffff810709d0>] ? kthread_freezable_should_stop+0x70/0x70
>  [<ffffffff815b8818>] ret_from_fork+0x58/0x90
>  [<ffffffff810709d0>] ? kthread_freezable_should_stop+0x70/0x70
> Code: 33 25 ff 0f 00 00 48 8b 94 c7 40 0c 00 00 48 89 f0 48 8b 92 d8 0a 00 00
> 48 c1 e8 08 4c 21 d0 48 d3 e8 48 8d 04 40 0f b6 54 82 06 <40> 28 f2 80 fa 80
> 0f 87 7e 00 00 00 0f b6 d2 83 fa 60 76 66 31
> 
> 
> In the mean time I'm running the stable 4.0.0 where I observe better results (
> no bcache_gc thread hog but still the occasional stall warn)
> 
> Regards,
> Nikolay
> --
> To unsubscribe from this list: send the line "unsubscribe linux-bcache" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
From: Zheng Liu <wenqing.lz@xxxxxxxxxx>

In bcache_init() function it forgot to unregister reboot notifier if
bcache fails to unregister a block device.  This commit fixes this.

Signed-off-by: Zheng Liu <wenqing.lz@xxxxxxxxxx>
Tested-by: Joshua Schmid <jschmid@xxxxxxxx>
---
 drivers/md/bcache/super.c | 4 +++-
  1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 4dd2bb7..fdbb211 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -2100,8 +2100,10 @@ static int __init bcache_init(void)
 	closure_debug_init();
 
 	bcache_major = register_blkdev(0, "bcache");
-	if (bcache_major < 0)
+	if (bcache_major < 0) {
+		unregister_reboot_notifier(&reboot);
 		return bcache_major;
+	}
 
 	if (!(bcache_wq = create_workqueue("bcache")) ||
 	    !(bcache_kobj = kobject_create_and_add("bcache", fs_kobj)) ||
From: Joshua Schmid <jschmid@xxxxxxxx>
Subject: [PATCH] fix a leak in bch_cached_dev_run()
Newsgroups: gmane.linux.kernel.bcache.devel
Date: 2015-02-03 11:24:06 GMT (3 weeks, 2 days, 11 hours and 43 minutes ago)

From: Al Viro <viro@xxxxxxxxxxxxxxxxxx>

Signed-off-by: Al Viro <viro@xxxxxxxxxxxxxxxxxx>
Tested-by: Joshua Schmid <jschmid@xxxxxxxx>
---
 drivers/md/bcache/super.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 8c2d657..53f1512 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -880,8 +880,11 @@ void bch_cached_dev_run(struct cached_dev *dc)
 	buf[SB_LABEL_SIZE] = '\0';
 	env[2] = kasprintf(GFP_KERNEL, "CACHED_LABEL=%s", buf);

-	if (atomic_xchg(&dc->running, 1))
+	if (atomic_xchg(&dc->running, 1)) {
+		kfree(env[1]);
+		kfree(env[2]);
 		return;
+	}

 	if (!d->c &&
 	    BDEV_STATE(&dc->sb) != BDEV_STATE_NONE) {
-- 
2.1.2

From: Joshua Schmid <jschmid@xxxxxxxx>
Subject: [PATCH] bcache: [BUG] clear BCACHE_DEV_UNLINK_DONE flag when attaching a backing device
Newsgroups: gmane.linux.kernel.bcache.devel
Date: 2015-02-03 11:18:01 GMT (3 weeks, 2 days, 11 hours and 45 minutes ago)

From: Zheng Liu <wenqing.lz@xxxxxxxxxx>

This bug can be reproduced by the following script:

  #!/bin/bash

  bcache_sysfs="/sys/fs/bcache"

  function clear_cache()
  {
  	if [ ! -e $bcache_sysfs ]; then
  		echo "no bcache sysfs"
  		exit
  	fi

  	cset_uuid=$(ls -l $bcache_sysfs|head -n 2|tail -n 1|awk '{print $9}')
  	sudo sh -c "echo $cset_uuid > /sys/block/sdb/sdb1/bcache/detach"
  	sleep 5
  	sudo sh -c "echo $cset_uuid > /sys/block/sdb/sdb1/bcache/attach"
  }

  for ((i=0;i<10;i++)); do
  	clear_cache
  done

The warning messages look like below:
[  275.948611] ------------[ cut here ]------------
[  275.963840] WARNING: at fs/sysfs/dir.c:512 sysfs_add_one+0xb8/0xd0() (Tainted: P        W 
---------------   )
[  275.979253] Hardware name: Tecal RH2285
[  275.994106] sysfs: cannot create duplicate filename '/devices/pci0000:00/0000:00:09.0/0000:08:00.0/host4/target4:2:1/4:2:1:0/block/sdb/sdb1/bcache/cache'
[  276.024105] Modules linked in: bcache tcp_diag inet_diag ipmi_devintf ipmi_si ipmi_msghandler
bonding 8021q garp stp llc ipv6 ext3 jbd loop sg iomemory_vsl(P) bnx2 microcode serio_raw i2c_i801
i2c_core iTCO_wdt iTCO_vendor_support i7core_edac edac_core shpchp ext4 jbd2 mbcache megaraid_sas
pata_acpi ata_generic ata_piix dm_mod [last unloaded: scsi_wait_scan]
[  276.072643] Pid: 2765, comm: sh Tainted: P        W  ---------------    2.6.32 #1
[  276.089315] Call Trace:
[  276.105801]  [<ffffffff81070fe7>] ? warn_slowpath_common+0x87/0xc0
[  276.122650]  [<ffffffff810710d6>] ? warn_slowpath_fmt+0x46/0x50
[  276.139361]  [<ffffffff81205c08>] ? sysfs_add_one+0xb8/0xd0
[  276.156012]  [<ffffffff8120609b>] ? sysfs_do_create_link+0x12b/0x170
[  276.172682]  [<ffffffff81206113>] ? sysfs_create_link+0x13/0x20
[  276.189282]  [<ffffffffa03bda21>] ? bcache_device_link+0xc1/0x110 [bcache]
[  276.205993]  [<ffffffffa03bfa08>] ? bch_cached_dev_attach+0x478/0x4f0 [bcache]
[  276.222794]  [<ffffffffa03c4a17>] ? bch_cached_dev_store+0x627/0x780 [bcache]
[  276.239680]  [<ffffffff8116783a>] ? alloc_pages_current+0xaa/0x110
[  276.256594]  [<ffffffff81203b15>] ? sysfs_write_file+0xe5/0x170
[  276.273364]  [<ffffffff811887b8>] ? vfs_write+0xb8/0x1a0
[  276.290133]  [<ffffffff811890b1>] ? sys_write+0x51/0x90
[  276.306368]  [<ffffffff8100c072>] ? system_call_fastpath+0x16/0x1b
[  276.322301] ---[ end trace 9f5d4fcdd0c3edfb ]---
[  276.338241] ------------[ cut here ]------------
[  276.354109] WARNING: at /home/wenqing.lz/bcache/bcache/super.c:720
bcache_device_link+0xdf/0x110 [bcache]() (Tainted: P        W  ---------------   )
[  276.386017] Hardware name: Tecal RH2285
[  276.401430] Couldn't create device <-> cache set symlinks
[  276.401759] Modules linked in: bcache tcp_diag inet_diag ipmi_devintf ipmi_si ipmi_msghandler
bonding 8021q garp stp llc ipv6 ext3 jbd loop sg iomemory_vsl(P) bnx2 microcode serio_raw i2c_i801
i2c_core iTCO_wdt iTCO_vendor_support i7core_edac edac_core shpchp ext4 jbd2 mbcache megaraid_sas
pata_acpi ata_generic ata_piix dm_mod [last unloaded: scsi_wait_scan]
[  276.465477] Pid: 2765, comm: sh Tainted: P        W  ---------------    2.6.32 #1
[  276.482169] Call Trace:
[  276.498610]  [<ffffffff81070fe7>] ? warn_slowpath_common+0x87/0xc0
[  276.515405]  [<ffffffff810710d6>] ? warn_slowpath_fmt+0x46/0x50
[  276.532059]  [<ffffffffa03bda3f>] ? bcache_device_link+0xdf/0x110 [bcache]
[  276.548808]  [<ffffffffa03bfa08>] ? bch_cached_dev_attach+0x478/0x4f0 [bcache]
[  276.565569]  [<ffffffffa03c4a17>] ? bch_cached_dev_store+0x627/0x780 [bcache]
[  276.582418]  [<ffffffff8116783a>] ? alloc_pages_current+0xaa/0x110
[  276.599341]  [<ffffffff81203b15>] ? sysfs_write_file+0xe5/0x170
[  276.616142]  [<ffffffff811887b8>] ? vfs_write+0xb8/0x1a0
[  276.632607]  [<ffffffff811890b1>] ? sys_write+0x51/0x90
[  276.648671]  [<ffffffff8100c072>] ? system_call_fastpath+0x16/0x1b
[  276.664756] ---[ end trace 9f5d4fcdd0c3edfc ]---

We forget to clear BCACHE_DEV_UNLINK_DONE flag in bcache_device_attach()
function when we attach a backing device first time.  After detaching this
backing device, this flag will be true and sysfs_remove_link() isn't called in
bcache_device_unlink().  Then when we attach this backing device again,
sysfs_create_link() will return EEXIST error in bcache_device_link().

So the fix is trival and we clear this flag in bcache_device_link().

Signed-off-by: Zheng Liu <wenqing.lz@xxxxxxxxxx>
Tested-by: Joshua Schmid <jschmid@xxxxxxxx>
---
 drivers/md/bcache/super.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 4dd2bb7..f624ae8 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -708,6 +708,8 @@ static void bcache_device_link(struct bcache_device *d, struct cache_set *c,
 	WARN(sysfs_create_link(&d->kobj, &c->kobj, "cache") ||
 	     sysfs_create_link(&c->kobj, &d->kobj, d->name),
 	     "Couldn't create device <-> cache set symlinks");
+
+	clear_bit(BCACHE_DEV_UNLINK_DONE, &d->flags);
 }

 static void bcache_device_detach(struct bcache_device *d)
-- 
2.1.2

	
From:	Zheng Liu <gnehzuil.liu@xxxxxxxxx>
To:	linux-bcache@xxxxxxxxxxxxxxx
Cc:	Zheng Liu <wenqing.lz@xxxxxxxxxx>, Joshua Schmid <jschmid@xxxxxxxx>, Zhu Yanhai <zhu.yanhai@xxxxxxxxx>, Kent Overstreet <kmo@xxxxxxxxxxxxx>
Subject:	[PATCH v2] bcache: fix a livelock in btree lock
Date:	Wed, 25 Feb 2015 20:32:09 +0800 (02/25/2015 04:32:09 AM)
From: Zheng Liu <wenqing.lz@xxxxxxxxxx>

This commit tries to fix a livelock in bcache.  This livelock might
happen when we causes a huge number of cache misses simultaneously.

When we get a cache miss, bcache will execute the following path.

->cached_dev_make_request()
  ->cached_dev_read()
    ->cached_lookup()
      ->bch->btree_map_keys()
        ->btree_root()  <------------------------
          ->bch_btree_map_keys_recurse()        |
            ->cache_lookup_fn()                 |
              ->cached_dev_cache_miss()         |
                ->bch_btree_insert_check_key() -|
                  [If btree->seq is not equal to seq + 1, we should return
                   EINTR and traverse btree again.]

In bch_btree_insert_check_key() function we first need to check upgrade
flag (op->lock == -1), and when this flag is true we need to release
read btree->lock and try to take write btree->lock.  During taking and
releasing this write lock, btree->seq will be monotone increased in
order to prevent other threads modify this in cache miss (see btree.h:74).
But if there are some cache misses caused by some requested, we could
meet a livelock because btree->seq is always changed by others.  Thus no
one can make progress.

This commit will try to take write btree->lock if it encounters a race
when we traverse btree.  Although it sacrifice the scalability but we
can ensure that only one can modify the btree.

Signed-off-by: Zheng Liu <wenqing.lz@xxxxxxxxxx>
Tested-by: Joshua Schmid <jschmid@xxxxxxxx>
Cc: Joshua Schmid <jschmid@xxxxxxxx>
Cc: Zhu Yanhai <zhu.yanhai@xxxxxxxxx>
Cc: Kent Overstreet <kmo@xxxxxxxxxxxxx>
---
changelog:
v2: fix a bug that stops all concurrency writes unconditionally.

 drivers/md/bcache/btree.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 218f21a..43829d9 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -2163,8 +2163,10 @@ int bch_btree_insert_check_key(struct btree *b, struct btree_op *op,
 		rw_lock(true, b, b->level);
 
 		if (b->key.ptr[0] != btree_ptr ||
-		    b->seq != seq + 1)
+                   b->seq != seq + 1) {
+                       op->lock = b->level;
 			goto out;
+               }
 	}
 
 	SET_KEY_PTRS(check_key, 1);
From f0e6320a7874af434575f37a11ec6e4992cef790 Mon Sep 17 00:00:00 2001
From: Kent Overstreet <kmo@xxxxxxxxxxxxx>
Date: Sat, 1 Nov 2014 13:44:47 -0700
Subject: [PATCH 1/5] bcache: Add a cond_resched() call to gc
Git-commit: f0e6320a7874af434575f37a11ec6e4992cef790
Patch-mainline: Submitted
References: bnc#910440

Change-id: Id4f18c533b80ddb40df94ed0bb5e2a236a4bc325
Signed-off-by: Takashi Iwai <tiwai@xxxxxxx>

---
 drivers/md/bcache/btree.c | 1 +
  1 file changed, 1 insertion(+)

--- a/drivers/md/bcache/btree.c	2014-11-03 16:51:01.720000000 -0800
+++ b/drivers/md/bcache/btree.c	2014-11-03 16:51:26.456000000 -0800
@@ -1741,6 +1741,7 @@
 	do {
 		ret = btree_root(gc_root, c, &op, &writes, &stats);
 		closure_sync(&writes);
+		cond_resched();
 
 		if (ret && ret != -EAGAIN)
 			pr_warn("gc failed!");