Apply all of the attached patches to your kernel and try again. I wish somebody would apply these upstream and get it into the official kernel. I have been carrying all of these patches with me for some time and they definitely make bcache more stable. -Eric -- Eric Wheeler, President eWheeler, Inc. dba Global Linux Security 888-LINUX26 (888-546-8926) Fax: 503-716-3878 PO Box 25107 www.GlobalLinuxSecurity.pro Linux since 1996! Portland, OR 97298 On Tue, 14 Apr 2015, Nikolay Borisov wrote: > Hello list, > > > I'm currently testing bcache with the following setup: > > NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT > > sda 8:0 0 1.8T 0 disk > ??sda1 8:1 0 2M 0 part > ??sda2 8:2 0 191M 0 part > ??sda3 8:3 0 1.8T 0 part > ??main-os (dm-0) 254:0 0 1.8T 0 lvm / > sdb 8:16 0 223.1G 0 disk > ??sdb1 8:17 0 188.2M 0 part /boot > ??sdb2 8:18 0 222.9G 0 part > ??main-ssd (dm-1) 254:1 0 40G 0 lvm > ? ??bcache0 253:0 0 182G 0 disk /sdb > ??main-db (dm-2) 254:2 0 182G 0 lvm > ??bcache0 253:0 0 182G 0 disk /sdb > > So a 40gig ssd (main-ssd, lvm2 volume) backed by a 180gig hdd (main-db, lvm 2 > volume), using the writeback cache policy. Every other setting is at its > default. I'm running the 4.0-rc6 (!CONFIG_PREEMPT). After running fio (using a > 30gb file) with a mix of sequential and random i/o and I'm getting the > following RCU warn: > > ====================================================== > INFO: rcu_sched self-detected stall on CPU > 4: (2099 ticks this GP) idle=fcf/140000000000001/0 > softirq=1031582/1031582 fqs=2100 > INFO: rcu_sched detected stalls on CPUs/tasks: > 4: (2099 ticks this GP) idle=fcf/140000000000001/0 > softirq=1031582/1031582 fqs=2100 > (detected by 16, t=2104 jiffies, g=2176431, c=2176430, q=3098) > Task dump for CPU 4: > bcache_gc R running task 12728 18115 2 0x00000008 > ffff880079e85720 fffffffffffffffc ffff88046c180e20 fffffffffffffffc > ffffffff81091693 fffffffffffffffc ffff88086aa3d000 ffff88046c180000 > ffff88086aa3d000 ffff88046c180000 ffff88086aa3d060 ffff88086aa3d000 > Call Trace: > [<ffffffff81091693>] ? __wake_up+0x53/0x70 > [<ffffffffa01103d4>] ? bch_btree_gc+0x2f4/0x560 [bcache] > [<ffffffff8100180b>] ? __switch_to+0xbb/0x5f0 > [<ffffffff810911f0>] ? woken_wake_function+0x20/0x20 > [<ffffffffa0110678>] ? bch_gc_thread+0x38/0x120 [bcache] > [<ffffffffa0110640>] ? bch_btree_gc+0x560/0x560 [bcache] > [<ffffffffa0110640>] ? bch_btree_gc+0x560/0x560 [bcache] > [<ffffffff81070a9e>] ? kthread+0xce/0xf0 > [<ffffffff810709d0>] ? kthread_freezable_should_stop+0x70/0x70 > [<ffffffff815b8818>] ? ret_from_fork+0x58/0x90 > [<ffffffff810709d0>] ? kthread_freezable_should_stop+0x70/0x70 > (t=2228 jiffies g=2176431 c=2176430 q=3161) > Task dump for CPU 4: > bcache_gc R running task 12728 18115 2 0x00000008 > 0000000000000005 ffff88046fc83ca8 ffffffff8107720b 0000000000000004 > ffffffff8183d040 ffff88046fc83cc8 ffffffff810772af ffff88046fc83cc8 > ffffffff8183d100 ffff88046fc83cf8 ffffffff810a5101 ffff88046fc94500 > Call Trace: > <IRQ> [<ffffffff8107720b>] sched_show_task+0xcb/0x130 > [<ffffffff810772af>] dump_cpu_task+0x3f/0x50 > [<ffffffff810a5101>] rcu_dump_cpu_stacks+0x91/0xd0 > [<ffffffff810a68cf>] rcu_check_callbacks+0x65f/0xc30 > [<ffffffff81080ecc>] ? account_process_tick+0x6c/0x170 > [<ffffffff810acf29>] update_process_times+0x39/0x70 > [<ffffffff810beba0>] tick_sched_handle+0x40/0x50 > [<ffffffff810bedb2>] tick_sched_timer+0x52/0xa0 > [<ffffffff810afa16>] __run_hrtimer+0x86/0x1d0 > [<ffffffff810bed60>] ? tick_nohz_handler+0xc0/0xc0 > [<ffffffff810afd92>] hrtimer_interrupt+0x102/0x240 > [<ffffffffa0109920>] ? bch_ptr_invalid+0x10/0x10 [bcache] > [<ffffffff81032e79>] local_apic_timer_interrupt+0x39/0x60 > [<ffffffff815bb355>] smp_apic_timer_interrupt+0x45/0x59 > [<ffffffffa0109920>] ? bch_ptr_invalid+0x10/0x10 [bcache] > [<ffffffff815b972d>] apic_timer_interrupt+0x6d/0x80 > <EOI> [<ffffffffa01117c5>] ? __bch_extent_invalid+0xa5/0xd0 [bcache] > [<ffffffffa0111721>] ? __bch_extent_invalid+0x1/0xd0 [bcache] > [<ffffffffa0111802>] ? bch_extent_invalid+0x12/0x20 [bcache] > [<ffffffffa011183d>] bch_extent_bad+0x2d/0x1c0 [bcache] > [<ffffffffa010992a>] bch_ptr_bad+0xa/0x10 [bcache] > [<ffffffffa01098f9>] bch_btree_iter_next_filter+0x39/0x50 [bcache] > [<ffffffffa0109c80>] btree_gc_count_keys+0x50/0x70 [bcache] > [<ffffffffa010ffbf>] btree_gc_recurse+0x1bf/0x2e0 [bcache] > [<ffffffffa010c4ac>] ? btree_gc_mark_node+0xdc/0x210 [bcache] > [<ffffffff81091693>] ? __wake_up+0x53/0x70 > [<ffffffffa01103d4>] bch_btree_gc+0x2f4/0x560 [bcache] > [<ffffffff8100180b>] ? __switch_to+0xbb/0x5f0 > [<ffffffff810911f0>] ? woken_wake_function+0x20/0x20 > [<ffffffffa0110678>] bch_gc_thread+0x38/0x120 [bcache] > [<ffffffffa0110640>] ? bch_btree_gc+0x560/0x560 [bcache] > [<ffffffffa0110640>] ? bch_btree_gc+0x560/0x560 [bcache] > [<ffffffff81070a9e>] kthread+0xce/0xf0 > [<ffffffff810709d0>] ? kthread_freezable_should_stop+0x70/0x70 > [<ffffffff815b8818>] ret_from_fork+0x58/0x90 > > Naturally, checking > /sys/fs/bcache/b9bcddd1-7a9a-4f2f-88e6-cb5bef6abcf2/internal/btree_gc_max_duration_ms > shows: 31593 Clearly at some point the GC overhead becomes so large that it > causes RCU grace period stalls. I'm puzzled since bch_btree_gc_finish(...) is > not listed and this is the only function that pertains to bcache gc AND > executes code in RCU critical read section. > > In addition to that I also observed that the after this RCU stall warn occurs > the bcache_gc thread hogs the machine at 100% rendering it unusable. I managed > to get 2 call stack dumps via magic sysrq as follows: > > ============= > NMI backtrace for cpu 4 > CPU: 4 PID: 18115 Comm: bcache_gc Not tainted 4.0.0-rc6bcache1-nikbor #4 > Hardware name: Supermicro X9DRD-iF/LF/X9DRD-iF, BIOS 3.0b 12/05/2013 > task: ffff88086ab093e0 ti: ffff880868024000 task.ti: ffff880868024000 > RIP: 0010:[<ffffffffa01098cb>] [<ffffffffa01098cb>] > bch_btree_iter_next_filter+0xb/0x50 [bcache] > RSP: 0018:ffff880868027bd8 EFLAGS: 00000202 > RAX: 0000000000000001 RBX: 0000000000002034 RCX: 000000000000000a > RDX: ffffffffa0109920 RSI: ffff88086aa3dcd0 RDI: ffff880868027c08 > RBP: ffff880868027bf8 R08: 0000000000000001 R09: 0000000000000001 > R10: 000007ffffffffff R11: 0000000000000008 R12: ffff88086aa3dcd0 > R13: ffff880868027c08 R14: ffff880868027cf8 R15: ffff880868027dd8 > FS: 0000000000000000(0000) GS:ffff88046fc80000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 00007f4f6b410008 CR3: 000000000180e000 CR4: 00000000001406e0 > Stack: > 0000000000002034 ffff88086aa3dcd0 ffff880868027c08 ffff880868027cf8 > ffff880868027c78 ffffffffa0109c80 0000000000000004 0000000000000001 > ffff8807740101c0 ffff88077402a9f8 ffff88086aa3dc00 ffff88086aa3d000 > Call Trace: > [<ffffffffa0109c80>] btree_gc_count_keys+0x50/0x70 [bcache] > [<ffffffffa010ffbf>] btree_gc_recurse+0x1bf/0x2e0 [bcache] > [<ffffffffa010c4ac>] ? btree_gc_mark_node+0xdc/0x210 [bcache] > [<ffffffff81091693>] ? __wake_up+0x53/0x70 > [<ffffffffa01103d4>] bch_btree_gc+0x2f4/0x560 [bcache] > [<ffffffff8100180b>] ? __switch_to+0xbb/0x5f0 > [<ffffffff810911f0>] ? woken_wake_function+0x20/0x20 > [<ffffffffa0110678>] bch_gc_thread+0x38/0x120 [bcache] > [<ffffffffa0110640>] ? bch_btree_gc+0x560/0x560 [bcache] > [<ffffffffa0110640>] ? bch_btree_gc+0x560/0x560 [bcache] > [<ffffffff81070a9e>] kthread+0xce/0xf0 > [<ffffffff810709d0>] ? kthread_freezable_should_stop+0x70/0x70 > [<ffffffff815b8818>] ret_from_fork+0x58/0x90 > [<ffffffff810709d0>] ? kthread_freezable_should_stop+0x70/0x70 > Code: ff 48 89 d7 4c 29 cf eb a7 48 29 f2 48 89 d6 e9 18 ff ff ff 66 66 66 2e > 0f 1f 84 00 00 00 00 00 55 48 89 e5 41 56 41 55 41 54 53 <0f> 1f 44 00 00 48 > 89 fb 49 89 f4 49 89 d6 0f 1f 80 00 00 00 00 > > > =========================================== > > NMI backtrace for cpu 4 > CPU: 4 PID: 18115 Comm: bcache_gc Not tainted 4.0.0-rc6bcache1-nikbor #4 > Hardware name: Supermicro X9DRD-iF/LF/X9DRD-iF, BIOS 3.0b 12/05/2013 > task: ffff88086ab093e0 ti: ffff880868024000 task.ti: ffff880868024000 > RIP: 0010:[<ffffffffa0111916>] [<ffffffffa0111916>] > bch_extent_bad+0x106/0x1c0 [bcache] > RSP: 0018:ffff880868027ba8 EFLAGS: 00000202 > RAX: 000000000000bd2a RBX: ffff88077400a550 RCX: 000000000000000a > RDX: 0000000000000004 RSI: 00000000fc390004 RDI: ffff88046c180000 > RBP: ffff880868027bb8 R08: 0000000000000001 R09: 0000000000000000 > R10: 000007ffffffffff R11: 0000000000000008 R12: ffff88086aa3dcd0 > R13: ffff88077400a550 R14: ffffffffa0109920 R15: ffff880868027dd8 > FS: 0000000000000000(0000) GS:ffff88046fc80000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 00007f4f6b410008 CR3: 000000000180e000 CR4: 00000000001406e0 > Stack: > ffff880868027c08 ffff88086aa3dcd0 ffff880868027bc8 ffffffffa010992a > ffff880868027bf8 ffffffffa01098f9 00000000000014a6 ffff88086aa3dcd0 > ffff880868027c08 ffff880868027cf8 ffff880868027c78 ffffffffa0109c80 > Call Trace: > [<ffffffffa010992a>] bch_ptr_bad+0xa/0x10 [bcache] > [<ffffffffa01098f9>] bch_btree_iter_next_filter+0x39/0x50 [bcache] > [<ffffffffa0109c80>] btree_gc_count_keys+0x50/0x70 [bcache] > [<ffffffffa010ffbf>] btree_gc_recurse+0x1bf/0x2e0 [bcache] > [<ffffffffa010c4ac>] ? btree_gc_mark_node+0xdc/0x210 [bcache] > [<ffffffff81091693>] ? __wake_up+0x53/0x70 > [<ffffffffa01103d4>] bch_btree_gc+0x2f4/0x560 [bcache] > [<ffffffff8100180b>] ? __switch_to+0xbb/0x5f0 > [<ffffffff810911f0>] ? woken_wake_function+0x20/0x20 > [<ffffffffa0110678>] bch_gc_thread+0x38/0x120 [bcache] > [<ffffffffa0110640>] ? bch_btree_gc+0x560/0x560 [bcache] > [<ffffffffa0110640>] ? bch_btree_gc+0x560/0x560 [bcache] > [<ffffffff81070a9e>] kthread+0xce/0xf0 > [<ffffffff810709d0>] ? kthread_freezable_should_stop+0x70/0x70 > [<ffffffff815b8818>] ret_from_fork+0x58/0x90 > [<ffffffff810709d0>] ? kthread_freezable_should_stop+0x70/0x70 > Code: 33 25 ff 0f 00 00 48 8b 94 c7 40 0c 00 00 48 89 f0 48 8b 92 d8 0a 00 00 > 48 c1 e8 08 4c 21 d0 48 d3 e8 48 8d 04 40 0f b6 54 82 06 <40> 28 f2 80 fa 80 > 0f 87 7e 00 00 00 0f b6 d2 83 fa 60 76 66 31 > > > In the mean time I'm running the stable 4.0.0 where I observe better results ( > no bcache_gc thread hog but still the occasional stall warn) > > Regards, > Nikolay > -- > To unsubscribe from this list: send the line "unsubscribe linux-bcache" in > the body of a message to majordomo@xxxxxxxxxxxxxxx > More majordomo info at http://vger.kernel.org/majordomo-info.html >
From: Zheng Liu <wenqing.lz@xxxxxxxxxx> In bcache_init() function it forgot to unregister reboot notifier if bcache fails to unregister a block device. This commit fixes this. Signed-off-by: Zheng Liu <wenqing.lz@xxxxxxxxxx> Tested-by: Joshua Schmid <jschmid@xxxxxxxx> --- drivers/md/bcache/super.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c index 4dd2bb7..fdbb211 100644 --- a/drivers/md/bcache/super.c +++ b/drivers/md/bcache/super.c @@ -2100,8 +2100,10 @@ static int __init bcache_init(void) closure_debug_init(); bcache_major = register_blkdev(0, "bcache"); - if (bcache_major < 0) + if (bcache_major < 0) { + unregister_reboot_notifier(&reboot); return bcache_major; + } if (!(bcache_wq = create_workqueue("bcache")) || !(bcache_kobj = kobject_create_and_add("bcache", fs_kobj)) ||
From: Joshua Schmid <jschmid@xxxxxxxx> Subject: [PATCH] fix a leak in bch_cached_dev_run() Newsgroups: gmane.linux.kernel.bcache.devel Date: 2015-02-03 11:24:06 GMT (3 weeks, 2 days, 11 hours and 43 minutes ago) From: Al Viro <viro@xxxxxxxxxxxxxxxxxx> Signed-off-by: Al Viro <viro@xxxxxxxxxxxxxxxxxx> Tested-by: Joshua Schmid <jschmid@xxxxxxxx> --- drivers/md/bcache/super.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c index 8c2d657..53f1512 100644 --- a/drivers/md/bcache/super.c +++ b/drivers/md/bcache/super.c @@ -880,8 +880,11 @@ void bch_cached_dev_run(struct cached_dev *dc) buf[SB_LABEL_SIZE] = '\0'; env[2] = kasprintf(GFP_KERNEL, "CACHED_LABEL=%s", buf); - if (atomic_xchg(&dc->running, 1)) + if (atomic_xchg(&dc->running, 1)) { + kfree(env[1]); + kfree(env[2]); return; + } if (!d->c && BDEV_STATE(&dc->sb) != BDEV_STATE_NONE) { -- 2.1.2
From: Joshua Schmid <jschmid@xxxxxxxx> Subject: [PATCH] bcache: [BUG] clear BCACHE_DEV_UNLINK_DONE flag when attaching a backing device Newsgroups: gmane.linux.kernel.bcache.devel Date: 2015-02-03 11:18:01 GMT (3 weeks, 2 days, 11 hours and 45 minutes ago) From: Zheng Liu <wenqing.lz@xxxxxxxxxx> This bug can be reproduced by the following script: #!/bin/bash bcache_sysfs="/sys/fs/bcache" function clear_cache() { if [ ! -e $bcache_sysfs ]; then echo "no bcache sysfs" exit fi cset_uuid=$(ls -l $bcache_sysfs|head -n 2|tail -n 1|awk '{print $9}') sudo sh -c "echo $cset_uuid > /sys/block/sdb/sdb1/bcache/detach" sleep 5 sudo sh -c "echo $cset_uuid > /sys/block/sdb/sdb1/bcache/attach" } for ((i=0;i<10;i++)); do clear_cache done The warning messages look like below: [ 275.948611] ------------[ cut here ]------------ [ 275.963840] WARNING: at fs/sysfs/dir.c:512 sysfs_add_one+0xb8/0xd0() (Tainted: P W --------------- ) [ 275.979253] Hardware name: Tecal RH2285 [ 275.994106] sysfs: cannot create duplicate filename '/devices/pci0000:00/0000:00:09.0/0000:08:00.0/host4/target4:2:1/4:2:1:0/block/sdb/sdb1/bcache/cache' [ 276.024105] Modules linked in: bcache tcp_diag inet_diag ipmi_devintf ipmi_si ipmi_msghandler bonding 8021q garp stp llc ipv6 ext3 jbd loop sg iomemory_vsl(P) bnx2 microcode serio_raw i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support i7core_edac edac_core shpchp ext4 jbd2 mbcache megaraid_sas pata_acpi ata_generic ata_piix dm_mod [last unloaded: scsi_wait_scan] [ 276.072643] Pid: 2765, comm: sh Tainted: P W --------------- 2.6.32 #1 [ 276.089315] Call Trace: [ 276.105801] [<ffffffff81070fe7>] ? warn_slowpath_common+0x87/0xc0 [ 276.122650] [<ffffffff810710d6>] ? warn_slowpath_fmt+0x46/0x50 [ 276.139361] [<ffffffff81205c08>] ? sysfs_add_one+0xb8/0xd0 [ 276.156012] [<ffffffff8120609b>] ? sysfs_do_create_link+0x12b/0x170 [ 276.172682] [<ffffffff81206113>] ? sysfs_create_link+0x13/0x20 [ 276.189282] [<ffffffffa03bda21>] ? bcache_device_link+0xc1/0x110 [bcache] [ 276.205993] [<ffffffffa03bfa08>] ? bch_cached_dev_attach+0x478/0x4f0 [bcache] [ 276.222794] [<ffffffffa03c4a17>] ? bch_cached_dev_store+0x627/0x780 [bcache] [ 276.239680] [<ffffffff8116783a>] ? alloc_pages_current+0xaa/0x110 [ 276.256594] [<ffffffff81203b15>] ? sysfs_write_file+0xe5/0x170 [ 276.273364] [<ffffffff811887b8>] ? vfs_write+0xb8/0x1a0 [ 276.290133] [<ffffffff811890b1>] ? sys_write+0x51/0x90 [ 276.306368] [<ffffffff8100c072>] ? system_call_fastpath+0x16/0x1b [ 276.322301] ---[ end trace 9f5d4fcdd0c3edfb ]--- [ 276.338241] ------------[ cut here ]------------ [ 276.354109] WARNING: at /home/wenqing.lz/bcache/bcache/super.c:720 bcache_device_link+0xdf/0x110 [bcache]() (Tainted: P W --------------- ) [ 276.386017] Hardware name: Tecal RH2285 [ 276.401430] Couldn't create device <-> cache set symlinks [ 276.401759] Modules linked in: bcache tcp_diag inet_diag ipmi_devintf ipmi_si ipmi_msghandler bonding 8021q garp stp llc ipv6 ext3 jbd loop sg iomemory_vsl(P) bnx2 microcode serio_raw i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support i7core_edac edac_core shpchp ext4 jbd2 mbcache megaraid_sas pata_acpi ata_generic ata_piix dm_mod [last unloaded: scsi_wait_scan] [ 276.465477] Pid: 2765, comm: sh Tainted: P W --------------- 2.6.32 #1 [ 276.482169] Call Trace: [ 276.498610] [<ffffffff81070fe7>] ? warn_slowpath_common+0x87/0xc0 [ 276.515405] [<ffffffff810710d6>] ? warn_slowpath_fmt+0x46/0x50 [ 276.532059] [<ffffffffa03bda3f>] ? bcache_device_link+0xdf/0x110 [bcache] [ 276.548808] [<ffffffffa03bfa08>] ? bch_cached_dev_attach+0x478/0x4f0 [bcache] [ 276.565569] [<ffffffffa03c4a17>] ? bch_cached_dev_store+0x627/0x780 [bcache] [ 276.582418] [<ffffffff8116783a>] ? alloc_pages_current+0xaa/0x110 [ 276.599341] [<ffffffff81203b15>] ? sysfs_write_file+0xe5/0x170 [ 276.616142] [<ffffffff811887b8>] ? vfs_write+0xb8/0x1a0 [ 276.632607] [<ffffffff811890b1>] ? sys_write+0x51/0x90 [ 276.648671] [<ffffffff8100c072>] ? system_call_fastpath+0x16/0x1b [ 276.664756] ---[ end trace 9f5d4fcdd0c3edfc ]--- We forget to clear BCACHE_DEV_UNLINK_DONE flag in bcache_device_attach() function when we attach a backing device first time. After detaching this backing device, this flag will be true and sysfs_remove_link() isn't called in bcache_device_unlink(). Then when we attach this backing device again, sysfs_create_link() will return EEXIST error in bcache_device_link(). So the fix is trival and we clear this flag in bcache_device_link(). Signed-off-by: Zheng Liu <wenqing.lz@xxxxxxxxxx> Tested-by: Joshua Schmid <jschmid@xxxxxxxx> --- drivers/md/bcache/super.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c index 4dd2bb7..f624ae8 100644 --- a/drivers/md/bcache/super.c +++ b/drivers/md/bcache/super.c @@ -708,6 +708,8 @@ static void bcache_device_link(struct bcache_device *d, struct cache_set *c, WARN(sysfs_create_link(&d->kobj, &c->kobj, "cache") || sysfs_create_link(&c->kobj, &d->kobj, d->name), "Couldn't create device <-> cache set symlinks"); + + clear_bit(BCACHE_DEV_UNLINK_DONE, &d->flags); } static void bcache_device_detach(struct bcache_device *d) -- 2.1.2
From: Zheng Liu <gnehzuil.liu@xxxxxxxxx> To: linux-bcache@xxxxxxxxxxxxxxx Cc: Zheng Liu <wenqing.lz@xxxxxxxxxx>, Joshua Schmid <jschmid@xxxxxxxx>, Zhu Yanhai <zhu.yanhai@xxxxxxxxx>, Kent Overstreet <kmo@xxxxxxxxxxxxx> Subject: [PATCH v2] bcache: fix a livelock in btree lock Date: Wed, 25 Feb 2015 20:32:09 +0800 (02/25/2015 04:32:09 AM) From: Zheng Liu <wenqing.lz@xxxxxxxxxx> This commit tries to fix a livelock in bcache. This livelock might happen when we causes a huge number of cache misses simultaneously. When we get a cache miss, bcache will execute the following path. ->cached_dev_make_request() ->cached_dev_read() ->cached_lookup() ->bch->btree_map_keys() ->btree_root() <------------------------ ->bch_btree_map_keys_recurse() | ->cache_lookup_fn() | ->cached_dev_cache_miss() | ->bch_btree_insert_check_key() -| [If btree->seq is not equal to seq + 1, we should return EINTR and traverse btree again.] In bch_btree_insert_check_key() function we first need to check upgrade flag (op->lock == -1), and when this flag is true we need to release read btree->lock and try to take write btree->lock. During taking and releasing this write lock, btree->seq will be monotone increased in order to prevent other threads modify this in cache miss (see btree.h:74). But if there are some cache misses caused by some requested, we could meet a livelock because btree->seq is always changed by others. Thus no one can make progress. This commit will try to take write btree->lock if it encounters a race when we traverse btree. Although it sacrifice the scalability but we can ensure that only one can modify the btree. Signed-off-by: Zheng Liu <wenqing.lz@xxxxxxxxxx> Tested-by: Joshua Schmid <jschmid@xxxxxxxx> Cc: Joshua Schmid <jschmid@xxxxxxxx> Cc: Zhu Yanhai <zhu.yanhai@xxxxxxxxx> Cc: Kent Overstreet <kmo@xxxxxxxxxxxxx> --- changelog: v2: fix a bug that stops all concurrency writes unconditionally. drivers/md/bcache/btree.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c index 218f21a..43829d9 100644 --- a/drivers/md/bcache/btree.c +++ b/drivers/md/bcache/btree.c @@ -2163,8 +2163,10 @@ int bch_btree_insert_check_key(struct btree *b, struct btree_op *op, rw_lock(true, b, b->level); if (b->key.ptr[0] != btree_ptr || - b->seq != seq + 1) + b->seq != seq + 1) { + op->lock = b->level; goto out; + } } SET_KEY_PTRS(check_key, 1);
From f0e6320a7874af434575f37a11ec6e4992cef790 Mon Sep 17 00:00:00 2001 From: Kent Overstreet <kmo@xxxxxxxxxxxxx> Date: Sat, 1 Nov 2014 13:44:47 -0700 Subject: [PATCH 1/5] bcache: Add a cond_resched() call to gc Git-commit: f0e6320a7874af434575f37a11ec6e4992cef790 Patch-mainline: Submitted References: bnc#910440 Change-id: Id4f18c533b80ddb40df94ed0bb5e2a236a4bc325 Signed-off-by: Takashi Iwai <tiwai@xxxxxxx> --- drivers/md/bcache/btree.c | 1 + 1 file changed, 1 insertion(+) --- a/drivers/md/bcache/btree.c 2014-11-03 16:51:01.720000000 -0800 +++ b/drivers/md/bcache/btree.c 2014-11-03 16:51:26.456000000 -0800 @@ -1741,6 +1741,7 @@ do { ret = btree_root(gc_root, c, &op, &writes, &stats); closure_sync(&writes); + cond_resched(); if (ret && ret != -EAGAIN) pr_warn("gc failed!");