[ changed subject, more below ]

On Fri, 29 Jan 2016, Johannes Thumshirn wrote:
> [ +cc Kent ]
>
> On Wed, Jan 27, 2016 at 02:57:25PM +0000, Yannis Aribaud wrote:
> > Hi,
> >
> > After several weeks using the 4.2.6 kernel + patches from Ewheeler we just ran into a crash again.
> > This time the kernel was still running and the server was responsive but not able to do any IO on the bcache devices.
> >
> > [696983.683498] bcache_writebac D ffffffff810643df 0 5741 2 0x00000000
> > [696983.683505] ffff88103d01f180 0000000000000046 ffff88107842d000 ffffffff811a95cd
> > [696983.683510] 0000000000000000 ffff8810388c4000 ffff88103d01f180 0000000000000001
> > [696983.683514] ffff882034ae0c10 0000000000000000 ffff882034ae0000 ffffffff8139601e
> > [696983.683518] Call Trace:
> > [696983.683530] [<ffffffff811a95cd>] ? blk_queue_bio+0x262/0x279
> > [696983.683539] [<ffffffff8139601e>] ? schedule+0x6b/0x78
> > [696983.683553] [<ffffffffa032ce9b>] ? closure_sync+0x66/0x91 [bcache]
> > [696983.683563] [<ffffffffa033c89f>] ? bch_writeback_thread+0x622/0x6b5 [bcache]
> > [696983.683569] [<ffffffff8100265c>] ? __switch_to+0x1de/0x3f7
> > [696983.683578] [<ffffffffa033c89f>] ? bch_writeback_thread+0x622/0x6b5 [bcache]
> > [696983.683586] [<ffffffffa033c27d>] ? write_dirty_finish+0x1bf/0x1bf [bcache]
> > [696983.683594] [<ffffffff810589d6>] ? kthread+0x99/0xa1
> > [696983.683598] [<ffffffff8105893d>] ? kthread_parkme+0x16/0x16
> > [696983.683603] [<ffffffff813986df>] ? ret_from_fork+0x3f/0x70
> > [696983.683607] [<ffffffff8105893d>] ? kthread_parkme+0x16/0x16
> >
> > Don't know if this help.
> > Unfortunately I thing that we will rollback and stop using Bcache unless this is really fixed :/
> >
>
> Hi Yannis,
>
> Do you have a machine with a bcache setup running where you can reproduce the
> error? Or do you know a method to reproduce the error?
>
> What I'd be interested in is which locks are held when it locks up (you can
> acquire this information with SysRq+d or echo d > /proc/sysrq-trigger.
>
> Kent, do you have an idea what's happening here?

Are you using md-based raid5/raid6? If so, it could be the
raid_partial_stripes_expensive bug.

If it *is* the bcache optimizations around partial_stripes_expensive, then
please try the patch below and do this *before* loading the bcache module
(or at least before registering the bcache backing volume):

	echo 0 > /sys/block/BLK/queue/limits/raid_partial_stripes_expensive

where BLK is your backing volume.

I wrote this patch because we use hardware RAID5/6 and wanted to turn the
partial_stripes_expensive optimizations on by setting
raid_partial_stripes_expensive=1 and io_opt=our_stride_width.
Unfortunately it caused the backtrace in the LKML thread below, so we
stopped using it. See also this thread, though it shows a backtrace from
before the bio splitting code was removed:

	https://lkml.org/lkml/2016/1/7/844

Note that commit 749b61dab30736eb95b1ee23738cae90973d4fc3 might not exactly
address this issue, but it might prevent full hangs. Make sure you
cherry-pick that commit and hand-clean-up as necessary; it simplifies the
bio code and deletes a bunch of stuff. It showed up in 4.4 and isn't in my
patchset.

After cherry-picking the commit above and setting
raid_partial_stripes_expensive=1 we still get errors like this:

	bcache: bch_count_io_errors() dm-3: IO error on writing data to cache, recovering

but they don't lock up the system.
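For concreteness, the order of operations we use looks roughly like this
(just a sketch: sdX stands in for your backing volume, and the
queue/limits/ directory only exists once the patch below is applied):

	# disable the partial-stripe optimization on the backing device
	echo 0 > /sys/block/sdX/queue/limits/raid_partial_stripes_expensive

	# confirm it took effect
	cat /sys/block/sdX/queue/limits/raid_partial_stripes_expensive

	# then register the backing device with bcache as usual
	echo /dev/sdX > /sys/fs/bcache/register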
Ultimately we run with raid_partial_stripes_expensive=0 because of these
related problems and haven't had any issues. Also, fwiw, we run 4.1.y as
our stable branch, backed by hardware RAID.

See my queue_limits patch below, which exposes <struct
request_queue>::limits via sysfs. If anyone does have a fix for this, I
would love to see it, because then we could use the bcache
raid_partial_stripes_expensive/io_opt optimizations!

-Eric

diff --git a/block/blk-core.c b/block/blk-core.c
index 03b5f8d..54006aa 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -642,6 +642,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 	INIT_DELAYED_WORK(&q->delay_work, blk_delay_work);
 
 	kobject_init(&q->kobj, &blk_queue_ktype);
+	kobject_init(&q->ql_kobj, &blk_queue_limits_ktype);
 
 	mutex_init(&q->sysfs_lock);
 	spin_lock_init(&q->__queue_lock);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 2b8fd30..9c29b44 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -430,6 +430,119 @@ static struct attribute *default_attrs[] = {
 	NULL,
 };
 
+/* First we define a show/store function generator
+ * for `struct queue_limits` */
+#define QL_STORE_FUNC_NAME(name) queue_limits_store_##name
+#define QL_STORE_FUNC(name, type)					\
+	static ssize_t							\
+	QL_STORE_FUNC_NAME(name) (struct request_queue *q, const char *page, size_t count)\
+	{								\
+		unsigned long val;					\
+		ssize_t ret = queue_var_store(&val, page, count);	\
+		if (ret < 0)						\
+			return ret;					\
+		q->limits.name = (type) val;				\
+		return ret;						\
+	}
+
+#define QL_SHOW_FUNC_NAME(name) queue_limits_show_##name
+#define QL_SHOW_FUNC(name, type)					\
+	static ssize_t							\
+	QL_SHOW_FUNC_NAME(name) (struct request_queue *q, char *page)	\
+	{								\
+		return queue_var_show((unsigned long)q->limits.name, (page)); \
+	}
+
+
+/* Next, the struct generator that defines the
+ * structs for the queue_limits sysfs interface */
+#define QL_SYSFS_STRUCT_NAME(name) queue_limits_struct_##name
+#define QL_SYSFS_STRUCT(__name, __mode, __show, __store)		\
+	static struct queue_sysfs_entry QL_SYSFS_STRUCT_NAME(__name) = {\
+		.attr = {.name = #__name, .mode = __mode },		\
+		.show = __show,						\
+		.store = __store,					\
+	};
+
+/* Define common read-only/read-write with functions and mode flags: */
+#define QL_SYSFS_RO(name, type)						\
+	QL_SHOW_FUNC(name, type);					\
+	QL_SYSFS_STRUCT(name, S_IRUGO, QL_SHOW_FUNC_NAME(name), NULL)
+
+#define QL_SYSFS_RW(name, type)						\
+	QL_SHOW_FUNC(name, type);					\
+	QL_STORE_FUNC(name, type);					\
+	QL_SYSFS_STRUCT(name, S_IRUGO | S_IWUSR,			\
+		QL_SHOW_FUNC_NAME(name), QL_STORE_FUNC_NAME(name));
+
+
+QL_SYSFS_RO(bounce_pfn, unsigned long);
+QL_SYSFS_RO(seg_boundary_mask, unsigned long);
+
+QL_SYSFS_RO(max_hw_sectors, unsigned int);
+QL_SYSFS_RO(chunk_sectors, unsigned int);
+QL_SYSFS_RO(max_sectors, unsigned int);
+QL_SYSFS_RO(max_segment_size, unsigned int);
+QL_SYSFS_RO(physical_block_size, unsigned int);
+QL_SYSFS_RO(alignment_offset, unsigned int);
+QL_SYSFS_RO(io_min, unsigned int);
+QL_SYSFS_RW(io_opt, unsigned int);
+QL_SYSFS_RO(max_discard_sectors, unsigned int);
+QL_SYSFS_RO(max_write_same_sectors, unsigned int);
+QL_SYSFS_RO(discard_granularity, unsigned int);
+QL_SYSFS_RO(discard_alignment, unsigned int);
+
+QL_SYSFS_RO(logical_block_size, unsigned short);
+QL_SYSFS_RO(max_segments, unsigned short);
+QL_SYSFS_RO(max_integrity_segments, unsigned short);
+
+QL_SYSFS_RO(misaligned, unsigned char);
+QL_SYSFS_RO(discard_misaligned, unsigned char);
+QL_SYSFS_RO(cluster, unsigned char);
+QL_SYSFS_RO(discard_zeroes_data, unsigned char);
+QL_SYSFS_RW(raid_partial_stripes_expensive, unsigned char);
+
+
+/* sysfs attribute list for queue_limits */
+static struct attribute *queue_limits_attrs[] = {
+	&QL_SYSFS_STRUCT_NAME(bounce_pfn).attr,
+	&QL_SYSFS_STRUCT_NAME(seg_boundary_mask).attr,
+
+	&QL_SYSFS_STRUCT_NAME(max_hw_sectors).attr,
+	&QL_SYSFS_STRUCT_NAME(chunk_sectors).attr,
+	&QL_SYSFS_STRUCT_NAME(max_sectors).attr,
+	&QL_SYSFS_STRUCT_NAME(max_segment_size).attr,
+	&QL_SYSFS_STRUCT_NAME(physical_block_size).attr,
+	&QL_SYSFS_STRUCT_NAME(alignment_offset).attr,
+	&QL_SYSFS_STRUCT_NAME(io_min).attr,
+	&QL_SYSFS_STRUCT_NAME(io_opt).attr,
+	&QL_SYSFS_STRUCT_NAME(max_discard_sectors).attr,
+	&QL_SYSFS_STRUCT_NAME(max_write_same_sectors).attr,
+	&QL_SYSFS_STRUCT_NAME(discard_granularity).attr,
+	&QL_SYSFS_STRUCT_NAME(discard_alignment).attr,
+
+	&QL_SYSFS_STRUCT_NAME(logical_block_size).attr,
+	&QL_SYSFS_STRUCT_NAME(max_segments).attr,
+	&QL_SYSFS_STRUCT_NAME(max_integrity_segments).attr,
+
+	&QL_SYSFS_STRUCT_NAME(misaligned).attr,
+	&QL_SYSFS_STRUCT_NAME(discard_misaligned).attr,
+	&QL_SYSFS_STRUCT_NAME(cluster).attr,
+	&QL_SYSFS_STRUCT_NAME(discard_zeroes_data).attr,
+	&QL_SYSFS_STRUCT_NAME(raid_partial_stripes_expensive).attr,
+
+	NULL
+};
+/* cleanup */
+#undef QL_SYSFS_RW
+#undef QL_SYSFS_RO
+#undef QL_SYSFS_STRUCT
+#undef QL_SYSFS_STRUCT_NAME
+#undef QL_SHOW_FUNC
+#undef QL_SHOW_FUNC_NAME
+#undef QL_STORE_FUNC
+#undef QL_STORE_FUNC_NAME
+
 #define to_queue(atr) container_of((atr), struct queue_sysfs_entry, attr)
 
 static ssize_t
@@ -474,6 +587,48 @@ queue_attr_store(struct kobject *kobj, struct attribute *attr,
 	return res;
 }
 
+static ssize_t
+queue_limits_attr_show(struct kobject *kobj, struct attribute *attr, char *page)
+{
+	struct queue_sysfs_entry *entry = to_queue(attr);
+	struct request_queue *q =
+		container_of(kobj, struct request_queue, ql_kobj);
+	ssize_t res;
+
+	if (!entry->show)
+		return -EIO;
+	mutex_lock(&q->sysfs_lock);
+	if (blk_queue_dying(q)) {
+		mutex_unlock(&q->sysfs_lock);
+		return -ENOENT;
+	}
+	res = entry->show(q, page);
+	mutex_unlock(&q->sysfs_lock);
+	return res;
+}
+
+static ssize_t
+queue_limits_attr_store(struct kobject *kobj, struct attribute *attr,
+			const char *page, size_t length)
+{
+	struct queue_sysfs_entry *entry = to_queue(attr);
+	struct request_queue *q;
+	ssize_t res;
+
+	if (!entry->store)
+		return -EIO;
+
+	q = container_of(kobj, struct request_queue, ql_kobj);
+	mutex_lock(&q->sysfs_lock);
+	if (blk_queue_dying(q)) {
+		mutex_unlock(&q->sysfs_lock);
+		return -ENOENT;
+	}
+	res = entry->store(q, page, length);
+	mutex_unlock(&q->sysfs_lock);
+	return res;
+}
+
 static void blk_free_queue_rcu(struct rcu_head *rcu_head)
 {
 	struct request_queue *q = container_of(rcu_head, struct request_queue,
@@ -531,6 +686,17 @@ static const struct sysfs_ops queue_sysfs_ops = {
 	.store = queue_attr_store,
 };
 
+static const struct sysfs_ops queue_limits_sysfs_ops = {
+	.show = queue_limits_attr_show,
+	.store = queue_limits_attr_store,
+};
+
+struct kobj_type blk_queue_limits_ktype = {
+	.sysfs_ops = &queue_limits_sysfs_ops,
+	.default_attrs = queue_limits_attrs,
+	.release = blk_release_queue,
+};
+
 struct kobj_type blk_queue_ktype = {
 	.sysfs_ops = &queue_sysfs_ops,
 	.default_attrs = default_attrs,
@@ -573,6 +739,12 @@ int blk_register_queue(struct gendisk *disk)
 	}
 
 	kobject_uevent(&q->kobj, KOBJ_ADD);
+
+	ret = kobject_add(&q->ql_kobj, kobject_get(&q->kobj), "%s", "limits");
+	if (ret < 0)
+		return ret;
+
+	kobject_uevent(&q->ql_kobj, KOBJ_ADD);
 
 	if (q->mq_ops)
 		blk_mq_register_disk(disk);
@@ -605,6 +777,10 @@ void blk_unregister_queue(struct gendisk *disk)
 	if (q->request_fn)
 		elv_unregister_queue(q);
 
+	kobject_uevent(&q->ql_kobj, KOBJ_REMOVE);
+	kobject_del(&q->ql_kobj);
+	kobject_put(&q->kobj);
+
 	kobject_uevent(&q->kobj, KOBJ_REMOVE);
 	kobject_del(&q->kobj);
 	blk_trace_remove_sysfs(disk_to_dev(disk));
diff --git a/block/blk.h b/block/blk.h
index 43b0361..1342e9a 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -28,6 +28,7 @@ struct blk_flush_queue {
 extern struct kmem_cache *blk_requestq_cachep;
 extern struct kmem_cache *request_cachep;
 extern struct kobj_type blk_queue_ktype;
+extern struct kobj_type blk_queue_limits_ktype;
 extern struct ida blk_queue_ida;
 
 static inline struct blk_flush_queue *blk_get_flush_queue(
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 5d93a66..b58fbf3 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -401,6 +401,12 @@ struct request_queue {
 	 * mq queue kobject
 	 */
 	struct kobject mq_kobj;
+
+	/*
+	 * queue_limits kobject
+	 */
+	struct kobject ql_kobj;
+
 #ifdef CONFIG_PM
 	struct device *dev;
--
To unsubscribe from this list: send the line "unsubscribe linux-bcache" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html