On 9/7/24 16:58, Ming Lei wrote:
> On Sat, Sep 07, 2024 at 08:35:22AM +0100, Richard W.M. Jones wrote:
>> On Sat, Sep 07, 2024 at 09:43:31AM +0800, Ming Lei wrote:
>>> When switching io scheduler via sysfs, 'request_module' may be called
>>> if the specified scheduler doesn't exist.
>>>
>>> This was has deadlock risk because the module may be stored on FS behind
>>> our disk since request queue is frozen before switching its elevator.
>>>
>>> Fix it by returning -EDEADLK in case that the disk is claimed, which
>>> can be thought as one signal that the disk is mounted.
>>>
>>> Some distributions(Fedora) simulates the original kernel command line of
>>> 'elevator=foo' via 'echo foo > /sys/block/$DISK/queue/scheduler', and boot
>>> hang is triggered.
>>>
>>> Cc: Richard Jones <rjones@xxxxxxxxxx>
>>> Cc: Jeff Moyer <jmoyer@xxxxxxxxxx>
>>> Cc: Jiri Jaburek <jjaburek@xxxxxxxxxx>
>>> Signed-off-by: Ming Lei <ming.lei@xxxxxxxxxx>
>>
>> I'd suggest also:
>>
>> Bug: https://bugzilla.kernel.org/show_bug.cgi?id=219166
>> Reported-by: Richard W.M. Jones <rjones@xxxxxxxxxx>
>> Reported-by: Jiri Jaburek <jjaburek@xxxxxxxxxx>
>> Tested-by: Richard W.M. Jones <rjones@xxxxxxxxxx>
>>
>> So I have tested this patch and it does fix the issue, at the possible
>> cost that now setting the scheduler can fail:
>>
>> + for f in /sys/block/{h,s,ub,v}d*/queue/scheduler
>> + echo noop
>> /init: line 109: echo: write error: Resource deadlock avoided
>>
>> (I know I'm setting it to an impossible value here, but this could
>> also happen when setting it to a valid one.)
>
> Actually in most of dist, io-schedulers are built-in, so request_module
> is just a nop, but meta IO must be started.
>
>>
>> Since almost no one checks the result of 'echo foo > /sys/...' that
>> would probably mean that sometimes a desired setting is silently not
>> set.
>
> As I mentioned, io-schedulers are built-in for most of dist, so
> request_module isn't called in case of one valid io-sched.
>
>>
>> Also I bisected this bug yesterday and found it was caused by (or,
>> more likely, exposed by):
>>
>> commit af2814149883e2c1851866ea2afcd8eadc040f79
>> Author: Christoph Hellwig <hch@xxxxxx>
>> Date: Mon Jun 17 08:04:38 2024 +0200
>>
>> block: freeze the queue in queue_attr_store
>>
>> queue_attr_store updates attributes used to control generating I/O, and
>> can cause malformed bios if changed with I/O in flight. Freeze the queue
>> in common code instead of adding it to almost every attribute.
>>
>> Reverting this commit on top of git head also fixes the problem.
>>
>> Why did this commit expose the problem?
>
> That is really the 1st bad commit which moves queue freezing before
> calling request_module(), originally we won't freeze queue until
> we have to do it.
>
> Another candidate fix is to revert it, or at least not do it
> for storing elevator attribute.

I do not think that reverting is acceptable. Rather, a proper fix would simply
be to call request_module() before freezing the queue. Something like the patch
below should work (totally untested, and it may be overkill).

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 60116d13cb80..aef87f6b4a8a 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -23,6 +23,7 @@
 struct queue_sysfs_entry {
 	struct attribute attr;
 	ssize_t (*show)(struct gendisk *disk, char *page);
+	int (*pre_store)(struct gendisk *disk, const char *page, size_t count);
 	ssize_t (*store)(struct gendisk *disk, const char *page, size_t count);
 };
 
@@ -413,6 +414,14 @@ static struct queue_sysfs_entry _prefix##_entry = {	\
 	.store	= _prefix##_store,			\
 };
 
+#define QUEUE_RPW_ENTRY(_prefix, _name)			\
+static struct queue_sysfs_entry _prefix##_entry = {	\
+	.attr	= { .name = _name, .mode = 0644 },	\
+	.show	= _prefix##_show,			\
+	.pre_store = _prefix##_pre_store,		\
+	.store	= _prefix##_store,			\
+};
+
 QUEUE_RW_ENTRY(queue_requests, "nr_requests");
 QUEUE_RW_ENTRY(queue_ra, "read_ahead_kb");
 QUEUE_RW_ENTRY(queue_max_sectors, "max_sectors_kb");
@@ -420,7 +429,7 @@ QUEUE_RO_ENTRY(queue_max_hw_sectors, "max_hw_sectors_kb");
 QUEUE_RO_ENTRY(queue_max_segments, "max_segments");
 QUEUE_RO_ENTRY(queue_max_integrity_segments, "max_integrity_segments");
 QUEUE_RO_ENTRY(queue_max_segment_size, "max_segment_size");
-QUEUE_RW_ENTRY(elv_iosched, "scheduler");
+QUEUE_RPW_ENTRY(elv_iosched, "scheduler");
 
 QUEUE_RO_ENTRY(queue_logical_block_size, "logical_block_size");
 QUEUE_RO_ENTRY(queue_physical_block_size, "physical_block_size");
@@ -670,6 +679,12 @@ queue_attr_store(struct kobject *kobj, struct attribute *attr,
 	if (!entry->store)
 		return -EIO;
 
+	if (entry->pre_store) {
+		res = entry->pre_store(disk, page, length);
+		if (res)
+			return res;
+	}
+
 	blk_mq_freeze_queue(q);
 	mutex_lock(&q->sysfs_lock);
 	res = entry->store(disk, page, length);
diff --git a/block/elevator.c b/block/elevator.c
index f13d552a32c8..c338282d5148 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -698,17 +698,26 @@ static int elevator_change(struct request_queue *q, const char *elevator_name)
 		return 0;
 
 	e = elevator_find_get(q, elevator_name);
-	if (!e) {
-		request_module("%s-iosched", elevator_name);
-		e = elevator_find_get(q, elevator_name);
-		if (!e)
-			return -EINVAL;
-	}
+	if (!e)
+		return -EINVAL;
 	ret = elevator_switch(q, e);
 	elevator_put(e);
 	return ret;
 }
 
+int elv_iosched_pre_store(struct gendisk *disk, const char *buf,
+			  size_t count)
+{
+	char elevator_name[ELV_NAME_MAX];
+
+	if (!elv_support_iosched(disk->queue))
+		return -ENOTSUPP;
+
+	strscpy(elevator_name, buf, sizeof(elevator_name));
+
+	return request_module("%s-iosched", elevator_name);
+}
+
 ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
 			  size_t count)
 {
diff --git a/block/elevator.h b/block/elevator.h
index 3fe18e1a8692..059172c0f93c 100644
--- a/block/elevator.h
+++ b/block/elevator.h
@@ -148,6 +148,7 @@ extern void elv_unregister(struct elevator_type *);
  * io scheduler sysfs switching
  */
 ssize_t elv_iosched_show(struct gendisk *disk, char *page);
+int elv_iosched_pre_store(struct gendisk *disk, const char *page, size_t count);
 ssize_t elv_iosched_store(struct gendisk *disk, const char *page,
 			  size_t count);
 extern bool elv_bio_merge_ok(struct request *, struct bio *);

-- 
Damien Le Moal
Western Digital Research
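
For readers who want to see the ordering argument in isolation, here is a
minimal user-space sketch of it. This is not kernel code and it is not the
patch itself: every model_* name, the two flags, and the pretend "bfq" module
are assumptions made up for illustration. It only models the point under
discussion, namely that a module load issued while the queue is frozen is the
deadlock-prone pattern, and that an optional pre_store hook moves the load in
front of the freeze.

```c
/* User-space model only; every model_* name is a hypothetical stand-in. */
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

static bool queue_frozen;   /* models the state set by blk_mq_freeze_queue() */
static bool module_loaded;  /* models the scheduler module being registered */
int loads_while_frozen;     /* counts occurrences of the deadlock-prone pattern */

/* Stand-in for request_module(); the real call may need I/O on our own disk. */
static int model_request_module(const char *name)
{
	if (queue_frozen)
		loads_while_frozen++;
	if (strcmp(name, "bfq") != 0)
		return -ENOENT;     /* assumption: only "bfq" exists as a module */
	module_loaded = true;
	return 0;
}

struct model_entry {
	int (*pre_store)(const char *page); /* runs before the freeze, may be NULL */
	int (*store)(const char *page);     /* runs with the queue frozen */
};

/* Split ordering: load the module first ... */
int model_pre_store(const char *page)
{
	return model_request_module(page);
}

/* ... then only look the scheduler up while the queue is frozen. */
int model_store(const char *page)
{
	return (module_loaded && strcmp(page, "bfq") == 0) ? 0 : -EINVAL;
}

/* Old ordering for contrast: the load happens inside the frozen section. */
int model_store_legacy(const char *page)
{
	if (!module_loaded && model_request_module(page))
		return -EINVAL;
	return model_store(page);
}

/* Models the shape of queue_attr_store(): optional hook, freeze, store. */
int model_attr_store(const struct model_entry *entry, const char *page)
{
	int res;

	if (entry->pre_store) {
		res = entry->pre_store(page);
		if (res)
			return res;
	}

	assert(!queue_frozen);
	queue_frozen = true;
	res = entry->store(page);
	queue_frozen = false;
	return res;
}
```

Driving model_attr_store() with an entry of { NULL, model_store_legacy }
increments loads_while_frozen, while an entry of { model_pre_store,
model_store } leaves it untouched. Note that in this model, as in the patch
above, a failed load is returned straight from the pre_store step, so an
unknown scheduler name is rejected before the queue is ever frozen.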