On Fri, Aug 20, 2021 at 04:38:24PM -0700, Bart Van Assche wrote:
> On 8/20/21 4:05 PM, Niklas Cassel wrote:
> > Thank you for your patch!
> > I tested it, and it does solve my problem.
>
> That's quick. Thanks!

Thank you for the patch!

> > I've been thinking more about this problem.
> > The problem is seen on a SATA zoned drive.
> >
> > These drives have mq-deadline set as default by the
> > blk_queue_required_elevator_features(q, ELEVATOR_F_ZBD_SEQ_WRITE) call in
> > drivers/scsi/sd_zbc.c:sd_zbc_read_zones().
> >
> > This triggers block/elevator.c:elevator_init_mq() to initialize
> > "mq-deadline" as the default scheduler for these devices.
> >
> > I think that the problem might be that drivers/scsi/sd_zbc.c
> > has created the request_queue and submitted requests before the call
> > to elevator_init_mq() is done.
> >
> > elevator_init_mq() will set q->elevator->type->ops, so once that is set,
> > blk_mq_free_request() will call e->type->ops.finish_request(rq),
> > regardless of whether the request was inserted through the recently
> > initialized scheduler or not.
> >
> > While I'm perfectly happy with your fix, would it perhaps be possible
> > to do the fix in block/elevator.c instead, so that we don't need to
> > do the same type of check that you did in each and every single
> > I/O scheduler?
> >
> > Looking at block/elevator.c:elevator_init_mq(), it seems to do:
> >
> > blk_mq_freeze_queue()
> > blk_mq_quiesce_queue()
> >
> > blk_mq_init_sched()
> >
> > blk_mq_unquiesce_queue()
> > blk_mq_unfreeze_queue()
> >
> > This obviously isn't enough to avoid the bug that we are seeing,
> > but could perhaps a more general fix be to flush/wait until all
> > in-flight requests have completed, and then free them, and only then
> > set q->elevator->type->ops? That way, all requests inserted after
> > the I/O scheduler has been initialized will have gone through the
> > I/O scheduler, so all finish_request() calls should have a
> > matching insert_request() call. What do you think?
>
> q->elevator is set from inside the I/O scheduler's init_sched callback and
> that callback is called with the request queue frozen. Freezing happens by
> calling blk_mq_freeze_queue() and that function waits until all previously
> submitted requests have finished. So I don't think that the race described
> above can happen.

I see.

I was mainly thinking that it should be possible to do a generic fix,
so that we eventually won't need a fix similar to yours in all the
different I/O schedulers.

However, looking at e.g. BFQ, it does appear to have something similar
to your fix already:

#define RQ_BFQQ(rq)		((rq)->elv.priv[1])

bfq_finish_requeue_request()
	struct bfq_queue *bfqq = RQ_BFQQ(rq);
	...
	if (!rq->elv.icq || !bfqq)
		return;

So your proposed fix should also be fine.

However, it does not apply on top of Torvalds' master or Jens's for-next
branch, because they both have reverted your cgroup support patch.

If you rebase your fix and send it out, I will be happy to send out a
Reviewed-by/Tested-by.


Kind regards,
Niklas
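
P.S. For illustration only, here is a minimal sketch (not taken from your
patch) of the kind of per-scheduler guard I was referring to, modeled on
the BFQ check quoted above. The function name is hypothetical, and it
assumes the scheduler stores something in rq->elv.priv[0] from its
insert_request callback:

	/* Hypothetical scheduler finish_request callback (sketch only). */
	static void example_finish_request(struct request *rq)
	{
		/*
		 * Assumed to be set by this scheduler's insert_request
		 * callback, so it is still NULL for requests that never
		 * went through the scheduler, e.g. requests submitted
		 * before elevator_init_mq() installed the scheduler ops.
		 */
		if (!rq->elv.priv[0])
			return;	/* not inserted via this scheduler */

		/* Undo the accounting that was done at insert time. */
	}

With a check like this, finish_request() becomes a no-op for requests that
bypassed the scheduler, which is the same effect as the
"!rq->elv.icq || !bfqq" test in bfq_finish_requeue_request().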