On Wed, Sep 29, 2021 at 04:17:01PM +0800, Ming Lei wrote:

[full quote deleted]

> Draining request won't fix the problem completely:
>
> 1) blk-mq dispatch code may still be in-progress after q_usage_counter
> becomes zero, see the story in 662156641bc4 ("block: don't drain in-progress dispatch in
> blk_cleanup_queue()")

That commit does not have a good explanation of what it actually fixed.

> 2) elevator code / blkcg code may still be called after blk_cleanup_queue(), such
> as kyber, trace_kyber_latency()(q->disk is referred) is called in kyber's timer
> handler, and the timer is deleted via del_timer_sync() via kyber_exit_sched()
> from blk_release_queue().

Yes.  There are two things we can do here:

 - stop using the dev_t in tracing a request_queue
 - exit the I/O schedulers in del_gendisk, because they are only used
   for file system I/O that requires the gendisk anyway

We'll probably want both eventually.

>
> > +
> > +	rq_qos_exit(q);
> > +	blk_sync_queue(q);
> > +	blk_flush_integrity();
> > +	/*
> > +	 * Allow using passthrough request again after the queue is torn down.
> > +	 */
> > +	blk_mq_unfreeze_queue(q);
>
> Again, one FS bio is still possible to enter queue now: submit_bio_checks()
> is done before set_capacity(0), and submitted after blk_mq_unfreeze_queue()
> returns.

Not with the new patch 1 in this series.

Jens - can you take a look at the series that fixes the crashes
people are sending while I'm looking at the rest of the corner cases?