Re: [PATCH 4/5] block: drain file system I/O on del_gendisk

Christoph Hellwig <hch@xxxxxx> · Fri, 1 Oct 2021 06:13:48 +0200

On Wed, Sep 29, 2021 at 04:17:01PM +0800, Ming Lei wrote:

[full quote deleted]

> Draining request won't fix the problem completely:
> 
> 1) blk-mq dispatch code may still be in-progress after q_usage_counter
> becomes zero, see the story in 662156641bc4 ("block: don't drain in-progress dispatch in
> blk_cleanup_queue()")

That commit does not have a good explanation of what it actually fixed.

> 2) elevator code / blkcg code may still be called after blk_cleanup_queue(), such
> as kyber, trace_kyber_latency()(q->disk is referred) is called in kyber's timer
> handler, and the timer is deleted via del_timer_sync() via kyber_exit_sched()
> from blk_release_queue().

Yes.  There are two things we can do here:

 - stop using the dev_t in tracing a request_queue
 - exit the I/O schedulers in del_gendisk, because they are only used
   for file system I/O that requires the gendisk anyway

We'll probably want both eventually.
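
To illustrate the ordering argument behind the second option, here is a
rough user-space model.  It is only a sketch, not kernel code: every
toy_* name is made up for illustration, and clearing timer_armed merely
stands in for the del_timer_sync() call mentioned above.  The point it
tries to show is that if the scheduler is exited before the disk pointer
goes away, a late timer handler has nothing left to dereference.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are NOT the kernel structures. */
struct toy_disk {
	int devt;			/* positive device number used by "tracing" */
};

struct toy_sched {
	bool timer_armed;		/* models a pending scheduler timer */
};

struct toy_queue {
	struct toy_disk *disk;
	struct toy_sched *sched;
};

/*
 * Models a scheduler timer handler that traces via q->disk.
 * Returns the devt it would trace, 0 if no timer runs any more,
 * or -1 if it would have touched a disk that is already gone.
 */
static int toy_sched_timer_fn(struct toy_queue *q)
{
	if (!q->sched || !q->sched->timer_armed)
		return 0;
	if (!q->disk)
		return -1;		/* a use-after-free in real code */
	return q->disk->devt;
}

/* Models exiting the scheduler, including stopping its timer. */
static void toy_exit_sched(struct toy_queue *q)
{
	if (q->sched) {
		q->sched->timer_armed = false;
		q->sched = NULL;
	}
}

/* Assumed ordering: the scheduler exits before the disk goes away. */
static void toy_del_disk(struct toy_queue *q)
{
	assert(q != NULL);
	toy_exit_sched(q);
	q->disk = NULL;
}
```

In this model, calling toy_sched_timer_fn() after toy_del_disk() returns
0, while clearing q->disk with the scheduler still attached makes it
return -1, which is the case Ming describes for kyber.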

> 
> > +
> > +	rq_qos_exit(q);
> > +	blk_sync_queue(q);
> > +	blk_flush_integrity();
> > +	/*
> > +	 * Allow using passthrough request again after the queue is torn down.
> > +	 */
> > +	blk_mq_unfreeze_queue(q);
> 
> Again, one FS bio is still possible to enter queue now: submit_bio_checks()
> is done before set_capacity(0), and submitted after blk_mq_unfreeze_queue()
> returns.

Not with the new patch 1 in this series.

Jens - can you take a look at the series that fixes the crashes people
are sending while I'm looking at the rest of the corner cases?