On 08/03/2021 19:59, Bart Van Assche wrote:
>>> This changes the behavior of blk_mq_tagset_busy_iter(). What will e.g.
>>> happen if the mtip driver calls blk_mq_tagset_busy_iter(&dd->tags,
>>> mtip_abort_cmd, dd) concurrently with another blk_mq_tagset_busy_iter()
>>> call and if that causes all mtip_abort_cmd() calls to be skipped?
>>
>> I'm not sure that I understand the problem you describe. So if
>> blk_mq_tagset_busy_iter(&dd->tags, mtip_abort_cmd, dd) is called,
>> either of the following can happen:
>>
>> a. Normal operation: iter_usage_counter initially holds a value >= 1,
>> iter_usage_counter is incremented in blk_mq_tagset_busy_iter(), and we
>> iterate the busy tags. Any parallel call to blk_mq_tagset_busy_iter()
>> also increments iter_usage_counter.
>>
>> b. We're switching IO scheduler. In this scenario, we first quiesce
>> all queues. After that, there should be no active requests. At that
>> point, we ensure that any calls to blk_mq_tagset_busy_iter() are
>> finished and block (or discard may be a better term) any further
>> calls. Blocking further calls should be safe as there are no requests
>> to iterate. atomic_cmpxchg() is used to set iter_usage_counter to 0,
>> blocking any further calls.

Hi Bart,

> My concern is about the insertion of the early return statement in
> blk_mq_tagset_busy_iter().

So I take this approach because I don't see a way to use a mutual
exclusion (waiting) mechanism to block calls to
blk_mq_tagset_busy_iter() while the IO scheduler is being switched.

The reason is that blk_mq_tagset_busy_iter() can be called from any
context, including hardirq, where we cannot sleep.

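To make that concrete, the lockless scheme I have in mind looks roughly
like the below. This is only a sketch, not the actual patch: the
existing busy-tag walk is elided, error paths are ignored, and
blk_mq_tagset_iter_lockout() is just a name I'm using here for the
elevator-switch side.

/*
 * Sketch only, not the actual patch: iter_usage_counter is assumed to
 * be an atomic_t added to struct blk_mq_tag_set, initialised to 1.
 */
void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
			     busy_tag_iter_fn *fn, void *priv)
{
	/* Return early while the elevator switch owns the tag set */
	if (!atomic_inc_not_zero(&tagset->iter_usage_counter))
		return;

	/* ... existing walk of the busy tags in tagset->tags[] ... */

	atomic_dec(&tagset->iter_usage_counter);
}

/* Elevator-switch side, called once all queues are quiesced */
static void blk_mq_tagset_iter_lockout(struct blk_mq_tag_set *set)
{
	/* Wait for in-flight iterators to finish, then drop 1 -> 0 */
	while (atomic_cmpxchg(&set->iter_usage_counter, 1, 0) != 1)
		cpu_relax();
}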

> Although most blk_mq_tagset_busy_iter()
> callers can handle skipping certain blk_mq_tagset_busy_iter() calls
> (e.g. when gathering statistics), I'm not sure this is safe for all
> blk_mq_tagset_busy_iter() callers. The example I cited is an example of
> a blk_mq_tagset_busy_iter() call with side effects.

I don't like to think of it as skipping the iteration, since that might
imply there are active requests to iterate and we're simply ignoring
them. It's more that we know there are no active requests, so there is
no point in trying to iterate.

> The mtip driver allocates one tag set per request queue so quiescing
> queues should be sufficient to address my concern for the mtip driver.
> The NVMe core and SCSI core however share a single tag set across
> multiple namespaces / LUNs. In the error path of nvme_rdma_setup_ctrl()
> I found a call to nvme_cancel_tagset(). nvme_cancel_tagset() calls
> blk_mq_tagset_busy_iter(ctrl->tagset, nvme_cancel_request, ctrl). I'm
> not sure it is safe to skip the nvme_cancel_request() calls if the I/O
> scheduler for another NVMe namespace is being modified.

Again, I would be relying on all request_queues associated with that
tagset being quiesced when switching IO scheduler at the point
blk_mq_tagset_busy_iter() is called and returns early.

Now, if there were active requests, I am relying on the request queue
quiescing to flush them. So blk_mq_tagset_busy_iter() could be called
during that quiescing period, and it would continue to iterate the
requests.

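In terms of ordering, what I'm relying on for a shared tag set is
something like the below. Again, just a sketch: blk_mq_exit_sched_lockout()
is a made-up name, and blk_mq_tagset_iter_lockout() is the hypothetical
helper from the sketch above.

/* Sketch: elevator-switch path for a tag set shared by many queues */
static void blk_mq_exit_sched_lockout(struct blk_mq_tag_set *set)
{
	struct request_queue *q;

	mutex_lock(&set->tag_list_lock);

	/*
	 * Quiesce every request_queue sharing this tag set, so no new
	 * requests are dispatched and any active ones get flushed.
	 */
	list_for_each_entry(q, &set->tag_list, tag_set_list)
		blk_mq_quiesce_queue(q);

	/*
	 * Only now do we drop iter_usage_counter to 0, so that a
	 * blk_mq_tagset_busy_iter() call which returns early has
	 * nothing left to iterate anyway.
	 */
	blk_mq_tagset_iter_lockout(set);

	mutex_unlock(&set->tag_list_lock);
}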

This does fall over if some tags are allocated without an associated
request queue, but I am not aware of any case where that happens.
Thanks,
John