On Tue, Feb 19, 2019 at 8:48 AM Bart Van Assche <bvanassche@xxxxxxx> wrote: > > On Fri, 2019-02-15 at 10:29 -0800, Evan Green wrote: > > I got all turned around while trying to understand this fix, and I'll > > admit it's probably just me. It looks like you're trying to use an rcu > > lock to prevent the iter functions from racing with free. Is that > > true? But then the race at least as I understand it wasn't that there > > was a free in-flight, it's that the free had happened a long time ago, > > and there was still a stale value in tags->rqs[bitnr]. So maybe > > there's some other issue that part is solving? > > > > And then it looks like you added a new struct where tags->rqs was so > > that you could compare hctx->queue without reaching through rq. I have > > no idea if that's sufficient to prevent stale accesses through rq. > > Since you're changing multiple values I think where you populate that > > structure you'd at least need to do something like: clear rq, barrier, > > set hctx, barrier, set rq. But like Bart said, that's probably not the > > right way to go. > > > > Finally, I didn't really get why the conditional in > > blk_mq_rq_inflight() changed. Is that guarding against some other > > identified problem, or just an additional check you can do when you > > convert rqs into a struct? > > > > It looks like blk_mq_free_rqs() might be the magic function I was > > looking for earlier. Would it be possible to just clear tags[rq->tag] > > for each static_rq? Or is it possible for rqs from one set to end up > > in the tags array of another set? (Which would make what I just > > suggested insufficient). > > Hi Evan, > > Multiple request queues can share a single tag set. Examples of use cases > are NVMe namespaces associated with the same NVMe controller and SCSI LUNs > associated with the same SCSI host. The code that allocates a request not > only marks a tag as allocated but also updates the rqs[] array in the > tag set (see also blk_mq_get_request()). The code that iterates over a > tag set examines both the tag bitmap and the rqs[] array. The code that > allocates requests and the code that iterates over requests are not > serialized against each other. That is why the association of a tag with a > request queue can change while iterating over a tag set. > > Another race condition is that a request can be allocated or freed while > iterating over a tag set. Oh, true. The iter code was trying (unsuccessfully) to guard against allocations with its if (rq && rq->q ...), but I hadn't thought about in-progress frees. > > Fixing these race conditions would require a locking mechanism and I think > the performance overhead of such a locking mechanism is larger than > acceptable. So code that iterates over requests either must be able to > handle such race conditions or must suspend request processing before it > iterates over requests. > > This is why I'm in favor of only modifying blk_mq_free_rqs() such that the > memory in which the tags are stored is delayed until a grace period has > occurred. This makes sense to me for the purpose of not racing with an in-progress free. I just can't fully prove to myself that it solves the stale tags entry, since unless we clear the entry in the tags array it seems like there could maybe still be stale entries. Maybe I'll get it in the next change. Jens, keep me in the loop if you send another patch please. -Evan