On 10/31/22 6:03 PM, Jens Axboe wrote:
> On 10/31/22 5:35 PM, Al Viro wrote:
>> On Mon, Oct 31, 2022 at 04:42:11PM -0600, Jens Axboe wrote:
>>> On 10/31/22 4:12 PM, Al Viro wrote:
>>>> static void blk_add_rq_to_plug(struct blk_plug *plug, struct request *rq)
>>>> {
>>>> 	struct request *last = rq_list_peek(&plug->mq_list);
>>>>
>>>> Suppose it's not NULL...
>>>>
>>>> 	if (!plug->rq_count) {
>>>> 		trace_block_plug(rq->q);
>>>> 	} else if (plug->rq_count >= blk_plug_max_rq_count(plug) ||
>>>> 		   (!blk_queue_nomerges(rq->q) &&
>>>> 		    blk_rq_bytes(last) >= BLK_PLUG_FLUSH_SIZE)) {
>>>> ... and we went here:
>>>> 		blk_mq_flush_plug_list(plug, false);
>>>> All requests, including the one last points to, might get fed ->queue_rq()
>>>> here. At which point there seems to be nothing to prevent them getting
>>>> completed and freed on another CPU, possibly before we return here.
>>>>
>>>> 		trace_block_plug(rq->q);
>>>> 	}
>>>>
>>>> 	if (!plug->multiple_queues && last && last->q != rq->q)
>>>> ... and here we dereference last.
>>>>
>>>> Shouldn't we reset last to NULL after the call of blk_mq_flush_plug_list()
>>>> above?
>>>
>>> There's no UAF here as the requests aren't freed. We could clear 'last'
>>> to make the code more explicit, and that would avoid any potential
>>> suboptimal behavior with ->multiple_queues being wrong.
>>
>> Umm...
>> Suppose ->has_elevator is false and so's ->multiple_queues.
>> No ->queue_rqs(), so blk_mq_flush_plug_list() grabs rcu_read_lock() and
>> hits blk_mq_plug_issue_direct().
>> blk_mq_plug_issue_direct() picks the first request off the list
>> and passes it to blk_mq_request_issue_directly(), which passes it
>> to __blk_mq_request_issue_directly(). There we grab a tag and
>> proceed to __blk_mq_issue_directly(), which feeds the request to ->queue_rq().
>>
>> What's to stop e.g. a worker on another CPU from picking that sucker,
>> completing it and calling blk_mq_end_request(), which completes all
>> bios involved and calls blk_mq_free_request()?
>>
>> If all of that manages to happen before blk_mq_flush_plug_list()
>> returns to the caller... Sure, you probably won't hit it on bare
>> metal, but if you are in a KVM and this virtual CPU happens to
>> lose the host timeslice... I've seen considerably narrower
>> race windows getting hit on such setups.
>>
>> Am I missing something subtle here? It's been a long time since
>> I've read through that area - as a matter of fact, I'm trying
>> to refresh my memories of the bio_submit()-related code paths
>> at the moment...
>
> With blk-mq, which all drivers are these days, the request memory is
> statically allocated when the driver is set up. I'm not denying that you
> could very well have issued AND completed the request which 'last'
> points to by the time we dereference it, but that won't be a UAF unless
> the device has also been quiesced and removed in between. Which I guess
> could indeed be a possibility, since it is by definition on a different
> queue (->multiple_queues would be true, however, but that's also what
> would be required to reach that far into that statement).
>
> This is different from the older days of a request being literally freed
> when it completes, which is what I initially reacted to.
>
> As mentioned in the original reply, I do think we should just clear
> 'last' as you suggest. But it's not something we've seen on the FB fleet
> of servers, even with the majority of hosts running this code (and on
> VMs).
Forgot to ask - do you want to send a patch for that, or do you just want
me to cook one up with a reported-by for you?

-- 
Jens Axboe
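
For illustration, here is a minimal sketch of the change being discussed: clear 'last' once the plug has been flushed, so that the later 'last->q != rq->q' check can never dereference a request that may already have been issued and completed by then. This is an untested sketch against the blk_add_rq_to_plug() body quoted above, not an actual patch; the surrounding lines may differ between kernel versions.

static void blk_add_rq_to_plug(struct blk_plug *plug, struct request *rq)
{
	struct request *last = rq_list_peek(&plug->mq_list);

	if (!plug->rq_count) {
		trace_block_plug(rq->q);
	} else if (plug->rq_count >= blk_plug_max_rq_count(plug) ||
		   (!blk_queue_nomerges(rq->q) &&
		    blk_rq_bytes(last) >= BLK_PLUG_FLUSH_SIZE)) {
		blk_mq_flush_plug_list(plug, false);
		/*
		 * Everything on the plug list, including the request that
		 * 'last' points to, may have been issued (and completed on
		 * another CPU) by the time the flush returns; forget it so
		 * the check below never looks at it.
		 */
		last = NULL;
		trace_block_plug(rq->q);
	}

	/* 'last' is now either a request still sitting on the plug list, or NULL. */
	if (!plug->multiple_queues && last && last->q != rq->q)
		plug->multiple_queues = true;

	/* ... rest of the function unchanged ... */
}

Besides closing the window Al describes, clearing the pointer also keeps ->multiple_queues from being set based on a stale request, which is the "suboptimal behavior" mentioned earlier in the thread.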