On 2/4/2020 12:59 PM, Jens Axboe wrote:
> On 2/4/20 12:51 AM, Christoph Hellwig wrote:
>> On Mon, Feb 03, 2020 at 01:07:48PM -0800, Bijan Mottahedeh wrote:
>>> My concern is with the code below for the single bio async case:
>>>
>>> 	qc = submit_bio(bio);
>>> 	if (polled)
>>> 		WRITE_ONCE(iocb->ki_cookie, qc);
>>>
>>> The bio/dio can be freed before the cookie is written, which is what
>>> I'm seeing, and I thought this may lead to a scenario where that iocb
>>> request could be completed, freed, reallocated, and resubmitted in the
>>> io_uring layer; i.e., I thought the cookie could be written into the
>>> wrong iocb.
>> I think we do have a potential use after free of the iocb here.
>> But taking a bio reference isn't going to help with that, as the iocb
>> and bio/dio life times are unrelated.
>>
>> I vaguely remember having that discussion with Jens a while ago, and
>> tried to pass a pointer to the qc to submit_bio so that we can set
>> it at submission time, but he came up with a reason why that might not
>> be required.  I'd have to dig out all notes unless Jens remembers
>> better.
> Don't remember that either, so I'd have to dig out emails! But looking
> at it now, for the async case with io_uring, the iocb is embedded in the
> io_kiocb from io_uring. We hold two references to the io_kiocb, one for
> submit and one for completion. Hence even if the bio completes
> immediately and someone else finds the completion before the application
> doing this submit, we still hold the submission reference to the
> io_kiocb. Hence I don't really see how we can end up with a
> use-after-free situation here.
>
> IIRC, Bijan had traces showing this can happen, KASAN complaining about
> it. Which makes me think that I'm missing a case here, though I don't
> immediately see what it is.
>
> Bijan, could you post your trace again? I can't seem to find it.
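
To restate the window I was worried about, with my reading annotated on
the snippet quoted above (the comments are my suspicion, not established
fact):

	qc = submit_bio(bio);
	/*
	 * The bio can complete as soon as submit_bio() returns.  If that
	 * completion is then reaped and the io_kiocb recycled for a new
	 * request, the store below lands in a reused iocb -- the
	 * use-after-free I suspected.
	 */
	if (polled)
		WRITE_ONCE(iocb->ki_cookie, qc);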
I think the problem may be in the nvme driver's handling of multiple
pollers sharing the same CQ, because nvme_poll() drops cq_poll_lock
before completing the CQEs found by nvme_process_cq():
nvme_poll()
{
	...
	spin_lock(&nvmeq->cq_poll_lock);
	found = nvme_process_cq(nvmeq, &start, &end, -1);
	spin_unlock(&nvmeq->cq_poll_lock);

	/* the CQEs in [start, end) are completed after the lock is dropped */
	nvme_complete_cqes(nvmeq, start, end);
	...
}
Furthermore, nvme_process_cq() rings the CQ doorbell after collecting
the CQEs but before processing them:
static inline int nvme_process_cq(struct nvme_queue *nvmeq, u16 *start,
				  u16 *end, unsigned int tag)
{
	...
	while (nvme_cqe_pending(nvmeq)) {
		...
		nvme_update_cq_head(nvmeq);
	}
	...
	/*
	 * Publishes the new head: every slot collected above is handed
	 * back to the controller before any of it has been completed.
	 */
	nvme_ring_cq_doorbell(nvmeq);
	return found;
}
Each poller effectively tells the controller that the CQ is empty when
it rings the CQ doorbell. That is fine with a single poller, but with
many of them I think enough tags can be freed and reissued that the CQ
can be overrun.
In one specific example:

- Poller 1 finds a CQ full of entries in nvme_process_cq().
- While poller 1 is completing its CQEs, more pollers find CQE ranges
  to process: pollers 2-4 pick up additional, non-overlapping ranges.
- Poller 5 finds a CQE range that overlaps with poller 1's.
CQ size: 1024

  Poller            1      2      3      4      5
  CQ start index   10      9    214    401    708
  CQ end index      9    214    401    708     77
  CQ start phase    1      0      0      0      0
  CQ end phase      0      0      0      0      1
Poller 1 finds that the CQ phase has flipped when it reaches CQE 821,
and indeed the phase has flipped because of poller 5. If I'm
interpreting this data correctly, pollers 1 and 5 are processing
overlapping CQE ranges. After that point I start seeing errors.
A simpler, theoretical two-thread example suggested by Matthew Wilcox:

- Thread 1 submits enough I/O to fill the CQ.
- Thread 1 then processes two CQEs; two block layer tags become available.
- Thread 1 is preempted by thread 2.
- Thread 2 submits two I/Os.
- Thread 2 processes the two CQEs which it owns.
- Thread 2 submits two more I/Os.
- Those CQEs overwrite the next two CQEs that thread 1 has yet to process.

Two of thread 1's I/Os will never receive a completion, and two of
thread 2's I/Os will each receive two completions.
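
To check my understanding of that interleaving, here is a toy,
single-threaded replay of it (purely illustrative: the names are made
up, phase bits and the pending check are hand-driven rather than
modeled, and I let thread 1 claim every pending CQE up front, the way
nvme_process_cq() does):

#include <stdio.h>

#define CQ_SIZE 4

static int cq[CQ_SIZE];   /* command id stored in each CQE slot */
static int cq_head;       /* shared consumer head index */
static int cq_tail;       /* controller's write position */
static int free_slots;    /* slots the doorbell has advertised as free */

/* Controller side: write one completion into the CQ, if there is room. */
static void post(int id)
{
	if (!free_slots) {
		printf("  controller: CQ full, id %d held back\n", id);
		return;
	}
	printf("  controller: id %d written to slot %d\n", id, cq_tail);
	cq[cq_tail] = id;
	cq_tail = (cq_tail + 1) % CQ_SIZE;
	free_slots--;
}

/*
 * Modeled on nvme_process_cq(): claim n pending CQEs as the range
 * [*start, return value) and ring the doorbell, handing the slots
 * back to the controller before anything has been completed.
 */
static int process_cq(int n, int *start)
{
	*start = cq_head;
	cq_head = (cq_head + n) % CQ_SIZE;
	free_slots += n;			/* the doorbell write */
	return cq_head;
}

/*
 * Modeled on nvme_complete_cqes(): completion re-reads the CQ memory,
 * so it sees whatever sits in the slot *now*.
 */
static void complete_cqes(int start, int end, const char *who)
{
	for (; start != end; start = (start + 1) % CQ_SIZE)
		printf("  %s: completing id %d (slot %d)\n", who, cq[start], start);
}

int main(void)
{
	int start1, end1, start2, end2;

	free_slots = CQ_SIZE;
	printf("thread 1 submits ids 1-4; all complete and the CQ is full\n");
	post(1); post(2); post(3); post(4);

	printf("thread 1 claims all 4 CQEs and rings the doorbell\n");
	end1 = process_cq(4, &start1);
	printf("thread 1 completes two CQEs (freeing two tags), then is preempted\n");
	complete_cqes(start1, (start1 + 2) % CQ_SIZE, "thread 1");

	printf("thread 2 submits ids 5-6 with the freed tags\n");
	post(5); post(6);
	printf("thread 2 polls and completes the two CQEs it owns\n");
	end2 = process_cq(2, &start2);
	complete_cqes(start2, end2, "thread 2");

	printf("thread 2 submits ids 7-8; they land in slots 2-3, which\n"
	       "thread 1 has claimed but not yet completed\n");
	post(7); post(8);

	printf("thread 1 resumes and completes the rest of its range\n");
	complete_cqes((start1 + 2) % CQ_SIZE, end1, "thread 1");

	printf("thread 2 polls again and completes slots 2-3 a second time\n");
	end2 = process_cq(2, &start2);
	complete_cqes(start2, end2, "thread 2");

	printf("net effect: ids 3-4 never completed, ids 7-8 completed twice\n");
	return 0;
}

Running it shows thread 1 completing ids 7 and 8 in the slots where it
had claimed 3 and 4, and thread 2 then completing 7 and 8 a second time.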
As a workaround, I held cq_poll_lock across the CQE completions and no
longer see any errors.
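
That is, roughly this change against the nvme_poll() snippet above
(just a sketch of what I ran, not a proposed patch):

 	spin_lock(&nvmeq->cq_poll_lock);
 	found = nvme_process_cq(nvmeq, &start, &end, -1);
-	spin_unlock(&nvmeq->cq_poll_lock);
 	nvme_complete_cqes(nvmeq, start, end);
+	spin_unlock(&nvmeq->cq_poll_lock);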
Does that make sense?
Thanks.
--bijan