Re: [PATCH 4/5] xhci: Fix command ring pointer corruption while aborting a command

Pavan Kondeti <quic_pkondeti@xxxxxxxxxxx> · Thu, 28 Oct 2021 13:33:02 +0530



On Wed, Oct 27, 2021 at 04:59:11PM +0300, Mathias Nyman wrote:
> > This patch passes my test. Can you please clarify the below question that is
> > bothering me.
> 
> Sure, currently this works as we only have one command ring segment.
> But lets look at a theoretical case with two segments far apart where
> the last command on a segment is aborted
> 
> > 
> > Here crcr points to the DMA address of the command which is getting
> > executed (soon will be aborted) by the xHC. After the ring is stopped,
> > we want xHC to execute the *next* command.
> > 
> > Is it guaranteed that the upper 32 bits of previous and next commands will
> > be same? Because when the issue happens, xHC takes the 32 bits of crcr
> > variable and update it's internal pointer from which it will fetch the commands
> > next time the ring is started. I think it is guaranteed because the upper 32
> > bits only may change when we cross the segments but in which case there will
> > be a link TRB in the middle. Since xHC command ring is stopped, it won't be
> > fetching the link TRB until door bell is rung again. Is my understanding correct?
> 
> Good point. 
> 
> This could be an issue as xHC will move the internal pointer to the next
> command TRB (past link) right after generating a "command aborted" event,
> before generating the "command ring stopped" event. The command ring stopped
> event should thus point to the next command TRB to handle.
> 
> I'll assume that the "command aborted" and "command ring stopped" events are generated
> before xHC updates the internal pointer based on CRCR write.
> 
> The CRCR write may thus move the internal dequeue pointer back to previous segment.
> 
> This may mess things up, even if the interrupt handler turns the aborted command TRB
> to a no-op, it's possible the partial CRCR update could cause internal pointer to 
> point to first command of old segment. 
> 
> > 
> > Also, what if there is a race with xHC just finishing this command (which we
> > are currently aborting) and next link TRB is traversed and processing the next
> > command in a different segment. For some reason, we could not update our
> > deq pointer and in middle of aborting the command (which is already completed)
> > and updating the higher 32 bit with the previous deq segment. This is a
> > hypothetical question as we are only using 1 segment for the command ring.
> 
> In this case we won't see the "command aborted" event. Only a "command ring
> stopped" event. and yes, the CRCR write could cause the internal dequeue pointer
> to be moved back to the old segment, and we have the same issues as in the 
> previous case.
> 
> These two cases could probably solved by writing the next command pointer to CRCR
> instead.
> 
> Only problem I see with this is if xHC hasn't even fetched the command TRB we are
> aborting. Then ring will stop at this last command TRB on segment.
> Writing the next command TRB pointer to CRCR forces xHC to move to the next
> segment without xHC processing the Link TRB, and cycle bit might get out of sync.
> 
> Hard to imagine how we end up in this situation.  
> 
> Then there is the third case where xHC handled multiple command TRBs, but interrupts
> are blocked so driver is unaware of their completion.
> Here its possible we write the wrong segment to CRCR even if we use the next
> command TRB pointer. xHC could be way past what driver thinks is the next command.
> 
> For this case we proably need to check for pending interrupts and unhandled command
> completions on the event ring before aborting the command.
> But that is probably another patch.
> 
> But for now the best solution seems to be just to write next command TRB
> pointer to CRCR during abort.
> 
Thanks Mathias for the detailed explanation. Agree that your patch is good
enough for the current situation. May be a TODO comment here would help. 

Thanks,
Pavan