2012/6/12 Sarah Sharp <sarah.a.sharp@xxxxxxxxxxxxxxx>: > Hi Elric, > > Are you going to have time to rework this patch? I haven't heard back > from you for a week. > > If I don't hear from you in a couple days, I'll assume you're not > checking your email. Alex, would you consider making the changes > yourself if that happens? I am so sorry. I was assigned a emergency job by my boss since April, and just came back on last weekend. I will check and response these emails carefully. Best Regards, Elric > > Comments below. > > On Fri, Jun 01, 2012 at 02:39:05PM +0300, Alexander Shishkin wrote: >> Elric Fu <elricfu1@xxxxxxxxx> writes: >> > Hi Sarah, >> > >> > please test it in your old VIA USB3.0 hub. I can't duplicate >> > the case of the address device command timeout in my VIA >> > USB3.0 hub. I heard that the GENESYS GL3310 sata bridge would >> > respond the address device command after a long time when it >> > was plugged into VIA xHCI host. And the issue will occur after >> > several quick disconnect/connect cycles. Yesterday I finally >> > got a GL3310 device. But it is ok and responds the address >> > device command very quickly. I can't duplicate it. >> > >> > In additional, any comment about the patch? >> >> Hi, >> >> I've tested this patch (it requires some rediffing) with a UDC that I >> hacked to ignore SET_ADDRESS requests and it does fix the problem. Some >> comments below. > > First, in general, I would suggest you only manipulate the command ring > state while holding the xhci->lock. Since we can have multiple > outstanding commands that need to be aborted, you need to hold the lock > in places like this: > > temp_64 = xhci_read_64(xhci, &xhci->op_regs->cmd_ring); > if (!(temp_64 & CMD_RING_RUNNING)) { > xhci_dbg(xhci, "Command ring had been stopped\n"); > return -EINVAL; > } > xhci->cmd_ring_state = CMD_RING_STATE_ABORTED; > xhci_write_64(xhci, temp_64 | CMD_RING_ABORT, > &xhci->op_regs->cmd_ring); > > And what happens if you have multiple commands that need to be aborted, > and the abort handlers race with each other? > > You don't have that problem now, because this patch only adds support > for aborting the Set Address command. However, I also want to abort > other commands, like Configure Endpoint, etc. > > > This patch needs to be modified to cover several corner cases: > > 1. Each device can have commands outstanding independent of whether > other devices have commands queued. > > 2. Each device can have multiple commands queued for it at the same time. > > 3. The same type of command could be queued for the same device. It may > not make sense for most of the commands, but I want to future proof this > code in case that ever becomes true. > > 4. Commands could have different timeout values, or be aborted early > because of a signal from userspace. Therefore, the command we need to > abort may not be the command that is pointed to by the command ring > dequeue pointer. > > 5. Aborted commands need to have all memory for those commands freed by > the stopped command ring handler. Otherwise we race between the > function that issued the command ring abort and the handler. This means > the stopped command ring handler needs to free the xhci_command for any > aborted commands, and remove the command from the xhci_virt_device > command wait list. > > >> [snip] >> > +int xhci_abort_cmd_ring(struct xhci_hcd *xhci) >> >> Shouldn't this be static? >> >> > +{ >> > + u64 temp_64; >> > + int ret; >> > + >> > + xhci_dbg(xhci, "Abort command ring\n"); >> > + >> > + if (!(xhci->cmd_ring_state & CMD_RING_STATE_RUNNING)) { >> > + xhci_dbg(xhci, "The command ring isn't running, " >> > + "Have the command ring been stopped?"); >> > + return -EINVAL; >> > + } >> > + >> > + temp_64 = xhci_read_64(xhci, &xhci->op_regs->cmd_ring); >> > + if (!(temp_64 & CMD_RING_RUNNING)) { >> > + xhci_dbg(xhci, "Command ring had been stopped\n"); >> > + return -EINVAL; >> > + } > > Don't return an error code here. It's possible that two different > command aborts will race with each other, and you should be able to > handle that case. If you return -EINVAL here, xhci_cancel_cmd() will > tell the USB core that the host controller died, which is not what you > want. > > Instead, have xhci_cancel_cmd() add the command to the command list > under the xHCI lock, and read the command ring status while you're > holding that lock. Drop the lock, and take a look at the command ring > status. > > If it was stopped, you know handle_stopped_cmd_ring() hadn't run yet > when you queued the command to the cancel list (possibly because the > event ring handler hadn't got to the command stopped event TRB yet). > But you do know it *will* run and handle your command. So if the > command ring status was stopped, don't call xhci_abort_cmd_ring(), and > just return. > >> > + xhci->cmd_ring_state = CMD_RING_STATE_ABORTED; >> > + xhci_write_64(xhci, temp_64 | CMD_RING_ABORT, >> > + &xhci->op_regs->cmd_ring); >> > + >> > + /* Section 4.6.1.2 of xHCI 1.0 spec says software should >> > + * time the completion od all xHCI commands, including >> > + * the Command Abort operation. If software doesn't see >> > + * CRR negated in a timely manner (e.g. longer than 5 >> > + * seconds), then it should assume that the there are >> > + * larger problems with the xHC and assert HCRST. >> > + */ >> > + ret = handshake(xhci, &xhci->op_regs->cmd_ring, >> > + CMD_RING_RUNNING, 0, 5 * 1000 * 1000); >> > + if (ret < 0) { >> > + xhci_err(xhci, "Stopped the command ring failed, " >> > + "maybe the host is dead\n"); >> > + xhci->xhc_state |= XHCI_STATE_DYING; >> > + xhci_quiesce(xhci); >> > + xhci_halt(xhci); >> > + return -ESHUTDOWN; >> > + } >> > + >> > + return 0; >> > +} >> > + >> > +static int queue_cd(struct xhci_hcd *xhci, int slot_id, >> > + u32 cmd_type, int ep_index) >> > +{ >> > + struct xhci_cd *cd; >> > + >> > + cd = kzalloc(sizeof(struct xhci_cd), GFP_ATOMIC); >> > + if (!cd) >> > + return -ENOMEM; >> > + INIT_LIST_HEAD(&cd->cancel_cmd_list); >> > + >> > + cd->slot_id = slot_id; >> > + cd->cmd_type = cmd_type; >> > + cd->ep_index = ep_index; >> > + list_add_tail(&cd->cancel_cmd_list, &xhci->cancel_cmd_list); >> > + >> > + return 0; >> > +} >> > + >> > +int xhci_cancel_cmd(struct xhci_hcd *xhci, int slot_id, >> > + u32 cmd_type, int ep_index) >> > +{ >> > + int retval; >> > + unsigned long flags; >> > + >> > + spin_lock_irqsave(&xhci->lock, flags); >> >> It looks like xhci_abort_cmd_ring() can race with xhci_ring_cmd_db(), >> since there doesn't seem to be a requirement for the latter to be called >> under &xhci->lock. I found one case of this in xhci_run_finished() for >> controllers with NEC quirk enabled. >> >> > + >> > + if (xhci->xhc_state & XHCI_STATE_DYING) { >> > + xhci_warn(xhci, "Abort the command ring," >> > + " but the xHCI is dead.\n"); >> > + spin_unlock_irqrestore(&xhci->lock, flags); >> > + return -ESHUTDOWN; >> > + } >> > + >> > + /* queue the cmd desriptor to cancelled_cd_list */ >> > + retval = queue_cd(xhci, slot_id, cmd_type, ep_index); >> > + if (retval) { >> > + spin_unlock_irqrestore(&xhci->lock, flags); >> > + return retval; >> > + } >> > + >> > + /* abort command ring */ >> > + retval = xhci_abort_cmd_ring(xhci); >> > + if (retval) { >> > + xhci_err(xhci, "Abort command ring failed\n"); >> > + spin_unlock_irqrestore(&xhci->lock, flags); >> > + if (retval == -ESHUTDOWN) { >> > + usb_hc_died(xhci_to_hcd(xhci)->primary_hcd); >> > + xhci_dbg(xhci, "xHCI host controller is dead.\n"); >> > + } >> > + return retval; >> > + } >> > + >> > + spin_unlock_irqrestore(&xhci->lock, flags); >> > + return 0; >> > +} >> > + >> > void xhci_ring_ep_doorbell(struct xhci_hcd *xhci, >> > unsigned int slot_id, >> > unsigned int ep_index, >> > @@ -1050,6 +1152,127 @@ static int handle_cmd_in_cmd_wait_list(struct xhci_hcd *xhci, >> > return 1; >> > } >> > >> > +/* >> > + * Find the command trb need to be cancelled and modify it to NO OP >> > + * command. >> > + * >> > + * If we can't find the command trb, we think it had already been >> > + * executed. >> > + */ >> > +static void cd_to_noop(struct xhci_hcd *xhci, struct xhci_cd *cur_cd) >> > +{ >> > + struct xhci_segment *cur_seg; >> > + union xhci_trb *cmd_trb; >> > + int slot_id; >> > + u32 cmd_type; >> > + u32 cycle_state; >> > + int ep_index; >> > + >> > + if (xhci->cmd_ring->dequeue == xhci->cmd_ring->enqueue) >> > + return; >> > + >> > + /* find the current segmene of command ring */ >> > + cur_seg = find_trb_seg(xhci->cmd_ring->first_seg, >> > + xhci->cmd_ring->dequeue, &cycle_state); >> > + >> > + /* find the command trb matched by CD from command ring */ >> > + for (cmd_trb = xhci->cmd_ring->dequeue; >> > + cmd_trb != xhci->cmd_ring->enqueue; >> > + next_trb(xhci, xhci->cmd_ring, &cur_seg, &cmd_trb)) { >> > + if (TRB_TYPE_LINK_LE32(cmd_trb->generic.field[3])) >> > + continue; >> > + >> > + slot_id = TRB_TO_SLOT_ID( >> > + le32_to_cpu(cmd_trb->generic.field[3])); >> > + cmd_type = TRB_FIELD_TO_TYPE( >> > + le32_to_cpu(cmd_trb->generic.field[3])); >> > + if ((slot_id == cur_cd->slot_id) && >> > + (cmd_type == cur_cd->cmd_type)) { >> > + >> > + /* The ep_index in the command desriptors of Stop >> > + * Endpoint command, Set TR Dequeue Pointer command >> > + * and Reset Endpoint command is vaild. so it should >> > + * be match the ep_index of cd. >> > + */ >> > + if ((cmd_type == TRB_STOP_RING) || >> > + (cmd_type == TRB_SET_DEQ) || >> > + (cmd_type == TRB_RESET_EP)) { >> > + ep_index = TRB_TO_EP_INDEX(le32_to_cpu( >> > + cmd_trb->generic.field[3])); >> > + if (ep_index != cur_cd->ep_index) >> > + continue; >> > + } >> > + >> > + /* get cycle state from the origin command trb */ >> > + cycle_state = le32_to_cpu(cmd_trb->generic.field[3]) >> > + & TRB_CYCLE; >> > + >> > + /* modify the command trb to NO OP command */ >> > + cmd_trb->generic.field[0] = 0; >> > + cmd_trb->generic.field[1] = 0; >> > + cmd_trb->generic.field[2] = 0; >> > + cmd_trb->generic.field[3] = cpu_to_le32( >> > + TRB_TYPE(TRB_CMD_NOOP) | cycle_state); >> > + break; >> > + } >> > + } > > This function is just disturbing. What if there are multiple commands > of the same type for the device? You may end up setting the one you > don't want to a no-op. > > Why not change all the callers of xhci_cancel_cmd() to pass in the > pointer to the xhci_trb for that command? They can save the xhci > command ring enqueue pointer after they lock xhci->lock, and before they > call into any command queueing function. You need to set it to the > first TRB in the next segment if the command ring dequeue pointer is > pointing to a link TRB. > > Then you *know* which trb you're canceling, and you don't need to store > the TRB type. That will simplify this function quite a bit, because you > won't need to walk the entire command ring. > >> > +} >> > + >> > +static void cancel_cd(struct xhci_hcd *xhci) > > Please rename this to xhci_cancel_cd. We should have all functions > (even if they're static) start with xhci. I know I'm not completely > consistent with this, but that is my preference. > >> > +{ >> > + struct xhci_cd *cur_cd, *next_cd; >> > + >> > + if (list_empty(&xhci->cancel_cmd_list)) >> > + return; >> > + >> > + list_for_each_entry_safe(cur_cd, next_cd, >> > + &xhci->cancel_cmd_list, cancel_cmd_list) { >> > + xhci_dbg(xhci, "Cancel command, slot %d, cmd type %d, " >> > + "ep_index = %d\n", cur_cd->slot_id, >> > + cur_cd->cmd_type, cur_cd->ep_index); >> > + cd_to_noop(xhci, cur_cd); >> > + >> > + /* Whatever we find the matched command trb, we need to >> > + * release the command descriptor. >> > + */ >> > + list_del(&cur_cd->cancel_cmd_list); >> > + kfree(cur_cd); >> > + } >> > +} > > What happens if there are multiple commands that need to be aborted, and > commands other than the first one is in the waitlist? You need to > handle the fact that multiple aborted commands can be in the waitlist. > > Take this scenario, for instance: > > 1. khubd could be waiting on a Set Address command that has been queued > to the command ring. That command is NOT added to a command wait list. > > 2. As soon as the Set Address command is queued and the xhci->lock is > dropped, the driver for a different device asks for an alternate > interface setting change. This causes a Configure Endpoint command to > be queued to the command ring, and the command is added to the > xhci_virt_device command wait list. > > 3. The Set Address times out, and we start the abort process. The Set > Address command is queued to the canceled commands list, and we write to > the stop command register. > > 4. However, the host takes a long time to stop the command ring. This > allows the Configure Endpoint command to timeout, and be aborted. The > command ring is not stopped, so we queue the Configure Endpoint command > to the canceled command list, write to the stop command register, and > wait on the status register to report the command ring is stopped. > > 5. The command ring finally stops, and an event is generated with the > Command Aborted status. That brings us to handle_stopped_cmd_ring(). > > 6. We handle the aborted Set Address command. That command isn't > associated with a command wait list, as I've stated. *If* the virtual > device command wait list is empty, handle_cmd_in_cmd_wait_list() will > return safely without causing an NULL pointer dereferences when it tries > to remove the command ring. We'll move the dequeue pointer past the > command, and move on to process the other commands in the list. > > 7. cancel_cd processes the aborted Configure Endpoint command. It > doesn't call handle_cmd_in_waitlist(), so the command will remain on the > head of the xhci_virt_device wait list, and the command will never be > freed. > > Once the abort finishes, many of the callers will poison the waitlist > list_head and free the command memory. So you should audit those and > make sure they don't free the command before it is finished being > aborted. > > Also, it's assumed that when a command is completed, the function that's > waiting on that completion will free the xhci_command. However, if that > function has already returned, because the command ring has been > stopped, it won't be able to free the command. You'll need a new > function that unconditionally frees the aborted commands' xhci_command > structures. > > > Ok, here's what I suggest you do. First, modify xhci_cancel_cmd() to > take an xhci_command argument. Commands like Set Address that don't > allocate an xhci_command will just pass a NULL pointer. Store that > pointer in the xhci_cd. Add a new function, similar to > handle_cmd_in_waitlist. Perhaps call it handle_aborted_cmd_in_waitlist. > Make it take a pointer to an xhci_cd. That function should: > > 1. If the xhci_command pointer in xhci_cd is not NULL, remove the > command from the wait list, and free that xhci_command. The function > that allocated the command has stopped waiting on the completion (that's > why it cancelled the command in the first place), so there's no need to > complete the competion in the xhci_command. > > 2. Unconditionally turn the command into a no-op TD. > > 3. Remove the command from the canceled command list, and free the > list_head. > > > On an unrelated note, do you have any code that removes the xhci_cd from > the canceled command list if it completes before the command ring can be > stopped? > > handle_cmd_completion() should probably look through the > xhci->cancel_cmd_list for a matching command before the big switch > statement. If it finds a match, it should call > handle_aborted_cmd_in_waitlist() and goto the inc_deq() call for the > command ring, which will immediately return. The code to look for a > match is really simple if you store the TRB pointer in the xhci_cd like > I suggested. > >> > +static void handle_stopped_cmd_ring(struct xhci_hcd *xhci, >> > + struct xhci_event_cmd *event) >> > +{ >> > + struct xhci_virt_device *virt_dev; >> > + int slot_id = TRB_TO_SLOT_ID( >> > + le32_to_cpu(xhci->cmd_ring->dequeue->generic.field[3])); >> > + int cmd_trb_comp_code = GET_COMP_CODE(le32_to_cpu(event->status)); >> > + >> > + virt_dev = xhci->devs[slot_id]; >> > + if (virt_dev) { >> > + /* if the command is in cmd_list, handle it. */ >> > + handle_cmd_in_cmd_wait_list(xhci, virt_dev, event); >> > + } >> > + >> > + /* advance the dequeue pointer to next cmd trb */ >> > + inc_deq(xhci, xhci->cmd_ring, false); > > Ok, I guess we might end up aborting a perfectly good command here. I > say that because there might be a command on the ring before the command > that we want to abort. We could somehow use the command stop bit > instead of the command abort bit, but I don't think it's feasible. > > There's a race condition between when we decide whether to use the > command stop bit or the command abort bit. We could try to use the > command stop bit if the command we need to abort is not the command that > the command ring dequeue pointer is on. But it's possible that the > command before the one we want to abort has actually completed, but the > event ring handler hasn't got to that event yet. > > Basically, we need to always use the command abort flag instead of the > command stop flag. That makes me sad, but such is life. > >> > + >> > + /* set the cmd ring state */ >> > + if (cmd_trb_comp_code != COMP_CMD_STOP) { >> > + xhci->cmd_ring_state = CMD_RING_STATE_STOPPED; >> > + } else { >> > + cancel_cd(xhci); >> > + xhci->cmd_ring_state = CMD_RING_STATE_RUNNING; >> > + >> > + /* >> > + * ring command ring doorbell again to handle the waiting >> > + * command trbs due to aborting command ring >> > + */ >> > + if (xhci->cmd_ring->dequeue != xhci->cmd_ring->enqueue) >> > + xhci_ring_cmd_db(xhci); >> > + } >> > +} >> > + > > [snip] > >> > @@ -1394,6 +1402,11 @@ struct xhci_hcd { >> > /* data structures */ >> > struct xhci_device_context_array *dcbaa; >> > struct xhci_ring *cmd_ring; >> > + unsigned int cmd_ring_state; >> > +#define CMD_RING_STATE_RUNNING (1 << 0) >> > +#define CMD_RING_STATE_ABORTED (1 << 1) >> > +#define CMD_RING_STATE_STOPPED (1 << 2) >> >> I don't think you want a bit mask here, since you can't reside in more >> than one of these states at any given point in time. Can it be a enum >> instead? > > An enum would be fine, but we might need to change it in the future if > it turns out we need concurrent states. > > Sarah Sharp -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html