Hi Elric, Are you going to have time to rework this patch? I haven't heard back from you for a week. If I don't hear from you in a couple days, I'll assume you're not checking your email. Alex, would you consider making the changes yourself if that happens? Comments below. On Fri, Jun 01, 2012 at 02:39:05PM +0300, Alexander Shishkin wrote: > Elric Fu <elricfu1@xxxxxxxxx> writes: > > Hi Sarah, > > > > please test it in your old VIA USB3.0 hub. I can't duplicate > > the case of the address device command timeout in my VIA > > USB3.0 hub. I heard that the GENESYS GL3310 sata bridge would > > respond the address device command after a long time when it > > was plugged into VIA xHCI host. And the issue will occur after > > several quick disconnect/connect cycles. Yesterday I finally > > got a GL3310 device. But it is ok and responds the address > > device command very quickly. I can't duplicate it. > > > > In additional, any comment about the patch? > > Hi, > > I've tested this patch (it requires some rediffing) with a UDC that I > hacked to ignore SET_ADDRESS requests and it does fix the problem. Some > comments below. First, in general, I would suggest you only manipulate the command ring state while holding the xhci->lock. Since we can have multiple outstanding commands that need to be aborted, you need to hold the lock in places like this: temp_64 = xhci_read_64(xhci, &xhci->op_regs->cmd_ring); if (!(temp_64 & CMD_RING_RUNNING)) { xhci_dbg(xhci, "Command ring had been stopped\n"); return -EINVAL; } xhci->cmd_ring_state = CMD_RING_STATE_ABORTED; xhci_write_64(xhci, temp_64 | CMD_RING_ABORT, &xhci->op_regs->cmd_ring); And what happens if you have multiple commands that need to be aborted, and the abort handlers race with each other? You don't have that problem now, because this patch only adds support for aborting the Set Address command. However, I also want to abort other commands, like Configure Endpoint, etc. This patch needs to be modified to cover several corner cases: 1. Each device can have commands outstanding independent of whether other devices have commands queued. 2. Each device can have multiple commands queued for it at the same time. 3. The same type of command could be queued for the same device. It may not make sense for most of the commands, but I want to future proof this code in case that ever becomes true. 4. Commands could have different timeout values, or be aborted early because of a signal from userspace. Therefore, the command we need to abort may not be the command that is pointed to by the command ring dequeue pointer. 5. Aborted commands need to have all memory for those commands freed by the stopped command ring handler. Otherwise we race between the function that issued the command ring abort and the handler. This means the stopped command ring handler needs to free the xhci_command for any aborted commands, and remove the command from the xhci_virt_device command wait list. > [snip] > > +int xhci_abort_cmd_ring(struct xhci_hcd *xhci) > > Shouldn't this be static? > > > +{ > > + u64 temp_64; > > + int ret; > > + > > + xhci_dbg(xhci, "Abort command ring\n"); > > + > > + if (!(xhci->cmd_ring_state & CMD_RING_STATE_RUNNING)) { > > + xhci_dbg(xhci, "The command ring isn't running, " > > + "Have the command ring been stopped?"); > > + return -EINVAL; > > + } > > + > > + temp_64 = xhci_read_64(xhci, &xhci->op_regs->cmd_ring); > > + if (!(temp_64 & CMD_RING_RUNNING)) { > > + xhci_dbg(xhci, "Command ring had been stopped\n"); > > + return -EINVAL; > > + } Don't return an error code here. It's possible that two different command aborts will race with each other, and you should be able to handle that case. If you return -EINVAL here, xhci_cancel_cmd() will tell the USB core that the host controller died, which is not what you want. Instead, have xhci_cancel_cmd() add the command to the command list under the xHCI lock, and read the command ring status while you're holding that lock. Drop the lock, and take a look at the command ring status. If it was stopped, you know handle_stopped_cmd_ring() hadn't run yet when you queued the command to the cancel list (possibly because the event ring handler hadn't got to the command stopped event TRB yet). But you do know it *will* run and handle your command. So if the command ring status was stopped, don't call xhci_abort_cmd_ring(), and just return. > > + xhci->cmd_ring_state = CMD_RING_STATE_ABORTED; > > + xhci_write_64(xhci, temp_64 | CMD_RING_ABORT, > > + &xhci->op_regs->cmd_ring); > > + > > + /* Section 4.6.1.2 of xHCI 1.0 spec says software should > > + * time the completion od all xHCI commands, including > > + * the Command Abort operation. If software doesn't see > > + * CRR negated in a timely manner (e.g. longer than 5 > > + * seconds), then it should assume that the there are > > + * larger problems with the xHC and assert HCRST. > > + */ > > + ret = handshake(xhci, &xhci->op_regs->cmd_ring, > > + CMD_RING_RUNNING, 0, 5 * 1000 * 1000); > > + if (ret < 0) { > > + xhci_err(xhci, "Stopped the command ring failed, " > > + "maybe the host is dead\n"); > > + xhci->xhc_state |= XHCI_STATE_DYING; > > + xhci_quiesce(xhci); > > + xhci_halt(xhci); > > + return -ESHUTDOWN; > > + } > > + > > + return 0; > > +} > > + > > +static int queue_cd(struct xhci_hcd *xhci, int slot_id, > > + u32 cmd_type, int ep_index) > > +{ > > + struct xhci_cd *cd; > > + > > + cd = kzalloc(sizeof(struct xhci_cd), GFP_ATOMIC); > > + if (!cd) > > + return -ENOMEM; > > + INIT_LIST_HEAD(&cd->cancel_cmd_list); > > + > > + cd->slot_id = slot_id; > > + cd->cmd_type = cmd_type; > > + cd->ep_index = ep_index; > > + list_add_tail(&cd->cancel_cmd_list, &xhci->cancel_cmd_list); > > + > > + return 0; > > +} > > + > > +int xhci_cancel_cmd(struct xhci_hcd *xhci, int slot_id, > > + u32 cmd_type, int ep_index) > > +{ > > + int retval; > > + unsigned long flags; > > + > > + spin_lock_irqsave(&xhci->lock, flags); > > It looks like xhci_abort_cmd_ring() can race with xhci_ring_cmd_db(), > since there doesn't seem to be a requirement for the latter to be called > under &xhci->lock. I found one case of this in xhci_run_finished() for > controllers with NEC quirk enabled. > > > + > > + if (xhci->xhc_state & XHCI_STATE_DYING) { > > + xhci_warn(xhci, "Abort the command ring," > > + " but the xHCI is dead.\n"); > > + spin_unlock_irqrestore(&xhci->lock, flags); > > + return -ESHUTDOWN; > > + } > > + > > + /* queue the cmd desriptor to cancelled_cd_list */ > > + retval = queue_cd(xhci, slot_id, cmd_type, ep_index); > > + if (retval) { > > + spin_unlock_irqrestore(&xhci->lock, flags); > > + return retval; > > + } > > + > > + /* abort command ring */ > > + retval = xhci_abort_cmd_ring(xhci); > > + if (retval) { > > + xhci_err(xhci, "Abort command ring failed\n"); > > + spin_unlock_irqrestore(&xhci->lock, flags); > > + if (retval == -ESHUTDOWN) { > > + usb_hc_died(xhci_to_hcd(xhci)->primary_hcd); > > + xhci_dbg(xhci, "xHCI host controller is dead.\n"); > > + } > > + return retval; > > + } > > + > > + spin_unlock_irqrestore(&xhci->lock, flags); > > + return 0; > > +} > > + > > void xhci_ring_ep_doorbell(struct xhci_hcd *xhci, > > unsigned int slot_id, > > unsigned int ep_index, > > @@ -1050,6 +1152,127 @@ static int handle_cmd_in_cmd_wait_list(struct xhci_hcd *xhci, > > return 1; > > } > > > > +/* > > + * Find the command trb need to be cancelled and modify it to NO OP > > + * command. > > + * > > + * If we can't find the command trb, we think it had already been > > + * executed. > > + */ > > +static void cd_to_noop(struct xhci_hcd *xhci, struct xhci_cd *cur_cd) > > +{ > > + struct xhci_segment *cur_seg; > > + union xhci_trb *cmd_trb; > > + int slot_id; > > + u32 cmd_type; > > + u32 cycle_state; > > + int ep_index; > > + > > + if (xhci->cmd_ring->dequeue == xhci->cmd_ring->enqueue) > > + return; > > + > > + /* find the current segmene of command ring */ > > + cur_seg = find_trb_seg(xhci->cmd_ring->first_seg, > > + xhci->cmd_ring->dequeue, &cycle_state); > > + > > + /* find the command trb matched by CD from command ring */ > > + for (cmd_trb = xhci->cmd_ring->dequeue; > > + cmd_trb != xhci->cmd_ring->enqueue; > > + next_trb(xhci, xhci->cmd_ring, &cur_seg, &cmd_trb)) { > > + if (TRB_TYPE_LINK_LE32(cmd_trb->generic.field[3])) > > + continue; > > + > > + slot_id = TRB_TO_SLOT_ID( > > + le32_to_cpu(cmd_trb->generic.field[3])); > > + cmd_type = TRB_FIELD_TO_TYPE( > > + le32_to_cpu(cmd_trb->generic.field[3])); > > + if ((slot_id == cur_cd->slot_id) && > > + (cmd_type == cur_cd->cmd_type)) { > > + > > + /* The ep_index in the command desriptors of Stop > > + * Endpoint command, Set TR Dequeue Pointer command > > + * and Reset Endpoint command is vaild. so it should > > + * be match the ep_index of cd. > > + */ > > + if ((cmd_type == TRB_STOP_RING) || > > + (cmd_type == TRB_SET_DEQ) || > > + (cmd_type == TRB_RESET_EP)) { > > + ep_index = TRB_TO_EP_INDEX(le32_to_cpu( > > + cmd_trb->generic.field[3])); > > + if (ep_index != cur_cd->ep_index) > > + continue; > > + } > > + > > + /* get cycle state from the origin command trb */ > > + cycle_state = le32_to_cpu(cmd_trb->generic.field[3]) > > + & TRB_CYCLE; > > + > > + /* modify the command trb to NO OP command */ > > + cmd_trb->generic.field[0] = 0; > > + cmd_trb->generic.field[1] = 0; > > + cmd_trb->generic.field[2] = 0; > > + cmd_trb->generic.field[3] = cpu_to_le32( > > + TRB_TYPE(TRB_CMD_NOOP) | cycle_state); > > + break; > > + } > > + } This function is just disturbing. What if there are multiple commands of the same type for the device? You may end up setting the one you don't want to a no-op. Why not change all the callers of xhci_cancel_cmd() to pass in the pointer to the xhci_trb for that command? They can save the xhci command ring enqueue pointer after they lock xhci->lock, and before they call into any command queueing function. You need to set it to the first TRB in the next segment if the command ring dequeue pointer is pointing to a link TRB. Then you *know* which trb you're canceling, and you don't need to store the TRB type. That will simplify this function quite a bit, because you won't need to walk the entire command ring. > > +} > > + > > +static void cancel_cd(struct xhci_hcd *xhci) Please rename this to xhci_cancel_cd. We should have all functions (even if they're static) start with xhci. I know I'm not completely consistent with this, but that is my preference. > > +{ > > + struct xhci_cd *cur_cd, *next_cd; > > + > > + if (list_empty(&xhci->cancel_cmd_list)) > > + return; > > + > > + list_for_each_entry_safe(cur_cd, next_cd, > > + &xhci->cancel_cmd_list, cancel_cmd_list) { > > + xhci_dbg(xhci, "Cancel command, slot %d, cmd type %d, " > > + "ep_index = %d\n", cur_cd->slot_id, > > + cur_cd->cmd_type, cur_cd->ep_index); > > + cd_to_noop(xhci, cur_cd); > > + > > + /* Whatever we find the matched command trb, we need to > > + * release the command descriptor. > > + */ > > + list_del(&cur_cd->cancel_cmd_list); > > + kfree(cur_cd); > > + } > > +} What happens if there are multiple commands that need to be aborted, and commands other than the first one is in the waitlist? You need to handle the fact that multiple aborted commands can be in the waitlist. Take this scenario, for instance: 1. khubd could be waiting on a Set Address command that has been queued to the command ring. That command is NOT added to a command wait list. 2. As soon as the Set Address command is queued and the xhci->lock is dropped, the driver for a different device asks for an alternate interface setting change. This causes a Configure Endpoint command to be queued to the command ring, and the command is added to the xhci_virt_device command wait list. 3. The Set Address times out, and we start the abort process. The Set Address command is queued to the canceled commands list, and we write to the stop command register. 4. However, the host takes a long time to stop the command ring. This allows the Configure Endpoint command to timeout, and be aborted. The command ring is not stopped, so we queue the Configure Endpoint command to the canceled command list, write to the stop command register, and wait on the status register to report the command ring is stopped. 5. The command ring finally stops, and an event is generated with the Command Aborted status. That brings us to handle_stopped_cmd_ring(). 6. We handle the aborted Set Address command. That command isn't associated with a command wait list, as I've stated. *If* the virtual device command wait list is empty, handle_cmd_in_cmd_wait_list() will return safely without causing an NULL pointer dereferences when it tries to remove the command ring. We'll move the dequeue pointer past the command, and move on to process the other commands in the list. 7. cancel_cd processes the aborted Configure Endpoint command. It doesn't call handle_cmd_in_waitlist(), so the command will remain on the head of the xhci_virt_device wait list, and the command will never be freed. Once the abort finishes, many of the callers will poison the waitlist list_head and free the command memory. So you should audit those and make sure they don't free the command before it is finished being aborted. Also, it's assumed that when a command is completed, the function that's waiting on that completion will free the xhci_command. However, if that function has already returned, because the command ring has been stopped, it won't be able to free the command. You'll need a new function that unconditionally frees the aborted commands' xhci_command structures. Ok, here's what I suggest you do. First, modify xhci_cancel_cmd() to take an xhci_command argument. Commands like Set Address that don't allocate an xhci_command will just pass a NULL pointer. Store that pointer in the xhci_cd. Add a new function, similar to handle_cmd_in_waitlist. Perhaps call it handle_aborted_cmd_in_waitlist. Make it take a pointer to an xhci_cd. That function should: 1. If the xhci_command pointer in xhci_cd is not NULL, remove the command from the wait list, and free that xhci_command. The function that allocated the command has stopped waiting on the completion (that's why it cancelled the command in the first place), so there's no need to complete the competion in the xhci_command. 2. Unconditionally turn the command into a no-op TD. 3. Remove the command from the canceled command list, and free the list_head. On an unrelated note, do you have any code that removes the xhci_cd from the canceled command list if it completes before the command ring can be stopped? handle_cmd_completion() should probably look through the xhci->cancel_cmd_list for a matching command before the big switch statement. If it finds a match, it should call handle_aborted_cmd_in_waitlist() and goto the inc_deq() call for the command ring, which will immediately return. The code to look for a match is really simple if you store the TRB pointer in the xhci_cd like I suggested. > > +static void handle_stopped_cmd_ring(struct xhci_hcd *xhci, > > + struct xhci_event_cmd *event) > > +{ > > + struct xhci_virt_device *virt_dev; > > + int slot_id = TRB_TO_SLOT_ID( > > + le32_to_cpu(xhci->cmd_ring->dequeue->generic.field[3])); > > + int cmd_trb_comp_code = GET_COMP_CODE(le32_to_cpu(event->status)); > > + > > + virt_dev = xhci->devs[slot_id]; > > + if (virt_dev) { > > + /* if the command is in cmd_list, handle it. */ > > + handle_cmd_in_cmd_wait_list(xhci, virt_dev, event); > > + } > > + > > + /* advance the dequeue pointer to next cmd trb */ > > + inc_deq(xhci, xhci->cmd_ring, false); Ok, I guess we might end up aborting a perfectly good command here. I say that because there might be a command on the ring before the command that we want to abort. We could somehow use the command stop bit instead of the command abort bit, but I don't think it's feasible. There's a race condition between when we decide whether to use the command stop bit or the command abort bit. We could try to use the command stop bit if the command we need to abort is not the command that the command ring dequeue pointer is on. But it's possible that the command before the one we want to abort has actually completed, but the event ring handler hasn't got to that event yet. Basically, we need to always use the command abort flag instead of the command stop flag. That makes me sad, but such is life. > > + > > + /* set the cmd ring state */ > > + if (cmd_trb_comp_code != COMP_CMD_STOP) { > > + xhci->cmd_ring_state = CMD_RING_STATE_STOPPED; > > + } else { > > + cancel_cd(xhci); > > + xhci->cmd_ring_state = CMD_RING_STATE_RUNNING; > > + > > + /* > > + * ring command ring doorbell again to handle the waiting > > + * command trbs due to aborting command ring > > + */ > > + if (xhci->cmd_ring->dequeue != xhci->cmd_ring->enqueue) > > + xhci_ring_cmd_db(xhci); > > + } > > +} > > + [snip] > > @@ -1394,6 +1402,11 @@ struct xhci_hcd { > > /* data structures */ > > struct xhci_device_context_array *dcbaa; > > struct xhci_ring *cmd_ring; > > + unsigned int cmd_ring_state; > > +#define CMD_RING_STATE_RUNNING (1 << 0) > > +#define CMD_RING_STATE_ABORTED (1 << 1) > > +#define CMD_RING_STATE_STOPPED (1 << 2) > > I don't think you want a bit mask here, since you can't reside in more > than one of these states at any given point in time. Can it be a enum > instead? An enum would be fine, but we might need to change it in the future if it turns out we need concurrent states. Sarah Sharp -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html