On Mon, 5 Dec 2011, Sarah Sharp wrote: > > > It's true that another interface driver could want to reset the USB > > > device, not just the UAS driver. But is there a way to tell the > > > difference without adding code to the UAS driver? > > > > There _is_ no real difference. For example, it's entirely possible > > that another interface's driver could decide to reset the device at the > > same time as the UAS driver. > > > > On the other hand, it's not a big deal to add a single flag meaning "I > > have decided to reset the device". > > Ok, sure. So your suggestion would be to check that flag, and if it's > not set, attempt to wait for the commands to flush themselves out before > canceling and killing them? I don't know. That's a design decision and it helps to have experience with real-life UAS disks and their problems. You have much more experience with them than I do (namely, zero). > > In that case the other driver would have to wait for the timeout. You > > could improve the situation a little when this occurs by giving the > > existing commands only a short time in which to complete, after which > > you try to abort them. The abort would also need to have a short > > timeout. > > Ok, so for each anchor, I would call usb_wait_anchor_empty_timeout() > with a short timeout, then usb_kill_anchored_urbs()? > > BTW, how short is short? There could be up to 256 commands queued and > we don't want the user to wait for, say, more than a second, so I don't > think the timeout should be bigger than about 3ms? You could use a single timeout of 500 ms (or something like that) for all the queued commands in parallel. Anything that hasn't finished by then gets aborted. > > I guess this depends on your philosophy: When a reset is coming up, do > > you want to allow as many outstanding commands as possible to complete > > normally? Or do you prefer to abort them all immediately? > > Well, by the time we get to a reset (one issued by the UAS driver, not > another driver), the device has already refused to take a particular > command out of its queue by either not answering or returning a failing > status for the ABORT TASK IU. So at that point we know the device has > something wrong with its SCSI queue handling. Do we trust it with other > commands? I'm not sure, but I lean towards aborting all outstanding > commands rather than letting them finish. All right, that's your choice. > > If you do decide to cancel them, then yes, uas_stat_cmplt requires some > > sort of synchronization. But it has to be synchronized with the > > cancellation operation, not with the reset. > > uas_stat_cmplt() needs to be synchronized with the reset operation > because it can cause the work queue to be scheduled. uas_prereset needs > to ensure that work queue is completely stopped before it attempts to > manipulate the workqueue and kill any partially queued URBs. > > > For example, what would happen if the driver were asked to cancel a > > command just as uas_stat_cmplt was called? > > Ok, you're correct that there is a problem there, but I think it's a > separate issue from whether we need SRCU for the work queue sync. Yes, it is separate. At this point I was considering what sort of synchronization you need for uas_stat_cmplt(). > I think the race condition you saw would go something like this: > > - SCSI/block layer attempts to cancel a command for a USB 2.0 UAS > device, and uas_eh_abort_handler() gets past issuing the abort task, > and killing any anchored URBs. > - In interrupt, the status command completes, submits the new data and > status URBs, and returns. > - uas_eh_abort_handler() returns and lets the SCSI core know the task > was successfully aborted. > - Later the two URBs return, and we attempt to complete a command the > SCSI/block layer already returned as canceled. More or less. There are other similar races, such as between sending the data & status URBs vs. sending ABORT-TASK URBs. The real question is when you decide that a command has completed, and I don't remember the details of the protocol well enough to say much more than that. Anyway, if you fix these races you'll find that you have taken care of the pre-reset synchronization as well. > I think that could be worked around by adding an atomic_t or using > set_bit for a flag in the cmnd_info structure. I had noticed the rather > long comment above usb_stor_blocking_completion() in usb-storage, and > had intended to do something similar with the UAS driver. If you have > any advice about how to work around this issue, it would be appreciated. To tell the truth, usb-storage isn't a great model here. It works, but I did most of my work on it before I was fully aware of all the issues involved in synchronizing multiple threads of control. I wouldn't be at all suprised to find there are places where usb-storage is missing memory barriers or something of the sort. Also, usb-storage handles only one command at a time, so it doesn't have to face the same complexity as UAS. Since aborts are uncommon and you don't want to clutter up the fast path, it makes sense to use some form of RCU to synchronize aborts with uas_stat_cmplt(). Once that's done, the pre-reset routine merely has to abort all the commands currently in flight; the abort routine will take care of the synchronization. > > What happens when you decide to cancel a command that has some URBs > > sitting on the workqueue's queue? How do you prevent the workqueue > > from submitting them? > > We take the workqueue lock and remove them from the work queue. If the > workqueue has already re-submitted the URBs and removed them from its > own workqueue, then we'll just wait on them with > usb_kill_anchored_urbs(). > > But it's not so simple in the reset case. We want to stop the work > queue completely and kill any pending URBs, so we have to sync with all > the functions that might schedule the work queue, and prevent any new > URBs from being submitted. But the workqueue gets scheduled from only a few places, right? So once you have prevented new commands from reaching those places (primarily the queuecommand routine -- I don't think commands can come from anywhere else), then all the workqueue needs is the usual abort processing. In other words, once the synchronization for abort is done properly, it turns out that pre-reset needs only to synchronize with queuecommand. And that requires just the host lock. > > > It also isn't taken in the > > > status phase handler (uas_stat_cmplt) that resubmits the second pair of > > > URBs for USB 2.0 UAS devices. I'm not sure if the SCSI host lock can be > > > taken in interrupt context? > > > > It can. However, from the comments above, it's not clear that > > uas_stat_cmplt needs to do anything special. > > It can schedule the workqueue, which is why it needs to sync with > uas_prereset. No, as explained above, uas_stat_cmplt needs to synchronize with the abort handler. Alan Stern -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html