On Thu, Apr 20, 2017 at 2:18 PM, Robert Foss <robert.foss@xxxxxxxxxxxxx> wrote: > Hi Björn, > > Thanks for the thorough and explicit feedback, it was rather helpful. > > > On 2017-04-20 04:32 AM, Bjørn Mork wrote: >> >> Hello Robert, >> >> Sorry for being much too late here, but during recent attemts to debug >> issues caused by my commit 833415a3e781 ("cdc-wdm: fix "out-of-sync" due >> to missing notifications") I believe I found a couple of issues with >> commit c1da59dad0eb. At least one of them is serious (potentional >> GPF_KERNEL allocation in interrupt context). >> >> But I don't know if I understand the problem the commit set out to >> solves, which is why I have no proposed patch here. > > > > The description of the situation that I received was then one > from the commit message: > > "In case of a read error, userspace may not actually read the data, since > the > driver returns an error through wdm_poll. After this, the underlying device > may > attempt to send us more data, but the queue is not processed. While > userspace is > also blocked, because the read error is never cleared." > > So the way I interpret it is that after an error condition the device can > get prevented from sending data, which in turn can cause userspace to not > issue a read (which would have cleared the error). > > I could certainly be wrong in this interpretation or be misunderstanding > something. > > Prathmesh, Guenter: Do you have any comments? > I would not want to claim to fully understand the problem, but from Bjørn's comments it really looks like the patch should be reverted. Thanks, Guenter >> >> >>> @@ -189,7 +192,13 @@ static void wdm_in_callback(struct urb *urb) >>> } >>> } >>> >>> - desc->rerr = status; >>> + /* >>> + * only set a new error if there is no previous error. >>> + * Errors are only cleared during read/open >>> + */ >>> + if (desc->rerr == 0) >>> + desc->rerr = status; >>> + >>> if (length + desc->length > desc->wMaxCommand) { >>> /* The buffer would overflow */ >>> set_bit(WDM_OVERFLOW, &desc->flags); >>> >> >> >> This is minor, but in my opinion this change is flawed. It changes the >> meaning of "desc->rerr" from the original "last status code" to the new >> "first error since last userspace read". >> >> wdm_read will only return and clear such errors if desc->length == 0 on >> entry. This is up to the userspace application to decide. I claim that >> any error should either be reported immediately or not at all. Caching >> errors forever like this, and then reporting them to userspace at some >> arbitrary later event, is pointless. >> >> The existing code did the correct thing in the absence of proper event >> queing. The only real alternative would be to completely stop >> processing device events until userspace has seen the error. But that >> would just make things fail completely for no good reason. >> >> We should probably consider making a queue/list of the read buffers and >> keep each status code with the associated list element. Then we could >> report back the error when userspace reads the element that caused it. > > > I see your point about changing the semantics of desc->rerr, and agree that > it probably isn't desirable. > > So an event queue is starting to look like the way forward. > >> >>> @@ -221,9 +230,19 @@ static void wdm_in_callback(struct urb *urb) >>> } >>> >>> skip_error: >>> + set_bit(WDM_READ, &desc->flags); >>> wake_up(&desc->wait); >>> >>> - set_bit(WDM_READ, &desc->flags); >>> + if (desc->rerr) { >>> + /* >>> + * Since there was an error, userspace may decide to not >>> read >>> + * any data after poll'ing. >>> + * We should respond to further attempts from the device >>> to send >>> + * data, so that we can get unstuck. >>> + */ >>> + service_outstanding_interrupt(desc); >>> + } >>> + >>> unlock: >>> spin_unlock(&desc->iuspin); >>> } >> >> >> We cannot do this. This is an URB callback, and >> service_outstanding_interrupt() >> calls usb_submit_urb(...,GFP_KERNEL). >> >> This issue could of course be avoided by simply changing the allocation >> flags. But looking at the comment here, I am pretty sure that this is >> the wrong solution to the unanticipated userspace behaviour. If >> userspace decides not to read() if poll() sets POLLERR, then it *should >> not* expect the error condition to be cleared. Or am I missing something? > > > The allocation flags aside, calling poll() and it returning POLLERR should > not clear the error, so the behavior of the > >> >> In any case: This will not unstick anything, given the previous problem >> with desc->rerr not being updated unless userspace reads. You will just >> flush the device queue, but the error condition is still there. So >> poll() will still set POLLERR. If the comment is correct, this is a >> permanent deadlock. >> >> userspace can avoid tge issue by reading the descriptor on POLLERR. If >> that is not the way this is expected to work (I am by no means a POSIX >> expert), then I believe the correct fix must be to change wdm_poll(), >> either to clear the error condition or to not set POLLERR in this case. > > > So, in lieu of an actual event queue, maybe this patch should be reverted. > Thoughts? > > > Rob. -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html