Re: issues with commit c1da59dad0eb ("cdc-wdm: Clear read pipeline in case of error")

Guenter Roeck <groeck@xxxxxxxxxx> · Fri, 21 Apr 2017 21:59:19 -0700

On Thu, Apr 20, 2017 at 2:18 PM, Robert Foss <robert.foss@xxxxxxxxxxxxx> wrote:
> Hi Björn,
>
> Thanks for the thorough and explicit feedback, it was rather helpful.
>
>
> On 2017-04-20 04:32 AM, Bjørn Mork wrote:
>>
>> Hello Robert,
>>
>> Sorry for being much too late here, but during recent attemts to debug
>> issues caused by my commit 833415a3e781 ("cdc-wdm: fix "out-of-sync" due
>> to missing notifications") I believe I found a couple of issues with
>> commit c1da59dad0eb. At least one of them is serious (potentional
>> GPF_KERNEL allocation in interrupt context).
>>
>> But I don't know if I understand the problem the commit set out to
>> solves, which is why I have no proposed patch here.
>
>
>
> The description of the situation that I received was then one
> from the commit message:
>
> "In case of a read error, userspace may not actually read the data, since
> the
> driver returns an error through wdm_poll. After this, the underlying device
> may
> attempt to send us more data, but the queue is not processed. While
> userspace is
> also blocked, because the read error is never cleared."
>
> So the way I interpret it is that after an error condition the device can
> get prevented from sending data, which in turn can cause userspace to not
> issue a read (which would have cleared the error).
>
> I could certainly be wrong in this interpretation or be misunderstanding
> something.
>
> Prathmesh, Guenter: Do you have any comments?
>

I would not want to claim to fully understand the problem, but from
Bjørn's comments it really looks like the patch should be reverted.

Thanks,
Guenter

>>
>>
>>> @@ -189,7 +192,13 @@ static void wdm_in_callback(struct urb *urb)
>>>                 }
>>>         }
>>>
>>> -       desc->rerr = status;
>>> +       /*
>>> +        * only set a new error if there is no previous error.
>>> +        * Errors are only cleared during read/open
>>> +        */
>>> +       if (desc->rerr  == 0)
>>> +               desc->rerr = status;
>>> +
>>>         if (length + desc->length > desc->wMaxCommand) {
>>>                 /* The buffer would overflow */
>>>                 set_bit(WDM_OVERFLOW, &desc->flags);
>>>
>>
>>
>> This is minor, but in my opinion this change is flawed.  It changes the
>> meaning of "desc->rerr" from the original "last status code" to the new
>> "first error since last userspace read".
>>
>> wdm_read will only return and clear such errors if desc->length == 0 on
>> entry.  This is up to the userspace application to decide. I claim that
>> any error should either be reported immediately or not at all.  Caching
>> errors forever like this, and then reporting them to userspace at some
>> arbitrary later event, is pointless.
>>
>> The existing code did the correct thing in the absence of proper event
>> queing.  The only real alternative would be to completely stop
>> processing device events until userspace has seen the error.  But that
>> would just make things fail completely for no good reason.
>>
>> We should probably consider making a queue/list of the read buffers and
>> keep each status code with the associated list element.  Then we could
>> report back the error when userspace reads the element that caused it.
>
>
> I see your point about changing the semantics of desc->rerr, and agree that
> it probably isn't desirable.
>
> So an event queue is starting to look like the way forward.
>
>>
>>> @@ -221,9 +230,19 @@ static void wdm_in_callback(struct urb *urb)
>>>         }
>>>
>>>  skip_error:
>>> +       set_bit(WDM_READ, &desc->flags);
>>>         wake_up(&desc->wait);
>>>
>>> -       set_bit(WDM_READ, &desc->flags);
>>> +       if (desc->rerr) {
>>> +               /*
>>> +                * Since there was an error, userspace may decide to not
>>> read
>>> +                * any data after poll'ing.
>>> +                * We should respond to further attempts from the device
>>> to send
>>> +                * data, so that we can get unstuck.
>>> +                */
>>> +               service_outstanding_interrupt(desc);
>>> +       }
>>> +
>>>  unlock:
>>>         spin_unlock(&desc->iuspin);
>>>   }
>>
>>
>> We cannot do this.  This is an URB callback, and
>> service_outstanding_interrupt()
>> calls  usb_submit_urb(...,GFP_KERNEL).
>>
>> This issue could of course be avoided by simply changing the allocation
>> flags. But looking at the comment here, I am pretty sure that this is
>> the wrong solution to the unanticipated userspace behaviour.  If
>> userspace decides not to read() if poll() sets POLLERR, then it *should
>> not* expect the error condition to be cleared. Or am I missing something?
>
>
> The allocation flags aside, calling poll() and it returning POLLERR should
> not clear the error, so the behavior of the
>
>>
>> In any case: This will not unstick anything, given the previous problem
>> with desc->rerr not being updated unless userspace reads.  You will just
>> flush the device queue, but the error condition is still there. So
>> poll() will still set POLLERR.  If the comment is correct, this is a
>> permanent deadlock.
>>
>> userspace can avoid tge issue by reading the descriptor on POLLERR. If
>> that is not the way this is expected to work (I am by no means a POSIX
>> expert), then I believe the correct fix must be to change wdm_poll(),
>> either to clear the error condition or to not set POLLERR in this case.
>
>
> So, in lieu of an actual event queue, maybe this patch should be reverted.
> Thoughts?
>
>
> Rob.
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html