Re: [PATCH] cdc-wdm: fix "out-of-sync" due to missing notifications

Bjørn Mork <bjorn@xxxxxxx> · Wed, 18 May 2016 01:39:03 +0200

Oliver Neukum <oneukum@xxxxxxxx> writes:
> On Tue, 2016-05-17 at 21:24 +0200, Bjørn Mork wrote:
>> Oliver Neukum <oneukum@xxxxxxxx> writes:
>> 
>> > On Fri, 2016-05-13 at 18:59 +0200, Bjørn Mork wrote:
>> >> Bjørn Mork <bjorn@xxxxxxx> writes:
>> >> 
>> >> > The driver enforces a strict one-to-one relationship between the
>> >> > received RESPONSE_AVAILABLE notifications and messages read from
>> >> > the device. At the same time, it will cancel the interrupt URB
>> >> > when there is no client holding the character device open.
>> >> 
>> >> Never mind.  Forget it.
>> >> 
>> >> This patch breaks other devices again.  The immediate and unconditional
>> >> reading make them barf. I guess it can be worked around by delaying the
>> >> flushing until at least one notification is received, but I obviously
>> >> have to test this theory thoroughly on all devices I have.
>> >
>> > Hi,
>> >
>> > I think the best approach would be to keep the interrupt URB always
>> > active. I didn't do this to conserve bandwidth, but if it makes devices
>> > work, it certainly would be the best option.
>> 
>> Yes, I considered that.  But this implies purging the device message
>> queue without telling userspace that we did so.  At least with the
>> current driver design, which is based on a single limited size
>> buffer. If the device queues a number of unsolicited messages between
>> two userspace requests, then we really want all those unsolicited
>> messages delivered to the userspace program on the second request.
>
> You might argue that if user space wants the data it should open the
> device.

Maybe.  It's a variant of the current situation, where userspace must
not close the device while a session is in progress.

The issue here is that userspace (and the driver) knows nothing about
what kind of messages the device decides to send, or when.  So how can
userspace know that it wants the data?  It can't.  It has to keep the
device open just in case there is something interesting happening.

This is not the kind of semantics I'd like to present to any userspace
developer.  We present a character device as an abstraction of a
hardware device. I believe a reasonable assumption from a userspace
developer is that the driver forwards all messages it reads from the
hardware to the character device.  So either we don't read from hardware
when the character device is closed, or we cache everything we read
until the character device is open.

>> And I do think the original bandwidth (and power) conservative approach
>> is worth keeping too.  There is no point in waking up these devices
>> unless there actually is an interested userspace application.
>
> They can sleep just fine. I did not imply that runtime PM should
> be disabled.

Yes, which means that we cancel the URBs..  I haven't been able to
reproduce it yet, but I think we might occasionally miss a notification
during suspend/resume too. But this is timing sensitive, and device
timing sensitive, so it's difficult to trigger on purpose.

For now I've ignored it.  But I wouldn't be surprised if we end up
having to do the same "flush queue" exercise on every resume too.

>> FWIW, my initial analysis of the problem with the patch was too quick
>> and imprecise. The problem is simply the -EPIPE status we inevitably will
>> hit when the queue is empty, as I should have anticipated. It will be
>> returned to userspace translated to -EIO.  I am currently testing a
>> version taking care of that, and it seems to behave well so far. I'll
>> submit it as soon as I am absolutely sure that it works on all WDM, QMI
>> and MBIM devices I have.  Might take some time, since I am running out
>> of mini-PCIe and m.2 adapters..
>
> That looks a bit risky. Firstly, if you get -EPIPE after a notification
> it is an error and must be reported as such, so you need an additional
> state.

Yes, -EPIPE should be reported if it occurs later when polling after a
notification.  But no additional state is needed.  That info is already
available.
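
To make the distinction concrete, here is a minimal sketch, in Python
for illustration only and not taken from the patch.  The function name
and the pending_notifications counter are hypothetical; the sketch
assumes the "already available" info is a count of notifications not
yet matched by a read.

```python
import errno


def classify_read_status(status: int, pending_notifications: int) -> int:
    """Decide what a GetEncapsulatedResponse read status means for userspace.

    status: 0 on success, or a negative errno (-errno.EPIPE for a stall).
    pending_notifications: hypothetical count of RESPONSE_AVAILABLE
        notifications that have not yet been matched by a read.
    Returns 0 when there is nothing to report, else a negative errno.
    """
    if status == -errno.EPIPE:
        if pending_notifications > 0:
            # The device announced data and then stalled: a real error,
            # reported to userspace translated to -EIO.
            return -errno.EIO
        # Speculative flush read that found the queue empty: not an error.
        return 0
    # Success and every other error pass through unchanged.
    return status
```

With this split, a stall seen while flushing without a notification is
silently dropped, while a stall after a notification still ends up as
-EIO in userspace.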

> And what do you do after -EPIPE? Do you clean up the stall
> or not? And the fun really starts if you get a notification while
> you clean the stall.

No cleanup necessary/possible AFAICS:  This is endpoint 0.

> And are you sure all devices can cope with an unsolicited request?

Nope. I am not sure about anything when it comes to USB device firmware
;)

Broad testing is definitely necessary.  But realistically: How can it
possibly fail in other ways than returning 0 data bytes or stalling?

Wait... Don't answer that.  Yes, I know.  Some device will do something
completely wild.  I'm just not sure that it is worth caring about...
The CDC spec isn't exactly clear, but I don't see any restrictions on
the use of GetEncapsulatedResponse there.  On the contrary.  There are
several examples in the spec referring to the case where the device has
no data.  There is nothing identifying this as an error.

AFAICS, the spec allows a strictly polling CDC WDM driver, sending
periodic GetEncapsulatedResponse requests. You don't need to use the
interrupt endpoint if you don't want to.
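
For reference, the request such a poller would send is an ordinary
class-specific control read on endpoint 0.  A minimal sketch of the
8-byte setup packet, in Python for illustration only: the function
name is hypothetical, and the interface number and buffer size in the
usage note are example values, not taken from any driver.

```python
import struct

# Standard USB setup-packet field values.
USB_DIR_IN = 0x80
USB_TYPE_CLASS = 0x20
USB_RECIP_INTERFACE = 0x01
USB_CDC_GET_ENCAPSULATED_RESPONSE = 0x01


def get_encapsulated_response_setup(interface: int, w_length: int) -> bytes:
    """Build the setup packet for GetEncapsulatedResponse.

    Layout (little-endian): bmRequestType, bRequest, wValue, wIndex, wLength.
    wValue is zero, wIndex is the interface number and wLength is the
    maximum number of bytes the host is prepared to read.
    """
    bm_request_type = USB_DIR_IN | USB_TYPE_CLASS | USB_RECIP_INTERFACE
    return struct.pack(
        "<BBHHH",
        bm_request_type,
        USB_CDC_GET_ENCAPSULATED_RESPONSE,
        0,
        interface,
        w_length,
    )
```

For interface 0 and a 4096 byte buffer this gives the bytes
a1 01 00 00 00 00 00 10.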

But the set of specs involved are confusing enough to ensure all sorts
of firmware assumptions.  The GetEncapsulatedResponse request is defined
in USBCDC1.2 without any semantics at all.  This is fixed in the
CDCWMC1.1 spec, which defines the WDM class among other things. It goes
into detail in section 7:

 "The firmware shall interpret GetEncapsulatedResponse as a request to
  read response bytes. The firmware shall send the next wLength bytes
  from the response. The firmware shall allow the host to retrieve data
  using any number of GetEncapsulatedResponse requests. The firmware
  shall return a zero-length reply if there are no data bytes
  available.

  The firmware shall send ResponseAvailable notifications periodically,
  using any appropriate algorithm, to inform the host that there is data
  available in the reply buffer. The firmware is allowed to send
  ResponseAvailable notifications even if there is no data available,
  but this will obviously reduce overall performance."

and also

 "The function shall not return STALL in response to
  GetEncapsulatedResponse."
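
Read literally, those rules describe firmware that can be polled until
it runs dry.  A toy model, in Python for illustration only: the class
and function names are hypothetical, and keeping each queued response
separate (a request never spans two responses) is an assumption of the
sketch, not something the quoted text states.

```python
from collections import deque


class ModelFirmware:
    """Toy model of the CDCWMC1.1 GetEncapsulatedResponse rules quoted above."""

    def __init__(self):
        self._responses = deque()

    def queue_response(self, data: bytes) -> None:
        """Firmware side: make one response available to the host."""
        self._responses.append(bytes(data))

    def get_encapsulated_response(self, w_length: int) -> bytes:
        """Return the next w_length bytes, or a zero-length reply if empty.

        Never stalls, matching "shall not return STALL".
        """
        if not self._responses:
            return b""
        current = self._responses[0]
        chunk, rest = current[:w_length], current[w_length:]
        if rest:
            self._responses[0] = rest   # remainder stays for the next request
        else:
            self._responses.popleft()   # response fully delivered
        return chunk


def drain(firmware: ModelFirmware, w_length: int) -> list:
    """Host side: keep polling until a zero-length reply is returned."""
    chunks = []
    while True:
        chunk = firmware.get_encapsulated_response(w_length)
        if not chunk:
            return chunks
        chunks.append(chunk)
```

A host talking to firmware like this needs no notifications at all to
empty the queue.  The devices that stall on an empty queue are the ones
that deviate from this model.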

Unfortunately, the CDCMBIM spec refers only to the USBCDC1.2 definition
with the additional MBIM specific message size restrictions.  It does
not define its own semantics and it does not refer to the CDCWMC1.1
either.  Logically I don't think anyone intended these specs to define
GetEncapsulatedResponse inconsistently. But they didn't enforce
consistency.  Anyone reading just the MBIM spec and its references will
miss the important examples and clarifying comments in CDCWMC1.1.

And when it comes to QMI devices...  Those are of course only loosely
modelled after CDC ECM, and only the Qualcomm gods know what's hidden in
there.  Could be pretty much anything.  They don't seem to care about
open specs.

Well, whatever.  None of this matters.  What matters is what's
implemented in the devices out there.  So testing, testing and testing.

A summary of what we do know so far:
- Some devices have problems with our current assumption about
  notifications (although CDCWMC1.1 supports that assumption).
- Some devices will respond with a stall if they have no data buffered
  and receive GetEncapsulatedResponse (although they should not
  according to CDCWMC1.1).

It remains to see if there are any devices which cannot cope with an
unexpected GetEncapsulatedResponse.

Bjørn