Re: [PATCH v2] cdc-wdm: fix "out-of-sync" due to missing notifications

Bjørn Mork <bjorn@xxxxxxx> · Mon, 04 Jul 2016 19:01:07 +0200

Bjørn Mork <bjorn@xxxxxxx> writes:
> Oliver Neukum <oneukum@xxxxxxxx> writes:
>> On Mon, 2016-07-04 at 13:54 +0200, Bjørn Mork wrote:
>>> Oliver Neukum <oneukum@xxxxxxxx> writes:
>>> 
>
>>> > If you go for that approack you also need to do it
>>> > in resume()
>>> 
>>> Yes, I wondered about that...  But I didn't add it because I am unable
>>> to provoke the problem there, so I cannot really test if it has any
>>> positive effect.  Do you still want it?
>>
>> Yes. I think the window is small, but we don't want strange
>> irreproducible flukes.
>
> OK.  Will post a v3 with the same code in resume once I've done some
> basic testing.

This isn't going very well.

First, this whole thing is a hack for something that should not be an
issue in the first place. Firmware should never queue data without
sending a notification.  The problem does not affect any CDC WDM device,
and I don't think it is a common problem for QMI devices either.  I have
mostly observed it with MBIM.  But there it is very real and extremely
annoying.  I can easily provoke it with two different firmware bases
(from Intel and Qualcomm).

The proposed driver solution is a hack, allowing us to resyncronize
without resetting the firmware. We have no reliable way to reset the
firmware and doing so will have large side effects anyway.  The hack
allows us to continue, with a possibly lost message or two as the only
consequences.  We don't have to break any existing session etc.

But the hack is far from perfect. We race against the firmware every
time it runs.  We risk reading data which the firmware has just prepared
and are sending us a notification for, only to get no data when we
respond to the notification.  This is no problem for any CDC WDM
device.  One of the reads will turn up emtpy, but it doesn't matter if
that's the one before or after the notification.  The spec explicitly
allows empty reads.

But it turns out that it is a real problem for the very same MBIM
firmwares which caused us to need the hack in the first place: They
stall when you try to read and there is no data.  Which means that a
lost race can end up with an EPIPE being returned to _read(), where it
is translated and returned to userspace.

We don't want that.  I consider it an acceptable risk for every open(),
given that I have no better way to do this resyncronization.  But it is
not acceptable that it can happen on every autosuspend sequence, which
is the consequence of adding the hack to resume.  I've now been running
a few hours with it, and the result is occasional unexpected read errors
after resuming.

This is a mess.  I fully agree that it looks like this either should go
into both open() and resume(), or none.  The real answer is 'none'.

But we need the hack for the yucky MBIM firmwares.  Putting it in open()
is defining open() as a syncronization point for something which should
not need syncronization in a perfect world.  I can expand the comment
there a bit to make this clear. Making resume() a syncronization point
as well is only making things worse.

Bjørn

--
To unsubscribe from this list: send the line "unsubscribe stable" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html