Bjørn Mork <bjorn@xxxxxxx> writes: > Oliver Neukum <oneukum@xxxxxxxx> writes: >> On Mon, 2016-07-04 at 13:54 +0200, Bjørn Mork wrote: >>> Oliver Neukum <oneukum@xxxxxxxx> writes: >>> > >>> > If you go for that approack you also need to do it >>> > in resume() >>> >>> Yes, I wondered about that... But I didn't add it because I am unable >>> to provoke the problem there, so I cannot really test if it has any >>> positive effect. Do you still want it? >> >> Yes. I think the window is small, but we don't want strange >> irreproducible flukes. > > OK. Will post a v3 with the same code in resume once I've done some > basic testing. This isn't going very well. First, this whole thing is a hack for something that should not be an issue in the first place. Firmware should never queue data without sending a notification. The problem does not affect any CDC WDM device, and I don't think it is a common problem for QMI devices either. I have mostly observed it with MBIM. But there it is very real and extremely annoying. I can easily provoke it with two different firmware bases (from Intel and Qualcomm). The proposed driver solution is a hack, allowing us to resyncronize without resetting the firmware. We have no reliable way to reset the firmware and doing so will have large side effects anyway. The hack allows us to continue, with a possibly lost message or two as the only consequences. We don't have to break any existing session etc. But the hack is far from perfect. We race against the firmware every time it runs. We risk reading data which the firmware has just prepared and are sending us a notification for, only to get no data when we respond to the notification. This is no problem for any CDC WDM device. One of the reads will turn up emtpy, but it doesn't matter if that's the one before or after the notification. The spec explicitly allows empty reads. But it turns out that it is a real problem for the very same MBIM firmwares which caused us to need the hack in the first place: They stall when you try to read and there is no data. Which means that a lost race can end up with an EPIPE being returned to _read(), where it is translated and returned to userspace. We don't want that. I consider it an acceptable risk for every open(), given that I have no better way to do this resyncronization. But it is not acceptable that it can happen on every autosuspend sequence, which is the consequence of adding the hack to resume. I've now been running a few hours with it, and the result is occasional unexpected read errors after resuming. This is a mess. I fully agree that it looks like this either should go into both open() and resume(), or none. The real answer is 'none'. But we need the hack for the yucky MBIM firmwares. Putting it in open() is defining open() as a syncronization point for something which should not need syncronization in a perfect world. I can expand the comment there a bit to make this clear. Making resume() a syncronization point as well is only making things worse. Bjørn -- To unsubscribe from this list: send the line "unsubscribe stable" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html