Alan Stern <stern@...> writes:

>
> On Mon, 11 Aug 2014, Alexei P wrote:
>
> > I believe the focus of this thread has gone in wrong direction. The question
> > should not be about how to disable the flood of "detected XactErr" messages,
> > but (a) why does it occur in this particular case, and (b) why there are so
> > many of them.
> > My opinion here is:
> >
> > (a) The fact of appearance of "detected XactErr" message indicates some
> > BRUTAL CONDITION with communication pipe on the particular USB port.
>
> Usually it means that the computer did not receive a reply packet in
> response to some packet it sent to the device. This could be because
> noise disrupted either the packet sent to the device or the reply
> packet, or because the device has crashed and isn't sending anything.
>
> > This
> > message is generated after the host hardware was unable to receive proper
> > protocol response from a device after THREE BACK-To-BACK ATTEMPTS (assuming
> > CERR in EHCI.c is set to 3). This means that this situation is not some
> > accidental signal glitch but a fatal condition on the pipe.
>
> Not true. People have observed conditions where the transaction errors
> occurred multiple times in a row (more than three) and then went away.
> There are examples hidden away in the email archives of such
> occurrences. You may be able to find some if you read through the
> email exchanges leading up to the commit that introduced the
> multiple-retry mechanism.
>
> > In other words,
> > the USB device has likely lost its configuration, or went dead. Therefore,
> > the host should not re-try this low-level transaction, and rather resort to
> > some higher-level recovery procedure (port reset and re-enumeration). Thus,
> > we are coming to (b):
>
> Since the initial assumption above is wrong, this conclusion is
> invalid.
>
> > (b) Instead of switching to recovery, the Linux USB driver attempts 32
> > additional re-tries.
> > As explained in (a), these retries serve no purpose,
> > except they generate really alarming debug logs that would be impossible to
> > miss.
>
> They do serve a purpose. Sometimes they are able to re-establish
> communication.
>
> > Sorry to reviving 2-years-old thread. My problem with Linux USB stack is why
> > it is doing extra 32 attempts to a dead link. What is the rationale behind
> > this 32-times "recovery policy"?
>
> The number 32 was picked more or less arbitrarily. Experience has
> shown, however, that 3 is definitely too small.
>
> Alan Stern
>

So, you say that protocol errors are due either to "noise" or to a
"device crash". If it is a device crash, no amount of transaction
retries will lead to recovery. That leaves noise.

As for noise, Section 4.5.1 "Error Detection" of the USB specification
indicates that "any glitches will very likely be transient in nature"
and that the error rate of the medium should "be close to that of a
backplane". Standard communication noise theory would suggest that if a
link is unable to recover in three consecutive attempts, its probability
of recovering must be close to zero. That is why the USB designers
selected the number 3.

When you say that "communication can be re-established", you are, in
fact, referring to marginal cases, where a broken cable (or wrong
out-of-standard device/host termination, or an unstable clock, or flaky
power, or an internal hardware bug, or a poor firmware implementation of
the USB protocol engine in a device, etc.) causes flaky link behavior
that results in a multitude of protocol errors. By attempting a whopping
96 transactions (32 software retries of 3 hardware attempts each), you
hope to eventually receive the right response. As a result, the 32x3
retry policy only hides the real problem behind the XactErr on a
particular questionable link.

So, this is not my "wrong initial assumption"; it is a question of
software philosophy. Sooner or later the flaky marginal link will
degrade into a totally failing link. What would you suggest then?
To increase the number of retries to 255? To insert mdelay(50) between
attempts? I think this software philosophy is counter-productive.

I think that the USB host software should be able to identify bad, flaky
links and provide the user (system administrator) with information about
link quality. With the general movement towards faster buses (and
correspondingly noisier links: USB3, 3.1 and beyond), the host driver
should be able to keep statistics of retries, if any, and report them
upon special request, just as DEC computers did long before the PC era.

What would you think about the idea of XactErr performance counters?

Regards,
-Alexei
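
To make the counter idea concrete, here is a minimal user-space sketch
in C of what per-port XactErr statistics could look like. It is not a
patch against ehci-hcd: struct xacterr_stats and the xacterr_* functions
are hypothetical names invented for illustration, and the choice of
counters is an assumption. A real driver would presumably update such
counters in its retry path and expose them through something like
debugfs or sysfs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical per-port statistics; not an existing kernel structure. */
struct xacterr_stats {
    uint64_t transfers;          /* transfers completed or abandoned */
    uint64_t clean;              /* succeeded without any retry */
    uint64_t recovered;          /* succeeded after at least one retry */
    uint64_t failed;             /* abandoned after retries ran out */
    uint64_t retries;            /* total retries across all transfers */
    unsigned int worst_retries;  /* most retries used by one transfer */
};

/* Record the outcome of one transfer and the retries it consumed. */
void xacterr_record(struct xacterr_stats *s, unsigned int retries, bool ok)
{
    assert(s != NULL);
    s->transfers++;
    s->retries += retries;
    if (retries > s->worst_retries)
        s->worst_retries = retries;
    if (!ok)
        s->failed++;
    else if (retries > 0)
        s->recovered++;
    else
        s->clean++;
}

/* Transfers that needed at least one retry, per million transfers. */
uint64_t xacterr_troubled_ppm(const struct xacterr_stats *s)
{
    assert(s != NULL);
    if (s->transfers == 0)
        return 0;
    return (s->recovered + s->failed) * 1000000u / s->transfers;
}

/* Print a one-line summary, as a statistics file might on request. */
void xacterr_report(const struct xacterr_stats *s, const char *port)
{
    assert(s != NULL && port != NULL);
    printf("%s: transfers=%llu clean=%llu recovered=%llu failed=%llu "
           "retries=%llu worst=%u troubled_ppm=%llu\n",
           port,
           (unsigned long long)s->transfers,
           (unsigned long long)s->clean,
           (unsigned long long)s->recovered,
           (unsigned long long)s->failed,
           (unsigned long long)s->retries,
           s->worst_retries,
           (unsigned long long)xacterr_troubled_ppm(s));
}
```

For example, after recording four transfers that used 0, 5, 32 (and
then failed) and 0 retries, the summary would show two clean transfers,
one recovered, one failed and 37 retries in total. Numbers like these
would let an administrator spot a marginal link long before it degrades
into a totally failing one.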