Re: detected XactErr

Alexei P <alexi0800@xxxxxxxxx> · Mon, 11 Aug 2014 15:54:26 +0000 (UTC)

Alan Stern <stern@...> writes:

> 
> On Wed, 29 Aug 2012, Gary E. Miller wrote:
> 
> > Yo Alan!
> > 
> > On Wed, 29 Aug 2012 17:08:43 -0400 (EDT)
> > Alan Stern <stern@...> wrote:
> > 
> > > > Uh, not easy.  This is a production machine.  Acceptable downtime is
> > > > very small.  There should be a way to turn it off at runtime.
> > > 
> > > No, there shouldn't.  It's a debugging feature; while debugging we 
> > > want to see all occurrences of these messages.
> > > 
> > > Production machines should not run debugging kernels.
> > 
> > We'll have to agree to disagree.  My experience is the fun bugs only happen
> > in production.  A big NASA study years ago proved that after spending $1M
> > per line of code that only half the bugs could be caught before live
> > missions.
> 
> Well, put it this way: Production machines should not run debugging 
> kernels unless you're actually trying to debug something.  For normal 
> operation, a debugging kernel should not be used.
> 
> Remember, the purpose of a debugging kernel is not to provoke bugs or
> notify you that the bugs exist.  Normal kernels do these things just
> fine.  The purpose is to help find the cause of bugs.
> 
> Alan Stern
> 
> 

I believe this thread has gone in the wrong direction. The question should
not be how to disable the flood of "detected XactErr" messages, but (a) why
the error occurs in this particular case, and (b) why there are so many of
these messages.
My opinion is:

(a) A "detected XactErr" message indicates a SERIOUS PROBLEM with the
communication pipe on that particular USB port. The message is generated
after the host hardware has failed to receive a proper protocol response
from the device in THREE BACK-TO-BACK ATTEMPTS (assuming CERR in EHCI.c is
set to 3). So this is not an accidental signal glitch but a fatal condition
on the pipe. In other words, the USB device has likely lost its
configuration or gone dead. Therefore the host should not retry this
low-level transaction, but should instead resort to a higher-level recovery
procedure (port reset and re-enumeration). This brings us to (b):

(b) Instead of switching to recovery, the Linux USB driver attempts 32
additional retries. As explained in (a), these retries serve no purpose
other than to generate alarming debug logs that are impossible to miss.
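
To put a number on (a) and (b) together, here is a rough model of the
two-level retry behaviour described above. It is only a sketch, not the
driver's code: the constant and function names are hypothetical, the values
are the ones quoted in this message (3 hardware attempts, 32 driver
retries), and the device is assumed to be completely dead. The exact
boundary condition in the real driver may differ by one.

```c
#include <stdbool.h>

/* Hypothetical names; values are the figures quoted in this message. */
#define HW_CERR            3   /* back-to-back attempts made by the controller */
#define SW_XACTERR_RETRIES 32  /* additional retries made by the driver */

/* Stand-in for a device that never answers correctly (a dead link). */
static bool transaction_ok(void)
{
    return false;
}

/*
 * One hardware round: the controller repeats the transaction until its
 * error counter reaches zero. Returns true on success and adds every
 * attempt to *attempts. (This model does not cover a counter programmed
 * as 0.)
 */
static bool hw_round(unsigned cerr, unsigned *attempts)
{
    while (cerr > 0) {
        (*attempts)++;
        if (transaction_ok())
            return true;
        cerr--;
    }
    return false;   /* the point at which XactErr would be reported */
}

/* Total transactions issued before the driver gives up on the link. */
static unsigned attempts_before_giving_up(unsigned cerr, unsigned sw_retries)
{
    unsigned attempts = 0;
    unsigned xacterrs = 0;

    for (;;) {
        if (hw_round(cerr, &attempts))
            return attempts;        /* never taken on a dead link */
        if (xacterrs == sw_retries)
            break;                  /* retry budget exhausted */
        xacterrs++;                 /* one "detected XactErr" line per retry */
    }
    return attempts;
}
```

Under these assumptions, attempts_before_giving_up(HW_CERR,
SW_XACTERR_RETRIES) comes to 99: the initial round of 3 plus 32 retry rounds
of 3 each, with one log line per retry round.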

Sorry for reviving a two-year-old thread. My question about the Linux USB
stack is why it makes 32 extra attempts on a dead link. What is the
rationale behind this 32-retry "recovery policy"?

Thanks,
Alexei
