Re: [PATCH] serial: 8250: Avoid "too much work" from bogus rx timeout interrupt

Andy Shevchenko <andriy.shevchenko@xxxxxxxxxxxxxxx> · Mon, 19 Dec 2016 19:33:08 +0200

On Mon, 2016-12-19 at 09:12 -0800, Doug Anderson wrote:
> Hi,
> 
> On Mon, Dec 19, 2016 at 4:59 AM, Andy Shevchenko
> <andriy.shevchenko@xxxxxxxxxxxxxxx> wrote:
> > On Sun, 2016-12-18 at 17:14 -0800, Douglas Anderson wrote:
> > > On a Rockchip rk3399-based board during suspend/resume testing, we
> > > found that we could get the console UART into a state where it
> > > would
> > > print this to the console a lot:
> > >   serial8250: too much work for irq42
> > 
> > Have you read the following discussion
> > https://www.spinics.net/lists/kernel/msg2059543.html
> 
> No, I wasn't aware of that discussion.  Yup, basically the exact same
> thing is happening here.  Good to know I'm not alone.  Any idea if the
> Baytrail UART is also based on DesignWare IP?

Yes. Almost all Intel HW is using DesignWare IP for HS UARTs.

> In that thread, Peter said:
> 
> > I think there is every likelihood of spurious RX timeout interrupts
> > tripping this patch, sorry.
> > 
> > Unfortunately, I think UART_BUG_ is the only viable possibility.
> > Or perhaps fixing the port type as PORT_8250 (thus disabling the
> > fifos).
> 
> My change is slightly different than California's in that I'm actually
> throwing away the bogus byte and his patch was treating it as a valid
> byte.  I don't know if that makes the patch more or less palatable.

We need to test, especially in DMA case.

> I would hate to lose access to the FIFOs just due to this weird corner
> case.
> 
> Do we really think there's a case where there's an RX Timeout
> interrupt w/ no "data ready" but that later the data ready will show
> up?  Can you quantify how much later you think it will show up?  If we
> can quantify how much longer the data will show up in then we should
> probably just do a timeout loop right where I added my patch.
> 
> Specifically, here's what's happening today with RX Timeout interrupt
> without "data ready":
> 
> 1. We'll get the interrupt
> 2. We won't do _anything_ to service the interrupt.
> 3. We'll return back to serial8250_interrupt(), where we'll keep
> looping until we get "too much work"
> 4. We'll break out, but the interrupt will still be active.
> 5. Go back to #1
> 
> ...and since this interrupt will keep firing and firing and firing
> with no delay in-between, we'll effectively lock the CPU up.

And the root cause of that is... ?

> If there are some UARTs that eventually get themselves out of this
> state by asserting "data ready" then the above won't be an "infinite"
> loop but it will effectively be a tight loop where we won't let
> userspace run and won't service other interrupts until we actually get
> the data ready.  Since we're already blocking everything else, it
> seems like it might be better to directly loop in
> serial8250_handle_irq() with a timeout of some sort (how long?  100
> us?  1 ms?).  Then we if we get the timeout then we can do the read
> and safely work ourselves free.

What I think is that the root cause of this is still unknown and either
above looks like a hack.

-- 
Andy Shevchenko <andriy.shevchenko@xxxxxxxxxxxxxxx>
Intel Finland Oy
--
To unsubscribe from this list: send the line "unsubscribe linux-serial" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html