Re: [Regression] "serial: 8250: use THRE & __stop_tx also with DMA" breaks DMA rx

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Thu, 16 Mar 2023, Hans de Goede wrote:
> On 3/15/23 15:47, Ilpo Järvinen wrote:
> > On Tue, 14 Mar 2023, Hans de Goede wrote:
> >> On 3/14/23 17:55, Ilpo Järvinen wrote:
> >>> On Tue, 14 Mar 2023, Hans de Goede wrote:
> >>>> On 3/14/23 12:48, Ilpo Järvinen wrote:
> >>>>> On Tue, 14 Mar 2023, Hans de Goede wrote:
> >>>>>> On 3/14/23 11:55, Ilpo Järvinen wrote:
> >>>>>>> On Tue, 14 Mar 2023, Hans de Goede wrote:
> >>>>>>>
> >>>>>>>> I have spend the last couple of days debugging a problem with Bluetooth
> >>>>>>>> adapters (HCIs) connected over an UART connection on Intel Bay Trail
> >>>>>>>> and Cherry Trail devices.
> >>>>>>>>
> >>>>>>>> After much debugging I found out that sometimes the first byte of
> >>>>>>>> a received packet (data[0]) would be overwritten with a 0 byte.
> >>>>>>>>
> >>>>>>>> E.g. this packet received during init of a BCM4324B3 (1) Bluetooth HCI:
> >>>>>>>>
> >>>>>>>> 04 0e 0a 01 79 fc 00 54 fe ff ff 00 00
> >>>>>>>>
> >>>>>>>> would sometimes turn into:
> >>>>>>>>
> >>>>>>>> 00 0e 0a 01 79 fc 00 54 fe ff ff 00 00
> >>>>>>>>
> >>>>>>>> Further investigation revealed that this goes away if I stop
> >>>>>>>> the dw_dmac module from loading, leading to:
> >>>>>>>> "dw-apb-uart 80860F0A:00: failed to request DMA"
> >>>>>>>> and the UART working without DMA support.
> >>>>>>>>
> >>>>>>>> Testing various kernels showed that this problem was introduced
> >>>>>>>> in v5.19, v5.18 - v5.18.19 are fine. An a git bisect points to:
> >>>>>>>>
> >>>>>>>> e8ffbb71f783 ("serial: 8250: use THRE & __stop_tx also with DMA")
> >>>>>>>>
> >>>>>>>> And reverting that on top of v6.3-rc2 indeed solves the problem.
> >>>>>>>
> >>>>>>>> So it seems that that commit somehow interferes with DMA based
> >>>>>>
> >>>>>>> Maybe the the extra interrupt that the tx side change will trigger somehow 
> >>>>>>> causes the confusion on the rx side. So that would be an extra call into 
> >>>>>>> handle_rx_dma() that could either do an extra flush or start DMA Rx that 
> >>>>>>> would not occur w/o that tx side change.
> >>>>>>
> >>>>>> That sounds like a likely candidate for causing this yes, as said
> >>>>>> I'm unfamiliar with the serial-port code, but I did already suspect
> >>>>>> that the change was causing some extra interrupt which somehow
> >>>>>> interfered with the rx side.
> >>>>>>
> >>>>>>>> The issue has been seen on and the revert has been tested on
> >>>>>>>> the following HW:
> >>>>>>>>
> >>>>>>>> Asus T100TA
> >>>>>>>> SoC: Bay Trail UART: 80860F0A WIFI: brcmfmac43241b4-sdio BT: BCM4324B3
> >>>>>>>>
> >>>>>>>> Lenovo Yoga Tablet 2 1051L
> >>>>>>>> SoC: Bay Trail UART: 80860F0A WIFI: brcmfmac43241b4-sdio BT: BCM4324B3
> >>>>>>>>
> >>>>>>>> Lenovo Yoga Book X91F
> >>>>>>>> Soc: Cherry Trail UART: 8086228A WIFI: brcmfmac4356-pcie BT: BCM4356A2
> >>>>>>>>
> >>>>>>>> I have more hw which I believe is affected but these are the models
> >>>>>>>> where I have done detailed testing.
> >>>>>>>>
> >>>>>>>> I would be happy to test any patches, or run a kernel with some extra
> >>>>>>>> debugging added, just let me know what you need to help fixing this.
> > 
> >>> From: Ilpo Järvinen <ilpo.jarvinen@xxxxxxxxxxxxxxx>
> >>> Subject: [PATCH] serial: 8250: Prevent starting up DMA Rx on THRI interrupt
> >>>
> >>> Hans de Goede reported Bluetooth adapters (HCIs) connected over an UART
> >>> connection failed due corrupted Rx payload. The problem was narrowed
> >>> down to DMA Rx starting on UART_IIR_THRI interrupt. The problem occurs 
> >>> despite LSR having DR bit set, which is precondition for attempting to 
> >>> start DMA Rx.
> >>>
> >>> This problem became apparent after e8ffbb71f783 ("serial: 8250: use
> >>> THRE & __stop_tx also with DMA") changed Tx stopping to get triggered
> >>> using THRI interrupt.
> >>>
> >>> Don't setup DMA Rx on UART_IIR_THRI but leave it to a subsequent
> >>> interrupt which has Rx related IIR value.
> >>>
> >>> Reported-by: Hans de Goede <hdegoede@xxxxxxxxxx>
> >>> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@xxxxxxxxxxxxxxx>
> >>
> >> I can confirm that this fixes things for me:
> >>
> >> Tested-by: Hans de Goede <hdegoede@xxxxxxxxxx>
> > 
> > Okay, thanks for testing.
> > 
> > Here's is a slightly improved debug patch which will count how many 
> > characters are received by DMA and non-DMA rx. It should be tested 
> > WITHOUT the fix.
> > 
> > I'm mostly interested in knowing if it's purely DMA Rx issue or whether 
> > the non-DMA rx is muddling the waters too. While investigating other 
> > issues I've seen UART_IIR_TIMEOUT (inter-character timeout) to often come 
> > so early from UART that the tail of characters is left there to be handled 
> > by the non-DMA path.
> 
> Ok, here is a new log:
> 
> https://fedorapeople.org/~jwrdegoede/dmesg-8250-dma-issue-2
> 
> With one successful hci_uart probe and one failed. The failed one looks interesting:
> 
> [  576.808776] 8250irq: iir=cc lsr+saved=60 received=1/12 ier=0d dma_t/rx/err=0/1/0 cnt=-4
> [  576.808870] Bluetooth: hci0: Frame reassembly failed (-84)
> 
> This is the only time that debug message shows anything but 0 for data 
> received over dma.

For the record, there were plenty of =1/6 and a few =224/31.

> It looks like for all these small packets dma simply 
> never kicks in and the extra IRQ from the THRI interrupt somehow starts 
> a DMA transfer for the one byte and for some reason that DMA transfer 
> always reads the byte as 0. 

It would be kind of interesting to know if handle_rx_dma() would never 
call up->dma->rx_dma(up) but return true instead to see if non-DMA Rx gets
a corrupted byte too in the very same scenario (it's not the same test as 
your no DMA test you did initially because only DMA Rx gets disabled so it
hopefully hits the same timing, etc. by keeping DMA Tx on).

> Note that the amount of received bytes is still correct, so the 1 byte 
> transfer by DMA replaces one byte which would be otherwise read from 
> MMIO (I guess its MMIO?), but it has the wrong (all bits 0) content.

Yes, it's certainly is a corruption which shouldn't occur.

I was contemplating whether to apply the fix to all 8250 or just add 
HW specific UART_BUG_xx for this. In the end (especially while looking 
into these most recent debug logs) I think it's generally useful to _not_ 
start DMA on UART_IIR_THRI because it's impossible to make an informed 
decision with it whether to do DMA or not.

I've just sent the cleaned up fix patch out.

Thanks a lot for testing.


-- 
 i.

[Index of Archives]     [Kernel Newbies]     [Security]     [Netfilter]     [Bugtraq]     [Linux PPP]     [Linux FS]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Samba]     [Video 4 Linux]     [Linmodem]     [Device Mapper]     [Linux Kernel for ARM]

  Powered by Linux