Hi, On 3/15/23 15:47, Ilpo Järvinen wrote: > On Tue, 14 Mar 2023, Hans de Goede wrote: >> On 3/14/23 17:55, Ilpo Järvinen wrote: >>> On Tue, 14 Mar 2023, Hans de Goede wrote: >>>> On 3/14/23 12:48, Ilpo Järvinen wrote: >>>>> On Tue, 14 Mar 2023, Hans de Goede wrote: >>>>>> On 3/14/23 11:55, Ilpo Järvinen wrote: >>>>>>> On Tue, 14 Mar 2023, Hans de Goede wrote: >>>>>>> >>>>>>>> I have spend the last couple of days debugging a problem with Bluetooth >>>>>>>> adapters (HCIs) connected over an UART connection on Intel Bay Trail >>>>>>>> and Cherry Trail devices. >>>>>>>> >>>>>>>> After much debugging I found out that sometimes the first byte of >>>>>>>> a received packet (data[0]) would be overwritten with a 0 byte. >>>>>>>> >>>>>>>> E.g. this packet received during init of a BCM4324B3 (1) Bluetooth HCI: >>>>>>>> >>>>>>>> 04 0e 0a 01 79 fc 00 54 fe ff ff 00 00 >>>>>>>> >>>>>>>> would sometimes turn into: >>>>>>>> >>>>>>>> 00 0e 0a 01 79 fc 00 54 fe ff ff 00 00 >>>>>>>> >>>>>>>> Further investigation revealed that this goes away if I stop >>>>>>>> the dw_dmac module from loading, leading to: >>>>>>>> "dw-apb-uart 80860F0A:00: failed to request DMA" >>>>>>>> and the UART working without DMA support. >>>>>>>> >>>>>>>> Testing various kernels showed that this problem was introduced >>>>>>>> in v5.19, v5.18 - v5.18.19 are fine. An a git bisect points to: >>>>>>>> >>>>>>>> e8ffbb71f783 ("serial: 8250: use THRE & __stop_tx also with DMA") >>>>>>>> >>>>>>>> And reverting that on top of v6.3-rc2 indeed solves the problem. >>>>>>> >>>>>>>> So it seems that that commit somehow interferes with DMA based >>>>>> >>>>>>> Maybe the the extra interrupt that the tx side change will trigger somehow >>>>>>> causes the confusion on the rx side. So that would be an extra call into >>>>>>> handle_rx_dma() that could either do an extra flush or start DMA Rx that >>>>>>> would not occur w/o that tx side change. >>>>>> >>>>>> That sounds like a likely candidate for causing this yes, as said >>>>>> I'm unfamiliar with the serial-port code, but I did already suspect >>>>>> that the change was causing some extra interrupt which somehow >>>>>> interfered with the rx side. >>>>>> >>>>>>>> The issue has been seen on and the revert has been tested on >>>>>>>> the following HW: >>>>>>>> >>>>>>>> Asus T100TA >>>>>>>> SoC: Bay Trail UART: 80860F0A WIFI: brcmfmac43241b4-sdio BT: BCM4324B3 >>>>>>>> >>>>>>>> Lenovo Yoga Tablet 2 1051L >>>>>>>> SoC: Bay Trail UART: 80860F0A WIFI: brcmfmac43241b4-sdio BT: BCM4324B3 >>>>>>>> >>>>>>>> Lenovo Yoga Book X91F >>>>>>>> Soc: Cherry Trail UART: 8086228A WIFI: brcmfmac4356-pcie BT: BCM4356A2 >>>>>>>> >>>>>>>> I have more hw which I believe is affected but these are the models >>>>>>>> where I have done detailed testing. >>>>>>>> >>>>>>>> I would be happy to test any patches, or run a kernel with some extra >>>>>>>> debugging added, just let me know what you need to help fixing this. > >>> From: Ilpo Järvinen <ilpo.jarvinen@xxxxxxxxxxxxxxx> >>> Subject: [PATCH] serial: 8250: Prevent starting up DMA Rx on THRI interrupt >>> >>> Hans de Goede reported Bluetooth adapters (HCIs) connected over an UART >>> connection failed due corrupted Rx payload. The problem was narrowed >>> down to DMA Rx starting on UART_IIR_THRI interrupt. The problem occurs >>> despite LSR having DR bit set, which is precondition for attempting to >>> start DMA Rx. >>> >>> This problem became apparent after e8ffbb71f783 ("serial: 8250: use >>> THRE & __stop_tx also with DMA") changed Tx stopping to get triggered >>> using THRI interrupt. >>> >>> Don't setup DMA Rx on UART_IIR_THRI but leave it to a subsequent >>> interrupt which has Rx related IIR value. >>> >>> Reported-by: Hans de Goede <hdegoede@xxxxxxxxxx> >>> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@xxxxxxxxxxxxxxx> >> >> I can confirm that this fixes things for me: >> >> Tested-by: Hans de Goede <hdegoede@xxxxxxxxxx> > > Okay, thanks for testing. > > Here's is a slightly improved debug patch which will count how many > characters are received by DMA and non-DMA rx. It should be tested > WITHOUT the fix. > > I'm mostly interested in knowing if it's purely DMA Rx issue or whether > the non-DMA rx is muddling the waters too. While investigating other > issues I've seen UART_IIR_TIMEOUT (inter-character timeout) to often come > so early from UART that the tail of characters is left there to be handled > by the non-DMA path. Ok, here is a new log: https://fedorapeople.org/~jwrdegoede/dmesg-8250-dma-issue-2 With one successful hci_uart probe and one failed. The failed one looks interesting: [ 576.808776] 8250irq: iir=cc lsr+saved=60 received=1/12 ier=0d dma_t/rx/err=0/1/0 cnt=-4 [ 576.808870] Bluetooth: hci0: Frame reassembly failed (-84) This is the only time that debug message shows anything but 0 for data received over dma. It looks like for all these small packets dma simply never kicks in and the extra IRQ from the THRI interrupt somehow starts a DMA transfer for the one byte and for some reason that DMA transfer always reads the byte as 0. Note that the amount of received bytes is still correct, so the 1 byte transfer by DMA replaces one byte which would be otherwise read from MMIO (I guess its MMIO?), but it has the wrong (all bits 0) content. Regards, Hans