Re: [Regression] "serial: 8250: use THRE & __stop_tx also with DMA" breaks DMA rx

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Tue, 14 Mar 2023, Hans de Goede wrote:
> On 3/14/23 12:48, Ilpo Järvinen wrote:
> > On Tue, 14 Mar 2023, Hans de Goede wrote:
> >> On 3/14/23 11:55, Ilpo Järvinen wrote:
> >>> On Tue, 14 Mar 2023, Hans de Goede wrote:
> >>>
> >>>> I have spend the last couple of days debugging a problem with Bluetooth
> >>>> adapters (HCIs) connected over an UART connection on Intel Bay Trail
> >>>> and Cherry Trail devices.
> >>>>
> >>>> After much debugging I found out that sometimes the first byte of
> >>>> a received packet (data[0]) would be overwritten with a 0 byte.
> >>>>
> >>>> E.g. this packet received during init of a BCM4324B3 (1) Bluetooth HCI:
> >>>>
> >>>> 04 0e 0a 01 79 fc 00 54 fe ff ff 00 00
> >>>>
> >>>> would sometimes turn into:
> >>>>
> >>>> 00 0e 0a 01 79 fc 00 54 fe ff ff 00 00
> >>>>
> >>>> Further investigation revealed that this goes away if I stop
> >>>> the dw_dmac module from loading, leading to:
> >>>> "dw-apb-uart 80860F0A:00: failed to request DMA"
> >>>> and the UART working without DMA support.
> >>>>
> >>>> Testing various kernels showed that this problem was introduced
> >>>> in v5.19, v5.18 - v5.18.19 are fine. An a git bisect points to:
> >>>>
> >>>> e8ffbb71f783 ("serial: 8250: use THRE & __stop_tx also with DMA")
> >>>>
> >>>> And reverting that on top of v6.3-rc2 indeed solves the problem.
> >>>
> >>> You did something else too than just that because you cannot cleanly 
> >>> revert just e8ffbb71f783. Please indicate what happened to:
> >>>   f8d6e9d3ca5c ("serial: 8250: Fix __stop_tx() & DMA Tx restart races")
> >>>
> >>> I guess you reverted that too and forgot to mention about it but I just 
> >>> want to be sure we're on the same page?
> >>
> >> I manually fixed up the revert, effectively
> >> dropping the drivers/tty/serial/8250/8250_port.c part of f8d6e9d3ca5c
> >> I did not revert f8d6e9d3ca5c in its entirety.
> >>
> >> I've attached my manual fixed up revert as a patch here.
> >>
> >>>> So it seems that that commit somehow interferes with DMA based
> >>>> data receiving, causing the first byte of a received data transfer
> >>>> to get replaced by 0.
> >>>
> >>> Okay, and you're sure the problem/corruption occurs on the receiving side? 
> >>
> >> What I am sure is that the first byte of a packet has been replaced by 0
> >> by the time drivers/tty/tty_buffer.c: receive_buf() gets called.
> >>
> >> I did not dive into the serial-port code side of this problem since
> >> I'm unfamiliar with that.
> >>
> >>> Maybe the the extra interrupt that the tx side change will trigger somehow 
> >>> causes the confusion on the rx side. So that would be an extra call into 
> >>> handle_rx_dma() that could either do an extra flush or start DMA Rx that 
> >>> would not occur w/o that tx side change.
> >>
> >> That sounds like a likely candidate for causing this yes, as said
> >> I'm unfamiliar with the serial-port code, but I did already suspect
> >> that the change was causing some extra interrupt which somehow
> >> interfered with the rx side.
> >>
> >>>> The issue has been seen on and the revert has been tested on
> >>>> the following HW:
> >>>>
> >>>> Asus T100TA
> >>>> SoC: Bay Trail UART: 80860F0A WIFI: brcmfmac43241b4-sdio BT: BCM4324B3
> >>>>
> >>>> Lenovo Yoga Tablet 2 1051L
> >>>> SoC: Bay Trail UART: 80860F0A WIFI: brcmfmac43241b4-sdio BT: BCM4324B3
> >>>>
> >>>> Lenovo Yoga Book X91F
> >>>> Soc: Cherry Trail UART: 8086228A WIFI: brcmfmac4356-pcie BT: BCM4356A2
> >>>>
> >>>> I have more hw which I believe is affected but these are the models
> >>>> where I have done detailed testing.
> >>>>
> >>>> I would be happy to test any patches, or run a kernel with some extra
> >>>> debugging added, just let me know what you need to help fixing this.
> >>>
> >>> How easy this is to trigger in general? (Mainly trying to gauge how 
> >>> easy it will be to find the read and/or the irq that related to the 
> >>> corrupted payload).
> >>
> >> For the git bisect I did 10 consecutive "rmmod hci_uart; modprobe hci_uart"
> >> calls, re-initializing and re-uploading the BT firmware 10 times and
> >> then looked for "Frame reassembly failed" errors (+ more error caused
> >> by this error) from the drivers/bluetooth/hci_bcm.c code.
> >>
> >> This would usually trigger at least twice with the 10 rmmod + modprobe
> >> calls and typically more often then twice.
> >>
> >> With the revert (and with 5.18.y) I never see a single error even
> >> with more rmmod + modprobe calls. Note it is not just the driver's
> >> probe() which fails, sometimes there are also later errors, this
> >> is just a somewhat convenient way to reproduce.
> >>
> >> So this is somewhat easy to trigger, not trivial to trigger,
> >> but it also does not take waiting many hours before hitting it
> >> once.
> > 
> > Okay, maybe the patch below will manage to filter the interesting part of 
> > it (not tested beyond compiling). It should capture a few irq events after 
> > the THRI is set when DMA Tx completes.
> > 
> > What I'd expect is to have iir value variation between the working and not 
> > working case. Especially, if it's UART_IIR_THRI that shows up that might 
> > be usable as a condition to return from handle_rx_dma() without touching
> > DMA Rx.
> 
> Here are 2 dmesg snippets both are with a 6.3-rc2 kernel *without*
> the revert and with your dmesg patch.
> 
> This is (part of) the HCI init without issues:
> 
> [  733.682492] hci_uart_bcm serial0-0: supply vbat not found, using dummy regulator
> [  733.682628] hci_uart_bcm serial0-0: supply vddio not found, using dummy regulator
> [  733.927142] 8250irq: iir=c2 lsr+saved=20 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
> [  733.928311] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
> [  733.932892] 8250irq: iir=c2 lsr+saved=60 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
> [  733.933818] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
> [  733.970404] 8250irq: iir=c2 lsr+saved=60 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
> [  733.975558] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
> [  734.080290] 8250irq: iir=cc lsr+saved=60 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
> [  734.081300] 8250irq: iir=cc lsr+saved=60 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
> [  734.081492] Bluetooth: hci0: BCM: chip id 101
> 
> Here is another *successful* log with slightly different timing.
> 
> [  737.660532] hci_uart_bcm serial0-0: supply vbat not found, using dummy regulator
> [  737.660681] hci_uart_bcm serial0-0: supply vddio not found, using dummy regulator
> [  737.905238] 8250irq: iir=c2 lsr+saved=20 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
> [  737.906429] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
> [  737.907782] 8250irq: iir=c2 lsr+saved=20 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
> [  737.908978] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
> [  737.935929] 8250irq: iir=c2 lsr+saved=60 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
> [  737.940976] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
> [  738.048476] 8250irq: iir=cc lsr+saved=60 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
> [  738.049530] 8250irq: iir=c2 lsr+saved=60 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
> [  738.049678] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
> [  738.050142] Bluetooth: hci0: BCM: chip id 101
> 
> And here is the same part of the init sequence with an error:
> 
> [  771.247538] hci_uart_bcm serial0-0: supply vbat not found, using dummy regulator
> [  771.247656] hci_uart_bcm serial0-0: supply vddio not found, using dummy regulator
> [  771.493259] 8250irq: iir=c2 lsr+saved=20 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
> [  771.494378] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
> [  771.495637] 8250irq: iir=c2 lsr+saved=20 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
> [  771.496833] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
> [  771.523597] 8250irq: iir=c2 lsr+saved=60 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
> [  771.528708] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
> [  771.632340] 8250irq: iir=c2 lsr+saved=61 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
> [  771.632462] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/1/0 skiprx=0 cnt=-4
> [  771.632568] Bluetooth: hci0: Frame reassembly failed (-84)
> 
> Here is a full dmesg from several rmmod + modprobes, for the errors look for
> "Frame reassembly failed".
> 
> Note this is from a different boot, I had to add log_buf_len=50M to be able
> to grab a full dmesg:
> 
> https://fedorapeople.org/~jwrdegoede/dmesg-8250-dma-issue

Thanks. I'll try to come up with another debug patch tomorrow to further 
inspect the payload that will be read after that point.

In the meantime you could try if the following patch is enough 
to solve the issue.


From: Ilpo Järvinen <ilpo.jarvinen@xxxxxxxxxxxxxxx>
Subject: [PATCH] serial: 8250: Prevent starting up DMA Rx on THRI interrupt

Hans de Goede reported Bluetooth adapters (HCIs) connected over an UART
connection failed due corrupted Rx payload. The problem was narrowed
down to DMA Rx starting on UART_IIR_THRI interrupt. The problem occurs 
despite LSR having DR bit set, which is precondition for attempting to 
start DMA Rx.

This problem became apparent after e8ffbb71f783 ("serial: 8250: use
THRE & __stop_tx also with DMA") changed Tx stopping to get triggered
using THRI interrupt.

Don't setup DMA Rx on UART_IIR_THRI but leave it to a subsequent
interrupt which has Rx related IIR value.

Reported-by: Hans de Goede <hdegoede@xxxxxxxxxx>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@xxxxxxxxxxxxxxx>

---
 drivers/tty/serial/8250/8250_port.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/tty/serial/8250/8250_port.c b/drivers/tty/serial/8250/8250_port.c
index fa43df05342b..a3ef2f42a06b 100644
--- a/drivers/tty/serial/8250/8250_port.c
+++ b/drivers/tty/serial/8250/8250_port.c
@@ -1903,6 +1903,8 @@ EXPORT_SYMBOL_GPL(serial8250_modem_status);
 static bool handle_rx_dma(struct uart_8250_port *up, unsigned int iir)
 {
 	switch (iir & 0x3f) {
+	case UART_IIR_THRI:
+		return false;
 	case UART_IIR_RDI:
 		if (!up->dma->rx_running)
 			break;

-- 
tg: (571079f5ba93..) fix/dma-rx-startup (depends on: tty-next)

[Index of Archives]     [Kernel Newbies]     [Security]     [Netfilter]     [Bugtraq]     [Linux PPP]     [Linux FS]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Samba]     [Video 4 Linux]     [Linmodem]     [Device Mapper]     [Linux Kernel for ARM]

  Powered by Linux