Hi, On 3/14/23 12:48, Ilpo Järvinen wrote: > On Tue, 14 Mar 2023, Hans de Goede wrote: >> On 3/14/23 11:55, Ilpo Järvinen wrote: >>> On Tue, 14 Mar 2023, Hans de Goede wrote: >>> >>>> I have spend the last couple of days debugging a problem with Bluetooth >>>> adapters (HCIs) connected over an UART connection on Intel Bay Trail >>>> and Cherry Trail devices. >>>> >>>> After much debugging I found out that sometimes the first byte of >>>> a received packet (data[0]) would be overwritten with a 0 byte. >>>> >>>> E.g. this packet received during init of a BCM4324B3 (1) Bluetooth HCI: >>>> >>>> 04 0e 0a 01 79 fc 00 54 fe ff ff 00 00 >>>> >>>> would sometimes turn into: >>>> >>>> 00 0e 0a 01 79 fc 00 54 fe ff ff 00 00 >>>> >>>> Further investigation revealed that this goes away if I stop >>>> the dw_dmac module from loading, leading to: >>>> "dw-apb-uart 80860F0A:00: failed to request DMA" >>>> and the UART working without DMA support. >>>> >>>> Testing various kernels showed that this problem was introduced >>>> in v5.19, v5.18 - v5.18.19 are fine. An a git bisect points to: >>>> >>>> e8ffbb71f783 ("serial: 8250: use THRE & __stop_tx also with DMA") >>>> >>>> And reverting that on top of v6.3-rc2 indeed solves the problem. >>> >>> You did something else too than just that because you cannot cleanly >>> revert just e8ffbb71f783. Please indicate what happened to: >>> f8d6e9d3ca5c ("serial: 8250: Fix __stop_tx() & DMA Tx restart races") >>> >>> I guess you reverted that too and forgot to mention about it but I just >>> want to be sure we're on the same page? >> >> I manually fixed up the revert, effectively >> dropping the drivers/tty/serial/8250/8250_port.c part of f8d6e9d3ca5c >> I did not revert f8d6e9d3ca5c in its entirety. >> >> I've attached my manual fixed up revert as a patch here. >> >>>> So it seems that that commit somehow interferes with DMA based >>>> data receiving, causing the first byte of a received data transfer >>>> to get replaced by 0. >>> >>> Okay, and you're sure the problem/corruption occurs on the receiving side? >> >> What I am sure is that the first byte of a packet has been replaced by 0 >> by the time drivers/tty/tty_buffer.c: receive_buf() gets called. >> >> I did not dive into the serial-port code side of this problem since >> I'm unfamiliar with that. >> >>> Maybe the the extra interrupt that the tx side change will trigger somehow >>> causes the confusion on the rx side. So that would be an extra call into >>> handle_rx_dma() that could either do an extra flush or start DMA Rx that >>> would not occur w/o that tx side change. >> >> That sounds like a likely candidate for causing this yes, as said >> I'm unfamiliar with the serial-port code, but I did already suspect >> that the change was causing some extra interrupt which somehow >> interfered with the rx side. >> >>>> The issue has been seen on and the revert has been tested on >>>> the following HW: >>>> >>>> Asus T100TA >>>> SoC: Bay Trail UART: 80860F0A WIFI: brcmfmac43241b4-sdio BT: BCM4324B3 >>>> >>>> Lenovo Yoga Tablet 2 1051L >>>> SoC: Bay Trail UART: 80860F0A WIFI: brcmfmac43241b4-sdio BT: BCM4324B3 >>>> >>>> Lenovo Yoga Book X91F >>>> Soc: Cherry Trail UART: 8086228A WIFI: brcmfmac4356-pcie BT: BCM4356A2 >>>> >>>> I have more hw which I believe is affected but these are the models >>>> where I have done detailed testing. >>>> >>>> I would be happy to test any patches, or run a kernel with some extra >>>> debugging added, just let me know what you need to help fixing this. >>> >>> How easy this is to trigger in general? (Mainly trying to gauge how >>> easy it will be to find the read and/or the irq that related to the >>> corrupted payload). >> >> For the git bisect I did 10 consecutive "rmmod hci_uart; modprobe hci_uart" >> calls, re-initializing and re-uploading the BT firmware 10 times and >> then looked for "Frame reassembly failed" errors (+ more error caused >> by this error) from the drivers/bluetooth/hci_bcm.c code. >> >> This would usually trigger at least twice with the 10 rmmod + modprobe >> calls and typically more often then twice. >> >> With the revert (and with 5.18.y) I never see a single error even >> with more rmmod + modprobe calls. Note it is not just the driver's >> probe() which fails, sometimes there are also later errors, this >> is just a somewhat convenient way to reproduce. >> >> So this is somewhat easy to trigger, not trivial to trigger, >> but it also does not take waiting many hours before hitting it >> once. > > Okay, maybe the patch below will manage to filter the interesting part of > it (not tested beyond compiling). It should capture a few irq events after > the THRI is set when DMA Tx completes. > > What I'd expect is to have iir value variation between the working and not > working case. Especially, if it's UART_IIR_THRI that shows up that might > be usable as a condition to return from handle_rx_dma() without touching > DMA Rx. Here are 2 dmesg snippets both are with a 6.3-rc2 kernel *without* the revert and with your dmesg patch. This is (part of) the HCI init without issues: [ 733.682492] hci_uart_bcm serial0-0: supply vbat not found, using dummy regulator [ 733.682628] hci_uart_bcm serial0-0: supply vddio not found, using dummy regulator [ 733.927142] 8250irq: iir=c2 lsr+saved=20 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5 [ 733.928311] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4 [ 733.932892] 8250irq: iir=c2 lsr+saved=60 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5 [ 733.933818] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4 [ 733.970404] 8250irq: iir=c2 lsr+saved=60 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5 [ 733.975558] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4 [ 734.080290] 8250irq: iir=cc lsr+saved=60 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5 [ 734.081300] 8250irq: iir=cc lsr+saved=60 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5 [ 734.081492] Bluetooth: hci0: BCM: chip id 101 Here is another *successful* log with slightly different timing. [ 737.660532] hci_uart_bcm serial0-0: supply vbat not found, using dummy regulator [ 737.660681] hci_uart_bcm serial0-0: supply vddio not found, using dummy regulator [ 737.905238] 8250irq: iir=c2 lsr+saved=20 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5 [ 737.906429] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4 [ 737.907782] 8250irq: iir=c2 lsr+saved=20 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5 [ 737.908978] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4 [ 737.935929] 8250irq: iir=c2 lsr+saved=60 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5 [ 737.940976] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4 [ 738.048476] 8250irq: iir=cc lsr+saved=60 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5 [ 738.049530] 8250irq: iir=c2 lsr+saved=60 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5 [ 738.049678] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4 [ 738.050142] Bluetooth: hci0: BCM: chip id 101 And here is the same part of the init sequence with an error: [ 771.247538] hci_uart_bcm serial0-0: supply vbat not found, using dummy regulator [ 771.247656] hci_uart_bcm serial0-0: supply vddio not found, using dummy regulator [ 771.493259] 8250irq: iir=c2 lsr+saved=20 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5 [ 771.494378] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4 [ 771.495637] 8250irq: iir=c2 lsr+saved=20 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5 [ 771.496833] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4 [ 771.523597] 8250irq: iir=c2 lsr+saved=60 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5 [ 771.528708] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4 [ 771.632340] 8250irq: iir=c2 lsr+saved=61 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5 [ 771.632462] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/1/0 skiprx=0 cnt=-4 [ 771.632568] Bluetooth: hci0: Frame reassembly failed (-84) Here is a full dmesg from several rmmod + modprobes, for the errors look for "Frame reassembly failed". Note this is from a different boot, I had to add log_buf_len=50M to be able to grab a full dmesg: https://fedorapeople.org/~jwrdegoede/dmesg-8250-dma-issue Regards, Hans > > From: Ilpo Järvinen <ilpo.jarvinen@xxxxxxxxxxxxxxx> > Subject: [PATCH] DEBUG: DMA Rx problem > > A DEBUG PATCH, not intended for upstream inclusing. > > --- > drivers/tty/serial/8250/8250_dma.c | 4 +++- > drivers/tty/serial/8250/8250_port.c | 13 +++++++++++++ > include/linux/serial_8250.h | 1 + > 3 files changed, 17 insertions(+), 1 deletion(-) > > diff --git a/drivers/tty/serial/8250/8250_dma.c b/drivers/tty/serial/8250/8250_dma.c > index 7fa66501792d..20d54abdded2 100644 > --- a/drivers/tty/serial/8250/8250_dma.c > +++ b/drivers/tty/serial/8250/8250_dma.c > @@ -32,8 +32,10 @@ static void __dma_tx_complete(void *param) > uart_write_wakeup(&p->port); > > ret = serial8250_tx_dma(p); > - if (ret || !dma->tx_running) > + if (ret || !dma->tx_running) { > + p->irq_debug = -5; > serial8250_set_THRI(p); > + } > > spin_unlock_irqrestore(&p->port.lock, flags); > } > diff --git a/drivers/tty/serial/8250/8250_port.c b/drivers/tty/serial/8250/8250_port.c > index fa43df05342b..193c08b1888a 100644 > --- a/drivers/tty/serial/8250/8250_port.c > +++ b/drivers/tty/serial/8250/8250_port.c > @@ -1923,6 +1923,9 @@ int serial8250_handle_irq(struct uart_port *port, unsigned int iir) > struct uart_8250_port *up = up_to_u8250p(port); > bool skip_rx = false; > unsigned long flags; > + long irq_debug; > + unsigned int tx_running = 0, rx_running = 0, tx_err = 0; > + unsigned int ier; > u16 status; > > if (iir & UART_IIR_NO_INT) > @@ -1931,6 +1934,12 @@ int serial8250_handle_irq(struct uart_port *port, unsigned int iir) > spin_lock_irqsave(&port->lock, flags); > > status = serial_lsr_in(up); > + if (up->dma) { > + rx_running = up->dma->rx_running; > + tx_running = up->dma->tx_running; > + tx_err = up->dma->tx_err; > + } > + ier = up->ier; > > /* > * If port is stopped and there are no error conditions in the > @@ -1957,7 +1966,11 @@ int serial8250_handle_irq(struct uart_port *port, unsigned int iir) > __stop_tx(up); > } > > + irq_debug = up->irq_debug++; > uart_unlock_and_check_sysrq_irqrestore(port, flags); > + if (irq_debug < 0) > + pr_err("8250irq: iir=%02x lsr+saved=%02x ier=%02x dma_t/rx/err=%u/%u/%u skiprx=%d cnt=%ld\n", > + iir, status, ier, tx_running, rx_running, tx_err, skip_rx, irq_debug); > > return 1; > } > diff --git a/include/linux/serial_8250.h b/include/linux/serial_8250.h > index 741ed4807a9c..8d9d7de6728d 100644 > --- a/include/linux/serial_8250.h > +++ b/include/linux/serial_8250.h > @@ -107,6 +107,7 @@ struct uart_8250_port { > unsigned char mcr; > unsigned char cur_iotype; /* Running I/O type */ > unsigned int rpm_tx_active; > + long irq_debug; > unsigned char canary; /* non-zero during system sleep > * if no_console_suspend > */ >