Re: [Regression] "serial: 8250: use THRE & __stop_tx also with DMA" breaks DMA rx

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi,

On 3/14/23 12:48, Ilpo Järvinen wrote:
> On Tue, 14 Mar 2023, Hans de Goede wrote:
>> On 3/14/23 11:55, Ilpo Järvinen wrote:
>>> On Tue, 14 Mar 2023, Hans de Goede wrote:
>>>
>>>> I have spend the last couple of days debugging a problem with Bluetooth
>>>> adapters (HCIs) connected over an UART connection on Intel Bay Trail
>>>> and Cherry Trail devices.
>>>>
>>>> After much debugging I found out that sometimes the first byte of
>>>> a received packet (data[0]) would be overwritten with a 0 byte.
>>>>
>>>> E.g. this packet received during init of a BCM4324B3 (1) Bluetooth HCI:
>>>>
>>>> 04 0e 0a 01 79 fc 00 54 fe ff ff 00 00
>>>>
>>>> would sometimes turn into:
>>>>
>>>> 00 0e 0a 01 79 fc 00 54 fe ff ff 00 00
>>>>
>>>> Further investigation revealed that this goes away if I stop
>>>> the dw_dmac module from loading, leading to:
>>>> "dw-apb-uart 80860F0A:00: failed to request DMA"
>>>> and the UART working without DMA support.
>>>>
>>>> Testing various kernels showed that this problem was introduced
>>>> in v5.19, v5.18 - v5.18.19 are fine. An a git bisect points to:
>>>>
>>>> e8ffbb71f783 ("serial: 8250: use THRE & __stop_tx also with DMA")
>>>>
>>>> And reverting that on top of v6.3-rc2 indeed solves the problem.
>>>
>>> You did something else too than just that because you cannot cleanly 
>>> revert just e8ffbb71f783. Please indicate what happened to:
>>>   f8d6e9d3ca5c ("serial: 8250: Fix __stop_tx() & DMA Tx restart races")
>>>
>>> I guess you reverted that too and forgot to mention about it but I just 
>>> want to be sure we're on the same page?
>>
>> I manually fixed up the revert, effectively
>> dropping the drivers/tty/serial/8250/8250_port.c part of f8d6e9d3ca5c
>> I did not revert f8d6e9d3ca5c in its entirety.
>>
>> I've attached my manual fixed up revert as a patch here.
>>
>>>> So it seems that that commit somehow interferes with DMA based
>>>> data receiving, causing the first byte of a received data transfer
>>>> to get replaced by 0.
>>>
>>> Okay, and you're sure the problem/corruption occurs on the receiving side? 
>>
>> What I am sure is that the first byte of a packet has been replaced by 0
>> by the time drivers/tty/tty_buffer.c: receive_buf() gets called.
>>
>> I did not dive into the serial-port code side of this problem since
>> I'm unfamiliar with that.
>>
>>> Maybe the the extra interrupt that the tx side change will trigger somehow 
>>> causes the confusion on the rx side. So that would be an extra call into 
>>> handle_rx_dma() that could either do an extra flush or start DMA Rx that 
>>> would not occur w/o that tx side change.
>>
>> That sounds like a likely candidate for causing this yes, as said
>> I'm unfamiliar with the serial-port code, but I did already suspect
>> that the change was causing some extra interrupt which somehow
>> interfered with the rx side.
>>
>>>> The issue has been seen on and the revert has been tested on
>>>> the following HW:
>>>>
>>>> Asus T100TA
>>>> SoC: Bay Trail UART: 80860F0A WIFI: brcmfmac43241b4-sdio BT: BCM4324B3
>>>>
>>>> Lenovo Yoga Tablet 2 1051L
>>>> SoC: Bay Trail UART: 80860F0A WIFI: brcmfmac43241b4-sdio BT: BCM4324B3
>>>>
>>>> Lenovo Yoga Book X91F
>>>> Soc: Cherry Trail UART: 8086228A WIFI: brcmfmac4356-pcie BT: BCM4356A2
>>>>
>>>> I have more hw which I believe is affected but these are the models
>>>> where I have done detailed testing.
>>>>
>>>> I would be happy to test any patches, or run a kernel with some extra
>>>> debugging added, just let me know what you need to help fixing this.
>>>
>>> How easy this is to trigger in general? (Mainly trying to gauge how 
>>> easy it will be to find the read and/or the irq that related to the 
>>> corrupted payload).
>>
>> For the git bisect I did 10 consecutive "rmmod hci_uart; modprobe hci_uart"
>> calls, re-initializing and re-uploading the BT firmware 10 times and
>> then looked for "Frame reassembly failed" errors (+ more error caused
>> by this error) from the drivers/bluetooth/hci_bcm.c code.
>>
>> This would usually trigger at least twice with the 10 rmmod + modprobe
>> calls and typically more often then twice.
>>
>> With the revert (and with 5.18.y) I never see a single error even
>> with more rmmod + modprobe calls. Note it is not just the driver's
>> probe() which fails, sometimes there are also later errors, this
>> is just a somewhat convenient way to reproduce.
>>
>> So this is somewhat easy to trigger, not trivial to trigger,
>> but it also does not take waiting many hours before hitting it
>> once.
> 
> Okay, maybe the patch below will manage to filter the interesting part of 
> it (not tested beyond compiling). It should capture a few irq events after 
> the THRI is set when DMA Tx completes.
> 
> What I'd expect is to have iir value variation between the working and not 
> working case. Especially, if it's UART_IIR_THRI that shows up that might 
> be usable as a condition to return from handle_rx_dma() without touching
> DMA Rx.

Here are 2 dmesg snippets both are with a 6.3-rc2 kernel *without*
the revert and with your dmesg patch.

This is (part of) the HCI init without issues:

[  733.682492] hci_uart_bcm serial0-0: supply vbat not found, using dummy regulator
[  733.682628] hci_uart_bcm serial0-0: supply vddio not found, using dummy regulator
[  733.927142] 8250irq: iir=c2 lsr+saved=20 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
[  733.928311] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
[  733.932892] 8250irq: iir=c2 lsr+saved=60 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
[  733.933818] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
[  733.970404] 8250irq: iir=c2 lsr+saved=60 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
[  733.975558] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
[  734.080290] 8250irq: iir=cc lsr+saved=60 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
[  734.081300] 8250irq: iir=cc lsr+saved=60 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
[  734.081492] Bluetooth: hci0: BCM: chip id 101

Here is another *successful* log with slightly different timing.

[  737.660532] hci_uart_bcm serial0-0: supply vbat not found, using dummy regulator
[  737.660681] hci_uart_bcm serial0-0: supply vddio not found, using dummy regulator
[  737.905238] 8250irq: iir=c2 lsr+saved=20 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
[  737.906429] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
[  737.907782] 8250irq: iir=c2 lsr+saved=20 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
[  737.908978] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
[  737.935929] 8250irq: iir=c2 lsr+saved=60 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
[  737.940976] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
[  738.048476] 8250irq: iir=cc lsr+saved=60 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
[  738.049530] 8250irq: iir=c2 lsr+saved=60 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
[  738.049678] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
[  738.050142] Bluetooth: hci0: BCM: chip id 101

And here is the same part of the init sequence with an error:

[  771.247538] hci_uart_bcm serial0-0: supply vbat not found, using dummy regulator
[  771.247656] hci_uart_bcm serial0-0: supply vddio not found, using dummy regulator
[  771.493259] 8250irq: iir=c2 lsr+saved=20 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
[  771.494378] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
[  771.495637] 8250irq: iir=c2 lsr+saved=20 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
[  771.496833] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
[  771.523597] 8250irq: iir=c2 lsr+saved=60 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
[  771.528708] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
[  771.632340] 8250irq: iir=c2 lsr+saved=61 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
[  771.632462] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/1/0 skiprx=0 cnt=-4
[  771.632568] Bluetooth: hci0: Frame reassembly failed (-84)

Here is a full dmesg from several rmmod + modprobes, for the errors look for
"Frame reassembly failed".

Note this is from a different boot, I had to add log_buf_len=50M to be able
to grab a full dmesg:

https://fedorapeople.org/~jwrdegoede/dmesg-8250-dma-issue

Regards,

Hans




> 
> From: Ilpo Järvinen <ilpo.jarvinen@xxxxxxxxxxxxxxx>
> Subject: [PATCH] DEBUG: DMA Rx problem
> 
> A DEBUG PATCH, not intended for upstream inclusing.
> 
> ---
>  drivers/tty/serial/8250/8250_dma.c  |  4 +++-
>  drivers/tty/serial/8250/8250_port.c | 13 +++++++++++++
>  include/linux/serial_8250.h         |  1 +
>  3 files changed, 17 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/tty/serial/8250/8250_dma.c b/drivers/tty/serial/8250/8250_dma.c
> index 7fa66501792d..20d54abdded2 100644
> --- a/drivers/tty/serial/8250/8250_dma.c
> +++ b/drivers/tty/serial/8250/8250_dma.c
> @@ -32,8 +32,10 @@ static void __dma_tx_complete(void *param)
>  		uart_write_wakeup(&p->port);
>  
>  	ret = serial8250_tx_dma(p);
> -	if (ret || !dma->tx_running)
> +	if (ret || !dma->tx_running) {
> +		p->irq_debug = -5;
>  		serial8250_set_THRI(p);
> +	}
>  
>  	spin_unlock_irqrestore(&p->port.lock, flags);
>  }
> diff --git a/drivers/tty/serial/8250/8250_port.c b/drivers/tty/serial/8250/8250_port.c
> index fa43df05342b..193c08b1888a 100644
> --- a/drivers/tty/serial/8250/8250_port.c
> +++ b/drivers/tty/serial/8250/8250_port.c
> @@ -1923,6 +1923,9 @@ int serial8250_handle_irq(struct uart_port *port, unsigned int iir)
>  	struct uart_8250_port *up = up_to_u8250p(port);
>  	bool skip_rx = false;
>  	unsigned long flags;
> +	long irq_debug;
> +	unsigned int tx_running = 0, rx_running = 0, tx_err = 0;
> +	unsigned int ier;
>  	u16 status;
>  
>  	if (iir & UART_IIR_NO_INT)
> @@ -1931,6 +1934,12 @@ int serial8250_handle_irq(struct uart_port *port, unsigned int iir)
>  	spin_lock_irqsave(&port->lock, flags);
>  
>  	status = serial_lsr_in(up);
> +	if (up->dma) {
> +		rx_running = up->dma->rx_running;
> +		tx_running = up->dma->tx_running;
> +		tx_err = up->dma->tx_err;
> +	}
> +	ier = up->ier;
>  
>  	/*
>  	 * If port is stopped and there are no error conditions in the
> @@ -1957,7 +1966,11 @@ int serial8250_handle_irq(struct uart_port *port, unsigned int iir)
>  			__stop_tx(up);
>  	}
>  
> +	irq_debug = up->irq_debug++;
>  	uart_unlock_and_check_sysrq_irqrestore(port, flags);
> +	if (irq_debug < 0)
> +		pr_err("8250irq: iir=%02x lsr+saved=%02x ier=%02x dma_t/rx/err=%u/%u/%u skiprx=%d cnt=%ld\n",
> +		       iir, status, ier, tx_running, rx_running, tx_err, skip_rx, irq_debug);
>  
>  	return 1;
>  }
> diff --git a/include/linux/serial_8250.h b/include/linux/serial_8250.h
> index 741ed4807a9c..8d9d7de6728d 100644
> --- a/include/linux/serial_8250.h
> +++ b/include/linux/serial_8250.h
> @@ -107,6 +107,7 @@ struct uart_8250_port {
>  	unsigned char		mcr;
>  	unsigned char		cur_iotype;	/* Running I/O type */
>  	unsigned int		rpm_tx_active;
> +	long			irq_debug;
>  	unsigned char		canary;		/* non-zero during system sleep
>  						 *   if no_console_suspend
>  						 */
> 




[Index of Archives]     [Kernel Newbies]     [Security]     [Netfilter]     [Bugtraq]     [Linux PPP]     [Linux FS]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Samba]     [Video 4 Linux]     [Linmodem]     [Device Mapper]     [Linux Kernel for ARM]

  Powered by Linux