Re: [Regression] "serial: 8250: use THRE & __stop_tx also with DMA" breaks DMA rx

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi,

On 3/14/23 17:55, Ilpo Järvinen wrote:
> On Tue, 14 Mar 2023, Hans de Goede wrote:
>> On 3/14/23 12:48, Ilpo Järvinen wrote:
>>> On Tue, 14 Mar 2023, Hans de Goede wrote:
>>>> On 3/14/23 11:55, Ilpo Järvinen wrote:
>>>>> On Tue, 14 Mar 2023, Hans de Goede wrote:
>>>>>
>>>>>> I have spend the last couple of days debugging a problem with Bluetooth
>>>>>> adapters (HCIs) connected over an UART connection on Intel Bay Trail
>>>>>> and Cherry Trail devices.
>>>>>>
>>>>>> After much debugging I found out that sometimes the first byte of
>>>>>> a received packet (data[0]) would be overwritten with a 0 byte.
>>>>>>
>>>>>> E.g. this packet received during init of a BCM4324B3 (1) Bluetooth HCI:
>>>>>>
>>>>>> 04 0e 0a 01 79 fc 00 54 fe ff ff 00 00
>>>>>>
>>>>>> would sometimes turn into:
>>>>>>
>>>>>> 00 0e 0a 01 79 fc 00 54 fe ff ff 00 00
>>>>>>
>>>>>> Further investigation revealed that this goes away if I stop
>>>>>> the dw_dmac module from loading, leading to:
>>>>>> "dw-apb-uart 80860F0A:00: failed to request DMA"
>>>>>> and the UART working without DMA support.
>>>>>>
>>>>>> Testing various kernels showed that this problem was introduced
>>>>>> in v5.19, v5.18 - v5.18.19 are fine. An a git bisect points to:
>>>>>>
>>>>>> e8ffbb71f783 ("serial: 8250: use THRE & __stop_tx also with DMA")
>>>>>>
>>>>>> And reverting that on top of v6.3-rc2 indeed solves the problem.
>>>>>
>>>>> You did something else too than just that because you cannot cleanly 
>>>>> revert just e8ffbb71f783. Please indicate what happened to:
>>>>>   f8d6e9d3ca5c ("serial: 8250: Fix __stop_tx() & DMA Tx restart races")
>>>>>
>>>>> I guess you reverted that too and forgot to mention about it but I just 
>>>>> want to be sure we're on the same page?
>>>>
>>>> I manually fixed up the revert, effectively
>>>> dropping the drivers/tty/serial/8250/8250_port.c part of f8d6e9d3ca5c
>>>> I did not revert f8d6e9d3ca5c in its entirety.
>>>>
>>>> I've attached my manual fixed up revert as a patch here.
>>>>
>>>>>> So it seems that that commit somehow interferes with DMA based
>>>>>> data receiving, causing the first byte of a received data transfer
>>>>>> to get replaced by 0.
>>>>>
>>>>> Okay, and you're sure the problem/corruption occurs on the receiving side? 
>>>>
>>>> What I am sure is that the first byte of a packet has been replaced by 0
>>>> by the time drivers/tty/tty_buffer.c: receive_buf() gets called.
>>>>
>>>> I did not dive into the serial-port code side of this problem since
>>>> I'm unfamiliar with that.
>>>>
>>>>> Maybe the the extra interrupt that the tx side change will trigger somehow 
>>>>> causes the confusion on the rx side. So that would be an extra call into 
>>>>> handle_rx_dma() that could either do an extra flush or start DMA Rx that 
>>>>> would not occur w/o that tx side change.
>>>>
>>>> That sounds like a likely candidate for causing this yes, as said
>>>> I'm unfamiliar with the serial-port code, but I did already suspect
>>>> that the change was causing some extra interrupt which somehow
>>>> interfered with the rx side.
>>>>
>>>>>> The issue has been seen on and the revert has been tested on
>>>>>> the following HW:
>>>>>>
>>>>>> Asus T100TA
>>>>>> SoC: Bay Trail UART: 80860F0A WIFI: brcmfmac43241b4-sdio BT: BCM4324B3
>>>>>>
>>>>>> Lenovo Yoga Tablet 2 1051L
>>>>>> SoC: Bay Trail UART: 80860F0A WIFI: brcmfmac43241b4-sdio BT: BCM4324B3
>>>>>>
>>>>>> Lenovo Yoga Book X91F
>>>>>> Soc: Cherry Trail UART: 8086228A WIFI: brcmfmac4356-pcie BT: BCM4356A2
>>>>>>
>>>>>> I have more hw which I believe is affected but these are the models
>>>>>> where I have done detailed testing.
>>>>>>
>>>>>> I would be happy to test any patches, or run a kernel with some extra
>>>>>> debugging added, just let me know what you need to help fixing this.
>>>>>
>>>>> How easy this is to trigger in general? (Mainly trying to gauge how 
>>>>> easy it will be to find the read and/or the irq that related to the 
>>>>> corrupted payload).
>>>>
>>>> For the git bisect I did 10 consecutive "rmmod hci_uart; modprobe hci_uart"
>>>> calls, re-initializing and re-uploading the BT firmware 10 times and
>>>> then looked for "Frame reassembly failed" errors (+ more error caused
>>>> by this error) from the drivers/bluetooth/hci_bcm.c code.
>>>>
>>>> This would usually trigger at least twice with the 10 rmmod + modprobe
>>>> calls and typically more often then twice.
>>>>
>>>> With the revert (and with 5.18.y) I never see a single error even
>>>> with more rmmod + modprobe calls. Note it is not just the driver's
>>>> probe() which fails, sometimes there are also later errors, this
>>>> is just a somewhat convenient way to reproduce.
>>>>
>>>> So this is somewhat easy to trigger, not trivial to trigger,
>>>> but it also does not take waiting many hours before hitting it
>>>> once.
>>>
>>> Okay, maybe the patch below will manage to filter the interesting part of 
>>> it (not tested beyond compiling). It should capture a few irq events after 
>>> the THRI is set when DMA Tx completes.
>>>
>>> What I'd expect is to have iir value variation between the working and not 
>>> working case. Especially, if it's UART_IIR_THRI that shows up that might 
>>> be usable as a condition to return from handle_rx_dma() without touching
>>> DMA Rx.
>>
>> Here are 2 dmesg snippets both are with a 6.3-rc2 kernel *without*
>> the revert and with your dmesg patch.
>>
>> This is (part of) the HCI init without issues:
>>
>> [  733.682492] hci_uart_bcm serial0-0: supply vbat not found, using dummy regulator
>> [  733.682628] hci_uart_bcm serial0-0: supply vddio not found, using dummy regulator
>> [  733.927142] 8250irq: iir=c2 lsr+saved=20 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
>> [  733.928311] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
>> [  733.932892] 8250irq: iir=c2 lsr+saved=60 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
>> [  733.933818] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
>> [  733.970404] 8250irq: iir=c2 lsr+saved=60 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
>> [  733.975558] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
>> [  734.080290] 8250irq: iir=cc lsr+saved=60 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
>> [  734.081300] 8250irq: iir=cc lsr+saved=60 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
>> [  734.081492] Bluetooth: hci0: BCM: chip id 101
>>
>> Here is another *successful* log with slightly different timing.
>>
>> [  737.660532] hci_uart_bcm serial0-0: supply vbat not found, using dummy regulator
>> [  737.660681] hci_uart_bcm serial0-0: supply vddio not found, using dummy regulator
>> [  737.905238] 8250irq: iir=c2 lsr+saved=20 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
>> [  737.906429] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
>> [  737.907782] 8250irq: iir=c2 lsr+saved=20 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
>> [  737.908978] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
>> [  737.935929] 8250irq: iir=c2 lsr+saved=60 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
>> [  737.940976] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
>> [  738.048476] 8250irq: iir=cc lsr+saved=60 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
>> [  738.049530] 8250irq: iir=c2 lsr+saved=60 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
>> [  738.049678] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
>> [  738.050142] Bluetooth: hci0: BCM: chip id 101
>>
>> And here is the same part of the init sequence with an error:
>>
>> [  771.247538] hci_uart_bcm serial0-0: supply vbat not found, using dummy regulator
>> [  771.247656] hci_uart_bcm serial0-0: supply vddio not found, using dummy regulator
>> [  771.493259] 8250irq: iir=c2 lsr+saved=20 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
>> [  771.494378] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
>> [  771.495637] 8250irq: iir=c2 lsr+saved=20 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
>> [  771.496833] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
>> [  771.523597] 8250irq: iir=c2 lsr+saved=60 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
>> [  771.528708] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
>> [  771.632340] 8250irq: iir=c2 lsr+saved=61 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
>> [  771.632462] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/1/0 skiprx=0 cnt=-4
>> [  771.632568] Bluetooth: hci0: Frame reassembly failed (-84)
>>
>> Here is a full dmesg from several rmmod + modprobes, for the errors look for
>> "Frame reassembly failed".
>>
>> Note this is from a different boot, I had to add log_buf_len=50M to be able
>> to grab a full dmesg:
>>
>> https://fedorapeople.org/~jwrdegoede/dmesg-8250-dma-issue
> 
> Thanks. I'll try to come up with another debug patch tomorrow to further 
> inspect the payload that will be read after that point.
> 
> In the meantime you could try if the following patch is enough 
> to solve the issue.
> 
> 
> From: Ilpo Järvinen <ilpo.jarvinen@xxxxxxxxxxxxxxx>
> Subject: [PATCH] serial: 8250: Prevent starting up DMA Rx on THRI interrupt
> 
> Hans de Goede reported Bluetooth adapters (HCIs) connected over an UART
> connection failed due corrupted Rx payload. The problem was narrowed
> down to DMA Rx starting on UART_IIR_THRI interrupt. The problem occurs 
> despite LSR having DR bit set, which is precondition for attempting to 
> start DMA Rx.
> 
> This problem became apparent after e8ffbb71f783 ("serial: 8250: use
> THRE & __stop_tx also with DMA") changed Tx stopping to get triggered
> using THRI interrupt.
> 
> Don't setup DMA Rx on UART_IIR_THRI but leave it to a subsequent
> interrupt which has Rx related IIR value.
> 
> Reported-by: Hans de Goede <hdegoede@xxxxxxxxxx>
> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@xxxxxxxxxxxxxxx>

I can confirm that this fixes things for me:

Tested-by: Hans de Goede <hdegoede@xxxxxxxxxx>

Regards,

Hans



> 
> ---
>  drivers/tty/serial/8250/8250_port.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/tty/serial/8250/8250_port.c b/drivers/tty/serial/8250/8250_port.c
> index fa43df05342b..a3ef2f42a06b 100644
> --- a/drivers/tty/serial/8250/8250_port.c
> +++ b/drivers/tty/serial/8250/8250_port.c
> @@ -1903,6 +1903,8 @@ EXPORT_SYMBOL_GPL(serial8250_modem_status);
>  static bool handle_rx_dma(struct uart_8250_port *up, unsigned int iir)
>  {
>  	switch (iir & 0x3f) {
> +	case UART_IIR_THRI:
> +		return false;
>  	case UART_IIR_RDI:
>  		if (!up->dma->rx_running)
>  			break;
> 




[Index of Archives]     [Kernel Newbies]     [Security]     [Netfilter]     [Bugtraq]     [Linux PPP]     [Linux FS]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Samba]     [Video 4 Linux]     [Linmodem]     [Device Mapper]     [Linux Kernel for ARM]

  Powered by Linux