Re: AW: wilc1000: error when transferring large files

Alexis Lothoré <alexis.lothore@xxxxxxxxxxx> · Mon, 13 May 2024 10:58:26 +0200

Hello Heiko,

sorry for the late reply, I was away from computer most of the previous week.

On 5/8/24 08:36, Heiko Thiery wrote:
> Hi Alexis,
> 
>> Hello,
>>
>> I'm part of a team working on a BSP for a board with a WILC1000 module from Microchip (connected via SDIO interface). Currently we are having issues when transferring large files over a WiFi connection. Here is our steps to reproduce:
>>
>> # generating file to transfer
>> dd if=/dev/random of=./test_300mb_file bs=1M count=300
>>
>> # transferring a file from a linux laptop to the test board
>> scp test_300mb_file user@192.168.1.25:~/
>>
>> During transmission, the connection speed drops to almost zero and returns to usual values after a while. At the moment of the drop, we see the following messages in the kernel log:
>>
>> ------------------------------------
>> [  138.016229] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!!
>> [  140.001682] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!!
>> [  140.009719] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!!
>> [  142.003636] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!!
>> [  142.011670] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!!
>> [  144.006676] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!!
>> [  144.014709] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!!
>> [  146.007049] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!!
>> [  146.015082] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!!
>> [  156.123890] mmc0: Timeout waiting for hardware interrupt.
>> [  156.129318] mmc0: sdhci: ============ SDHCI REGISTER DUMP ===========
>> [  156.135764] mmc0: sdhci: Sys addr:  0x4cef1400 | Version:  0x00000002
>> [  156.142211] mmc0: sdhci: Blk size:  0x00000170 | Blk cnt:  0x00000001
>> [  156.148655] mmc0: sdhci: Argument:  0x14000170 | Trn mode: 0x00000013
>> [  156.155102] mmc0: sdhci: Present:   0x01d88a0a | Host ctl: 0x00000013
>> [  156.161547] mmc0: sdhci: Power:     0x00000002 | Blk gap:  0x00000080
>> [  156.167992] mmc0: sdhci: Wake-up:   0x00000008 | Clock:    0x0000003f
>> [  156.174438] mmc0: sdhci: Timeout:   0x0000008f | Int stat: 0x00000000
>> [  156.180883] mmc0: sdhci: Int enab:  0x107f100b | Sig enab: 0x107f100b
>> [  156.187330] mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 0x00000502
>> [  156.193776] mmc0: sdhci: Caps:      0x07eb0000 | Caps_1:   0x0000b407
>> [  156.200223] mmc0: sdhci: Cmd:       0x0000353a | Max curr: 0x00ffffff
>> [  156.206669] mmc0: sdhci: Resp[0]:   0x00001000 | Resp[1]:  0x00000000
>> [  156.213113] mmc0: sdhci: Resp[2]:   0x00000000 | Resp[3]:  0x00000000
>> [  156.219557] mmc0: sdhci: Host ctl2: 0x00000000
>> [  156.224005] mmc0: sdhci: ADMA Err:  0x00000007 | ADMA Ptr: 0x40881200
>> [  156.230450] mmc0: sdhci-esdhc-imx: ========= ESDHC IMX DEBUG STATUS DUMP =========
>> [  156.238024] mmc0: sdhci-esdhc-imx: cmd debug status:  0x3100
>> [  156.243687] mmc0: sdhci-esdhc-imx: data debug status:  0x3200
>> [  156.249438] mmc0: sdhci-esdhc-imx: trans debug status:  0x3300
>> [  156.255276] mmc0: sdhci-esdhc-imx: dma debug status:  0x3400
>> [  156.260938] mmc0: sdhci-esdhc-imx: adma debug status:  0x35b4
>> [  156.266690] mmc0: sdhci-esdhc-imx: fifo debug status:  0x3610
>> [  156.272439] mmc0: sdhci-esdhc-imx: async fifo debug status:  0x3751
>> [  156.278710] mmc0: sdhci: ============================================
>> [  156.287352] wilc1000_sdio mmc0:0001:1: wilc_sdio_cmd53..failed, err(-110)
>> [  156.294177] wilc1000_sdio mmc0:0001:1: Failed cmd53 [0], bytes read...
>> ------------------------------------
> 
> since you are also using the WILC1000 device, I wanted to ask if you have ever seen such behavior?
> As Kirill wrote, with newer kernel versions it only occurs with longer transfers.

I remember having seen a few time the described behavior (speed dropping to 0,
then resuming normally during iperf sessions) but not the associated logs. Maybe
my kernel configuration is different enough from Kirill's one, especially
regarding the tick configuration. Can I have the corresponding defconfig ?

For the second part of the logs (starting from "Timeout waiting for hardware
interrupt), I have seen them, but never in "nominal" operation, only when
unplugging the module during operation (I am working with wilc1000 sd, which is
a kind of "hotpluggable" wilc 1000 in SD/MMC port).

I am wondering if this scp transfer may make the chip reach a memory limit,
especially since you observe the issue less frequently with newer kernel. I
remember some patches from Microchip team mitigating issue about this on tx side
(a08bb28f6eb6 ("wifi: wilc1000: add back-off algorithm to balance tx queue
packets")). That's only a wild thought of course, and the face issue certainly
needs some new fix even if it is indeed this patch reducing the issue frequency

Thanks,

Alexis

-- 
Alexis Lothoré, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com