Re: the commit c434e5e48dc4 (rsi: Use resume_noirq for SDIO) introduced driver crash in the 4.15 kernel


On 8/19/21 10:52 AM, Hui Wang wrote:

On 8/19/21 3:49 PM, Marek Vasut wrote:
On 8/19/21 7:31 AM, Greg Kroah-Hartman wrote:
On Thu, Aug 19, 2021 at 10:57:03AM +0800, Hui Wang wrote:

On 8/18/21 5:04 PM, Marek Vasut wrote:
On 8/18/21 7:33 AM, Greg Kroah-Hartman wrote:
On Wed, Aug 18, 2021 at 12:06:15PM +0800, Hui Wang wrote:
Hi Marex,

We backported this patch to the Ubuntu 4.15.0-generic kernel and found
that it makes the rsi driver crash on system resume on the Dell 300x IoT
platform (100% reproducible). Below is the log. Once this log appears,
the rsi wifi no longer works, and we need to run
'rmmod rsi_sdio; modprobe rsi_sdio' to make it work again.

So do you know what is missing apart from this patch, or is this patch
not suitable for the 4.15 kernel at all?

Does 4.19.191 work for this system?  Why not just use that or newer
instead?

I haven't seen this on linux-stable 5.4.y or 5.10.y, if that information
is of any use.

But I have to admit, I am tempted to mark the whole driver as BROKEN and
submit that for stable backports.

Because that is what it is, it is buggy, broken, and the hardware lacks
any documentation. I spent an insane amount of time talking to RedPine
Signals / SiLabs trying to get help with basic things like association
problems against various APs, no result there. I tried getting hardware
docs from them so I can fix the driver myself, no result either. So far
I tried to pick various fixes from their downstream driver and submit
them, but that is massively time consuming and the changes there are not
separated or documented, it is just one large chunk of code.

As far as I can tell, they also have no interest in fixing the driver or
helping others with fixing it, so maybe we should just mark it as broken
... :-(

Hi Marek,

Got it, thanks for sharing it.

Hi Greg,

I just tested 4.19.191 and got the same result; the wifi crashes after
resume:

admin@HW6VB02:~$ uname -a
Linux HW6VB02 4.19.191 #1 SMP Thu Aug 19 10:19:32 CST 2021 x86_64 x86_64 x86_64 GNU/Linux

[   59.682908] sdhci-acpi INT33BB:00: pre_suspend failed for non-removable host: -38
[   59.682917] Freezing user space processes ... (elapsed 0.003 seconds) done.
[   59.686063] OOM killer disabled.
[   59.686065] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
[   59.687385] Suspending console(s) (use no_console_suspend to debug)
[   59.687931] rsi_91x: ===> Interface DOWN <===
[   70.068983] mmc1: Controller never released inhibit bit(s).
[   70.068992] mmc1: sdhci: ============ SDHCI REGISTER DUMP ===========
[   70.069002] mmc1: sdhci: Sys addr:  0xffffffff | Version: 0x0000ffff
[   70.069009] mmc1: sdhci: Blk size:  0x0000ffff | Blk cnt: 0x0000ffff
[   70.069016] mmc1: sdhci: Argument:  0xffffffff | Trn mode: 0x0000ffff
[   70.069023] mmc1: sdhci: Present:   0xffffffff | Host ctl: 0x000000ff
[   70.069030] mmc1: sdhci: Power:     0x000000ff | Blk gap: 0x000000ff
[   70.069036] mmc1: sdhci: Wake-up:   0x000000ff | Clock: 0x0000ffff
[   70.069043] mmc1: sdhci: Timeout:   0x000000ff | Int stat: 0xffffffff


So shall we revert this commit from 4.19.y?

If you revert it, does it work properly?  What about in Linus's tree?

I reverted the commit in 4.19.191, and the wifi then worked both before and after system resume. I also tested the mainline kernel linux-5.13: before suspend the wifi worked, but after suspend the whole system could not wake up, and I could not recover it because I have no physical access to the machine. I did all the testing remotely over ssh, so there is no test result for Linus' tree.

I suspect you just hit the issue this patch was trying to fix then.

If you have console access, use no_console_suspend to see the backtrace on wake up.

I suspect in that case, sdio_claim_host() will spin indefinitely and never finish, see the c434e5e48dc4e ("rsi: Use resume_noirq for SDIO") commit message.
At least, we have never seen this issue on the 4.15 kernel. Without commit c434e5e48dc4e ("rsi: Use resume_noirq for SDIO"), wifi and bluetooth work well both before and after suspend.

I suspect you might've just been lucky with that, because it seems RSI did hit it too (see below). This could also be something which triggers only on specific controller drivers (?).


Note that I did my tests on ARM MMCI (stm32mp1 variant).
The platform I am testing is an x86 one, and the sdhci controller driver is sdhci_acpi.c.

Do you have an RSI module which can be plugged into an SD card slot there, or is that RSI module soldered onto some devkit/board?

Mine is the latter, soldered onto a SoM, so I have a hard time testing on other SDIO controllers.

This "[   70.068983] mmc1: Controller never released inhibit bit(s)" looks suspicious in the log above.
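
As a rough illustration of why that dump stands out: every register in it reads back as all ones (0xff, 0xffff or 0xffffffff). That is commonly what a bus read returns when the device is not responding, but this interpretation is an assumption and is not confirmed anywhere in this thread. A minimal Python sketch that flags such a dump follows. The helper names parse_sdhci_dump, all_ones and dump_looks_dead are hypothetical, and the sample lines are copied from the log above.

```python
import re

# Matches "Name: 0x<hex>" pairs inside an SDHCI register dump line.
REG_RE = re.compile(r"([A-Za-z][A-Za-z \-]*?):\s+0x([0-9a-fA-F]+)")

# A few lines copied from the log earlier in this thread.
SAMPLE = """\
[   70.068983] mmc1: Controller never released inhibit bit(s).
[   70.068992] mmc1: sdhci: ============ SDHCI REGISTER DUMP ===========
[   70.069002] mmc1: sdhci: Sys addr:  0xffffffff | Version: 0x0000ffff
[   70.069030] mmc1: sdhci: Power:     0x000000ff | Blk gap: 0x000000ff
"""


def parse_sdhci_dump(log_text):
    """Return {register name: hex digits} for every 'sdhci:' dump line."""
    regs = {}
    for line in log_text.splitlines():
        if "sdhci:" not in line:
            continue
        payload = line.split("sdhci:", 1)[1]
        for name, value in REG_RE.findall(payload):
            regs[name.strip()] = value.lower()
    return regs


def all_ones(value):
    """Heuristic: the value is all ones for an 8-, 16- or 32-bit register.

    The dump does not state register widths, so leading zeros are dropped
    and the remainder is compared against the three plausible widths.
    """
    return value.lstrip("0") in {"ff", "ffff", "ffffffff"}


def dump_looks_dead(regs):
    """True when the dump is non-empty and every register is all ones."""
    return bool(regs) and all(all_ones(v) for v in regs.values())


if __name__ == "__main__":
    registers = parse_sdhci_dump(SAMPLE)
    for name, value in registers.items():
        print(f"{name}: 0x{value} all_ones={all_ones(value)}")
    print("controller looks unreachable:", dump_looks_dead(registers))
```

If the check fires, it only says the register reads are not meaningful; it does not say why the controller is unreachable at that point in resume.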

Also, newer versions of the RSI downstream driver [1] as of 390542d ("Updated Readme.txt file") simply comment out rsi_sdio_enable_interrupts() in rsi/rsi_91x_sdio.c rsi_resume(), which looks like RSI ran into the same problem, but "fixed" it differently. I think the approach RSI took is wrong and just hides the issue.

[1] git://github.com/SiliconLabs/RS911X-nLink-OSD

The bottom line is, I would really prefer to figure out what the problem you see on Linux 5.13.y is, fix it, and backport that fix, so that suspend/resume works correctly for everyone, rather than revert a patch without really understanding the underlying problem.

Sadly, the RSI driver is buggy.


