On Sun, Nov 15, 2020 at 8:55 PM wi nk <wink@xxxxxxxxxxx> wrote: > > On Sun, Nov 15, 2020 at 2:30 PM Thomas Krause <thomaskrause@xxxxxxxxx> wrote: > > > > > > Am 12.11.20 um 16:44 schrieb wi nk: > > > On Thu, Nov 12, 2020 at 10:00 AM Kalle Valo <kvalo@xxxxxxxxxxxxxx> wrote: > > >> wi nk <wink@xxxxxxxxxxx> writes: > > >> > > >>> On Thu, Nov 12, 2020 at 8:15 AM Kalle Valo <kvalo@xxxxxxxxxxxxxx> wrote: > > >>>> Stefani Seibold <stefani@xxxxxxxxxxx> writes: > > >>>> > > >>>>> Am Donnerstag, den 12.11.2020, 02:10 +0100 schrieb wi nk: > > >>>>>> I've yet to see any instability after 45 minutes of exercising it, I > > >>>>>> do see a couple of messages that came out of the driver: > > >>>>>> > > >>>>>> [ 8.963389] ath11k_pci 0000:55:00.0: Unknown eventid: 0x16005 > > >>>>>> [ 11.342317] ath11k_pci 0000:55:00.0: Unknown eventid: 0x1d00a > > >>>>>> > > >>>>>> then when it associates: > > >>>>>> > > >>>>>> [ 16.718895] wlp85s0: send auth to ec:08:6b:27:01:ea (try 1/3) > > >>>>>> [ 16.722636] wlp85s0: authenticated > > >>>>>> [ 16.724150] wlp85s0: associate with ec:08:6b:27:01:ea (try 1/3) > > >>>>>> [ 16.726486] wlp85s0: RX AssocResp from ec:08:6b:27:01:ea > > >>>>>> (capab=0x411 status=0 aid=8) > > >>>>>> [ 16.738443] wlp85s0: associated > > >>>>>> [ 16.764966] IPv6: ADDRCONF(NETDEV_CHANGE): wlp85s0: link becomes > > >>>>>> ready > > >>>>>> > > >>>>>> The adapter is achieving around 500 mbps on my gigabit connection, my > > >>>>>> 2018 mbp sees around 650, so it's doing pretty well so far. > > >>>>>> > > >>>>>> Stefani - when you applied the patch that Kalle shared, which branch > > >>>>>> did you apply it to? I applied it to ath11k-qca6390-bringup and when > > >>>>>> I revert 7fef431be9c9 there is a small merge conflict I needed to > > >>>>>> resolve. I wonder if either the starting branch, or your chosen > > >>>>>> resolution are related to the instability you see (or I'm just lucky > > >>>>>> so far! :)). > > >>>>>> > > >>>>> I used the vanilla kernel tree > > >>>>> https://git.kernel.org/torvalds/t/linux-5.10-rc2.tar.gz. On top of this > > >>>>> i applied the > > >>>>> > > >>>>> RFT-ath11k-pci-support-platforms-with-one-MSI-vector.patch > > >>>>> > > >>>>> and reverted the patch 7fef431be9c9 > > >>>> I did also my testing on v5.10-rc2 and I recommend to use that as the > > >>>> baseline when debuggin these ath11k problems. It helps to compare the > > >>>> results if everyone have the same baseline. > > >>>> > > >>>> -- > > >>>> https://patchwork.kernel.org/project/linux-wireless/list/ > > >>>> > > >>>> https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches > > >>> Absolutely, I'll rebuild to 5.10 later today and apply the same series > > >>> of patches and report back. > > >> Great, thanks. > > >> > > >>> I'll also test out the patch on both versions from Carl to fix > > >>> resuming. It stands to reason that we may be seeing another regression > > >>> between Stefani (5.10) and myself (5.9 bringup branch) as I don't see > > >>> any disconnections or instability once the interface is online. > > >> Yeah, there is something strange happening between v5.9 and v5.10 we > > >> have not yet figured out. Most likely it has something to do with memory > > >> allocations and DMA transfers failing, but no clear understanding yet. > > >> > > >> But to keep things simple let's only discuss the MSI problem on this > > >> thread, and discuss the timeouts in the another thread: > > >> > > >> http://lists.infradead.org/pipermail/ath11k/2020-November/000641.html > > >> > > >> I'll include you and other reporters to that thread. > > >> > > >> -- > > >> https://patchwork.kernel.org/project/linux-wireless/list/ > > >> > > >> https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches > > > Ok, I've tried a clean checkout of 5.10-rc2 with the one MSI patch > > > applied and 7fef431be9c9 reverted. I can't get my machine to boot > > > into anything usable with that configuration. I'm running ubuntu so > > > its starting right into X and sometime between showing the available > > > users and me clicking the icon to login the machine freezes. I can > > > see in the system tray that the wifi adapter is being activated and > > > appears to have associated with an AP, I just can't do much beyond > > > that as the keyboard backlight wakes up, but the caps lock key doesn't > > > work. I see similar behavior with the 5.9 configuration, but after a > > > reboot or two I win whatever race is occuring. With 5.10, I tried > > > maybe 10-15 times with 0 success. > > > > I can confirm this behavior on my configuration. I managed to login once > > and select the Wifi and connect to it. It seemed curiously enough be > > stable long enough to enter the Wifi passphrase. After the connection > > was established, the system hang and on each attempt to reboot into the > > graphical system it would freeze at some point (sometimes even before > > showing the login screen). > > > > Kernel was both based on 5.10-rc2 and 5.10-rc3 (I did see the same > > behavior) with the patch applied, 7fef431be9c9 reverted and firmware > > downloaded and copied to /lib/firmware/ath11k/QCA6390/hw2.0/. > > > > > > I did a bit more digging to see if I could find any new information, > I'm not sure I did but here's what I did / found. I spent the time to > get a kdump kernel running and enabled, I was able to SysRq-C (both > via keyboard and echo c > /proc/sysrq-trigger) and generate a crash > dump. Actually viewing them at the moment will require reverting a > couple of patches to printk to fix the file for the crash utility > (https://github.com/crash-utility/crash/issues/67), but right now > that's not super important since the mechanism isn't being triggered. > As reported here and by Mitchell, the adapter will work occasionally, > but more often it will hang the machine (I too tried 5.10-rc3 with no > noticable differences). Whatever is causing the system to hang isn't > triggering the kdump kernel to take over and dump the vmcore. I've > set watchdog=1 , nmi_watchdog=1, hung_task_panic=1, softlockup_panic=1 > trying to convince the kernel to dump it's state during this. I've > not been able to make it write a crash, it just sits 'hung'. One > interesting observation that may be related, is that if the lockup > occurs during my login, I can actually see the system grind to a halt > over the course of a number of frames (the rendering of the login > animations starts to stutter/get really slow, then after a few frames > everything is frozen). If something were spin locking/ed, I'd expect > the soft lockup panic to find it, but I don't know these mechanisms > well. > > The only consistent behavior that I managed to create is that if the > wifi adapter / machine are in a 'working' state (ie: I can browse the > internet, etc) and I issue sysrq-c to crash the kernel and then let > the crash dump write and reboot the machine, once booted the adapter > is no longer seen by the kernel, and there are zero messages in dmesg > that match "ath11k". The driver shows up in lsmod , but it reports > zero messages and it's like the adapter is completely invisible. A > power off and back on of the machine will re-enter it into the > freezing/wifi working lottery. Good evening all! Just wanted to follow up as I think I've started to uncover some of what's happening with the XPS and this driver. So since I can't get the kdump kernel to dump anything for me, I took a bit more of a naive approach. I blacklisted the modules (ath11k / ath11k_pci) from modprobe so I could at least control when it was loaded. I managed to capture a series of crashes (in phone pics, but I'll transcribe the relevant bits here) that seem to indicate some kind of runaway / spin locked behavior. In all but one case[*], both the crash and the eventual working state, the driver completely initialized successfully with messaging like this: [ 23.209335] ath11k_pci 0000:55:00.0: WARNING: ath11k PCI support is experimental! [ 23.209404] ath11k_pci 0000:55:00.0: BAR 0: assigned [mem 0x8e300000-0x8e3fffff 64bit] [ 23.209421] ath11k_pci 0000:55:00.0: enabling device (0000 -> 0002) [ 23.209502] ath11k_pci 0000:55:00.0: MSI vectors: 1 [ 23.454227] ath11k_pci 0000:55:00.0: Respond mem req failed, result: 1, err: 0 [ 23.454233] ath11k_pci 0000:55:00.0: qmi failed to respond fw mem req:-22 [ 23.455810] ath11k_pci 0000:55:00.0: req mem_seg[0] 0x27d00000 524288 1 [ 23.455814] ath11k_pci 0000:55:00.0: req mem_seg[1] 0x27d80000 524288 1 [ 23.455816] ath11k_pci 0000:55:00.0: req mem_seg[2] 0x27e00000 524288 1 [ 23.455817] ath11k_pci 0000:55:00.0: req mem_seg[3] 0x27e80000 294912 1 [ 23.455819] ath11k_pci 0000:55:00.0: req mem_seg[4] 0x27f00000 524288 1 [ 23.455820] ath11k_pci 0000:55:00.0: req mem_seg[5] 0x27f80000 524288 1 [ 23.455822] ath11k_pci 0000:55:00.0: req mem_seg[6] 0x27800000 458752 1 [ 23.455824] ath11k_pci 0000:55:00.0: req mem_seg[7] 0x27cc0000 131072 1 [ 23.455825] ath11k_pci 0000:55:00.0: req mem_seg[8] 0x27880000 524288 4 [ 23.455827] ath11k_pci 0000:55:00.0: req mem_seg[9] 0x27900000 360448 4 [ 23.455829] ath11k_pci 0000:55:00.0: req mem_seg[10] 0x27ca4000 16384 1 [ 23.466226] ath11k_pci 0000:55:00.0: chip_id 0x0 chip_family 0xb board_id 0xff soc_id 0xffffffff [ 23.466230] ath11k_pci 0000:55:00.0: fw_version 0x101c06cc fw_build_timestamp 2020-06-24 19:50 fw_build_id [ 23.677675] ath11k_pci 0000:55:00.0 wlp85s0: renamed from wlan0 So up until this point, everything is working without issues. Everything seems to spiral out of control a couple of seconds later when my system attempts to actually bring up the adapter. In most of the crash states I will see this: [ 31.286725] wlp85s0: send auth to ec:08:6b:27:01:ea (try 1/3) [ 31.390187] wlp85s0: send auth to ec:08:6b:27:01:ea (try 2/3) [ 31.391928] wlp85s0: authenticated [ 31.394196] wlp85s0: associate with ec:08:6b:27:01:ea (try 1/3) [ 31.396513] wlp85s0: RX AssocResp from ec:08:6b:27:01:ea (capab=0x411 status=0 aid=6) [ 31.407730] wlp85s0: associated [ 31.434354] IPv6: ADDRCONF(NETDEV_CHANGE): wlp85s0: link becomes ready And then either somewhere in that pile of messages, or a second or two after this my machine will start to stutter as I mentioned before, and then it either hangs, or I see this message (I'm truncating the timestamp): [ 35.xxxx ] sched: RT throttling activated After that moment, the machine is unresponsive. Sorry I can't seem to extract this data other than screenshots from my phone at the moment, you can see the dmesg output from 6 different hangs here: https://github.com/w1nk/ath11k-debug * - In the case where the driver didn't fully initialize successfully and hung; during the initialization right after the "MSI vectors: %d" printk, I started seeing these: [ 77.xxx ] alloc_contig_range: [88d8e0, 88d8e9) PFNs busy