On 11/25/2024 9:32 PM, James Prestwood wrote:
> Hi Baochen,
>
> On 9/4/24 6:46 PM, Baochen Qiang wrote:
>>
>> On 9/5/2024 2:03 AM, Jeff Johnson wrote:
>>> On 8/16/2024 5:04 AM, James Prestwood wrote:
>>>> Hi Baochen,
>>>>
>>>> On 8/16/24 3:19 AM, Baochen Qiang wrote:
>>>>> On 7/12/2024 9:11 PM, James Prestwood wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I've seen this error mentioned on random forum posts, but it's always associated with
>>>>>> a kernel crash/warning or some very obvious negative behavior. I've noticed this
>>>>>> occasionally and at one location very frequently during FT roaming, specifically
>>>>>> just after CMD_ASSOCIATE is issued. For our company-run networks I'm not seeing any
>>>>>> negative behavior apart from a 3 second delay in sending the re-association frame
>>>>>> since the kernel waits for this timeout. But we have some networks our clients run
>>>>>> on that we do not own (different vendor), and we are seeing association timeouts
>>>>>> after this error occurs and in some cases the AP is sending a deauthentication with
>>>>>> reason code 8 instead of replying with a reassociation reply and an error status,
>>>>>> which is quite odd.
>>>>>>
>>>>>> We are chasing this down with the vendor of these APs as well, but the behavior
>>>>>> always happens after we see this key removal failure/timeout on the client side. So
>>>>>> it would appear there is potentially a problem on both the client and AP. My guess
>>>>>> is _something_ about the re-association frame changes when this error is
>>>>>> encountered, but I cannot see how that would be the case. We are working to get
>>>>>> PCAPs now, but it's through a 3rd party, so that timing is out of my control.
>>>>>>
>>>>>> From the kernel code this error would appear innocuous: the old key is failing to
>>>>>> be removed but it gets immediately replaced by the new key. And we don't see that
>>>>>> addition failing. Am I understanding that logic correctly? I.e. this logic:
>>>>>>
>>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/net/mac80211/key.c#n503
>>>>>>
>>>>>> Below are a few kernel logs of the issue happening, some with the deauth being sent
>>>>>> by the AP, some with just timeouts:
>>>>>>
>>>>>> --- No deauth frame sent, just association timeouts after the error ---
>>>>>>
>>>>>> Jul 11 00:05:30 kernel: wlan0: disconnect from AP <previous BSS> for new assoc to <new BSS>
>>>>>> Jul 11 00:05:33 kernel: ath10k_pci 0000:02:00.0: failed to install key for vdev 0 peer <previous BSS>: -110
>>>>>> Jul 11 00:05:33 kernel: wlan0: failed to remove key (0, <previous BSS>) from hardware (-110)
>>>>>> Jul 11 00:05:33 kernel: wlan0: associate with <new BSS> (try 1/3)
>>>>>> Jul 11 00:05:33 kernel: wlan0: associate with <new BSS> (try 2/3)
>>>>>> Jul 11 00:05:33 kernel: wlan0: associate with <new BSS> (try 3/3)
>>>>>> Jul 11 00:05:33 kernel: wlan0: association with <new BSS> timed out
>>>>>> Jul 11 00:05:36 kernel: wlan0: authenticate with <new BSS>
>>>>>> Jul 11 00:05:36 kernel: wlan0: send auth to <new BSS>a (try 1/3)
>>>>>> Jul 11 00:05:36 kernel: wlan0: authenticated
>>>>>> Jul 11 00:05:36 kernel: wlan0: associate with <new BSS> (try 1/3)
>>>>>> Jul 11 00:05:36 kernel: wlan0: RX AssocResp from <new BSS> (capab=0x1111 status=0 aid=16)
>>>>>> Jul 11 00:05:36 kernel: wlan0: associated
>>>>>>
>>>>>> --- Deauth frame sent amidst the association timeouts ---
>>>>>>
>>>>>> Jul 11 00:43:18 kernel: wlan0: disconnect from AP <previous BSS> for new assoc to <new BSS>
>>>>>> Jul 11 00:43:21 kernel: ath10k_pci 0000:02:00.0: failed to install key for vdev 0 peer <previous BSS>: -110
>>>>>> Jul 11 00:43:21 kernel: wlan0: failed to remove key (0, <previous BSS>) from hardware (-110)
>>>>>> Jul 11 00:43:21 kernel: wlan0: associate with <new BSS> (try 1/3)
>>>>>> Jul 11 00:43:21 kernel: wlan0: deauthenticated from <new BSS> while associating (Reason: 8=DISASSOC_STA_HAS_LEFT)
>>>>>> Jul 11 00:43:24 kernel: wlan0: authenticate with <new BSS>
>>>>>> Jul 11 00:43:24 kernel: wlan0: send auth to <new BSS> (try 1/3)
>>>>>> Jul 11 00:43:24 kernel: wlan0: authenticated
>>>>>> Jul 11 00:43:24 kernel: wlan0: associate with <new BSS> (try 1/3)
>>>>>> Jul 11 00:43:24 kernel: wlan0: RX AssocResp from <new BSS> (capab=0x1111 status=0 aid=101)
>>>>>> Jul 11 00:43:24 kernel: wlan0: associated
>>>>>>
>>>>> Hi James, this is QCA6174, right? Could you also share the firmware version?
>>>> Yep, using:
>>>>
>>>> qca6174 hw3.2 target 0x05030000 chip_id 0x00340aff sub 1dac:0261
>>>> firmware ver WLAN.RM.4.4.1-00288- api 6 features wowlan,ignore-otp,mfp
>>>> crc32 bf907c7c
>>>>
>>>> I did try in one instance the latest firmware, 309, and still saw the
>>>> same behavior but 288 is what all our devices are running.
>>>>
>>>> Thanks,
>>>>
>>>> James
>>> Baochen, are you looking more into this? Would prefer to fix the root cause
>>> rather than take "[RFC 0/1] wifi: ath10k: improvement on key removal failure"
>> I asked the CST team to try to reproduce this issue so that we can get a firmware dump
>> for further debugging. What I got is that the CST team is currently busy with other
>> critical schedules and they are planning to debug this ath10k issue after those
>> schedules get finished.
>
> Any movement on this front? We are still carrying that RFC patch to work around the
> associated compatibility issues with Cisco APs when this timeout occurs.

I asked the test team again; the response is that hopefully they can get bandwidth next
week.

>
> While I do agree the RFC patch isn't optimal, trying to get a firmware fix for ~6 year old
> hardware also may not be very easy. fwiw we've been running the RFC patch for about 3
> months now, as of today it's running on over 4000 client devices. So IMO the patch itself
> is safe if there was any concern.

Thanks for the info.

>
> Thanks,
>
> James
>
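
For anyone who wants to check their own logs for the pattern described in this thread (a key removal that fails with -110, i.e. ETIMEDOUT, roughly 3 seconds after the roam starts, followed by an association timeout or a deauth), here is a rough Python sketch. It is only an illustration: the names `summarize`, `SAMPLE_LOG` and `LINE_RE` are made up for this example, the matching is plain substring checks against the message text quoted above, and the `year` default is an assumption because syslog timestamps carry no year.

```python
import re
from datetime import datetime

# Sample lines copied from the second log excerpt quoted in this thread.
SAMPLE_LOG = """\
Jul 11 00:43:18 kernel: wlan0: disconnect from AP <previous BSS> for new assoc to <new BSS>
Jul 11 00:43:21 kernel: ath10k_pci 0000:02:00.0: failed to install key for vdev 0 peer <previous BSS>: -110
Jul 11 00:43:21 kernel: wlan0: failed to remove key (0, <previous BSS>) from hardware (-110)
Jul 11 00:43:21 kernel: wlan0: associate with <new BSS> (try 1/3)
Jul 11 00:43:21 kernel: wlan0: deauthenticated from <new BSS> while associating (Reason: 8=DISASSOC_STA_HAS_LEFT)
Jul 11 00:43:24 kernel: wlan0: authenticate with <new BSS>
Jul 11 00:43:24 kernel: wlan0: send auth to <new BSS> (try 1/3)
Jul 11 00:43:24 kernel: wlan0: authenticated
Jul 11 00:43:24 kernel: wlan0: associate with <new BSS> (try 1/3)
Jul 11 00:43:24 kernel: wlan0: RX AssocResp from <new BSS> (capab=0x1111 status=0 aid=101)
Jul 11 00:43:24 kernel: wlan0: associated
"""

# "Mon DD HH:MM:SS kernel: <message>" as seen in the quoted logs.
LINE_RE = re.compile(r"^(\w{3} +\d+ \d\d:\d\d:\d\d) kernel: (.*)$")


def summarize(log_text, year=2024):
    """Return one dict per roam that hit a key-removal timeout.

    stall_s: seconds between the disconnect and the key-removal failure.
    outcome: first of 'assoc_timeout', 'deauth' or 'associated' seen afterwards.
    The year is an assumption; syslog stamps do not include one.
    """
    results = []
    roam_start = None  # time of "disconnect from AP ... for new assoc"
    pending = None     # roam with a key-removal timeout, outcome not yet seen
    for line in log_text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue
        ts = datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")
        msg = m.group(2)
        if "disconnect from AP" in msg and "for new assoc" in msg:
            roam_start, pending = ts, None
        elif "failed to remove key" in msg and "(-110)" in msg and roam_start:
            pending = {"stall_s": (ts - roam_start).total_seconds(), "outcome": None}
            results.append(pending)
        elif pending is not None and pending["outcome"] is None:
            if "timed out" in msg:
                pending["outcome"] = "assoc_timeout"
            elif "deauthenticated from" in msg:
                pending["outcome"] = "deauth"
            elif msg.endswith(": associated"):
                pending["outcome"] = "associated"
    return results


print(summarize(SAMPLE_LOG))  # [{'stall_s': 3.0, 'outcome': 'deauth'}]
```

Feeding it the first excerpt instead should report the same 3 second stall with an `assoc_timeout` outcome. It says nothing about the root cause; it only makes it easier to count how often the timeout precedes a failed reassociation.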