Re: wilc1000 kernel crash

Kirill Buksha <kirbuk200@xxxxxxxxx> · Tue, 4 Apr 2023 18:20:21 +0200

On 4.4.23. 03:30, Ajay.Kathat@xxxxxxxxxxxxx wrote:
> On 4/3/23 07:24, Kirill Buksha wrote:
>> On 16.12.22. 11:18, Michael Walle wrote:
>>> Hi,
>>>
>>> On 22/12/09 02:14, Ajay.Kathat@xxxxxxxxxxxxx wrote:
>>>> No progress yet. I tried to simulate the condition a few times but was
>>>> unable to see the exact failure in my setup so I need to try more.
>>> Shouldn't it also be possible to see the issue by code reading? I've
>>> provided the call tree in my previous mail and my concerns regarding
>>> the locking. Either I'm missing something there or there is no
>>> locking between these threads which could cause this issue.
>>>
>>>> For the other "FW not responding" continuous logs, I got some clue.
>>>> Probably, will try to send that patch first.
>>> Ok, let me know if you have some patches, I'm happy to test them.
>>>
>>> -michael
>>>
>>>
>> Hello,
>>
>> I faced the same kernel oops issue. After analyzing my logs and brief
>> debugging, I agree with Michael: the problem seems to be accessing the
>> scan_result pointer after it has been nulled.
> I have submitted a patch [1] that fixes the scan_result NULL pointer
> exception issue. The patch handles the synchronization between
> mac_close() and asynchronous interrupts from firmware. Basically, it
> blocks the execution of mac_close() until all pending work items have
> completed, and afterward no new work can be added because the close is
> in progress. It is worth trying that patch once and checking its
> behavior.
>
> 1.
> https://lore.kernel.org/linux-wireless/20230404012010.15261-1-ajay.kathat@xxxxxxxxxxxxx/T/#u

Thank you for the patch. I will take a look/test it when I have time.
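
To check that I understand the description, here is a minimal userspace
sketch of the idea (block close until pending work has drained, and refuse
new work once close has started). This is NOT the code from the patch: it
uses pthreads instead of kernel primitives, and every name in it
(dev_state, dev_work_begin(), dev_close(), demo_run(), ...) is hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for driver state; not taken from wilc1000. */
struct dev_state {
    pthread_mutex_t lock;
    pthread_cond_t idle;              /* signalled when pending drops to 0 */
    bool closing;                     /* set once close has started */
    int pending;                      /* work items currently in flight */
    void (*scan_result)(int *hits);   /* callback that close sets to NULL */
};

static void on_scan_result(int *hits) { (*hits)++; }

void dev_init(struct dev_state *d)
{
    pthread_mutex_init(&d->lock, NULL);
    pthread_cond_init(&d->idle, NULL);
    d->closing = false;
    d->pending = 0;
    d->scan_result = on_scan_result;
}

/* "Interrupt" side: refuse new work once close is in progress. */
bool dev_work_begin(struct dev_state *d)
{
    bool ok;

    pthread_mutex_lock(&d->lock);
    ok = !d->closing;
    if (ok)
        d->pending++;
    pthread_mutex_unlock(&d->lock);
    return ok;
}

/* Run the callback, then drop the in-flight count. */
void dev_work_end(struct dev_state *d, int *hits)
{
    /* pending > 0 here, so dev_close() cannot have cleared the pointer. */
    assert(d->scan_result != NULL);
    d->scan_result(hits);

    pthread_mutex_lock(&d->lock);
    if (--d->pending == 0)
        pthread_cond_broadcast(&d->idle);
    pthread_mutex_unlock(&d->lock);
}

/* Close side: stop new work, wait for in-flight work, then clear. */
void dev_close(struct dev_state *d)
{
    pthread_mutex_lock(&d->lock);
    d->closing = true;
    while (d->pending > 0)
        pthread_cond_wait(&d->idle, &d->lock);
    d->scan_result = NULL;
    pthread_mutex_unlock(&d->lock);
}

struct worker_arg { struct dev_state *d; int hits; };

static void *irq_thread(void *p)
{
    struct worker_arg *a = p;

    for (int i = 0; i < 100000; i++) {
        if (!dev_work_begin(a->d))
            break;                    /* close has started, stop */
        dev_work_end(a->d, &a->hits);
    }
    return NULL;
}

/* Returns 0 if close left no work in flight and rejects later work. */
int demo_run(void)
{
    struct dev_state d;
    struct worker_arg args[4];
    pthread_t tids[4];

    dev_init(&d);
    for (int i = 0; i < 4; i++) {
        args[i].d = &d;
        args[i].hits = 0;
        pthread_create(&tids[i], NULL, irq_thread, &args[i]);
    }
    dev_close(&d);
    for (int i = 0; i < 4; i++)
        pthread_join(tids[i], NULL);

    bool rejected = !dev_work_begin(&d);
    return (d.pending == 0 && d.scan_result == NULL && rejected) ? 0 : 1;
}
```

Built with "cc -pthread" and called from a main(), demo_run() should
return 0, and the assert in dev_work_end() should never fire, because the
pointer is only cleared after the in-flight count has reached zero.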

>> Regarding the solution: if there is a race between two threads (as
>> Michael described earlier), then I think that the locking mechanism will
>> be the most reliable solution. We ran into problems during
>> deinitialization, but the driver contains two more places
>> (handle_scan_done() and wilc_disconnect() functions in wilc1000/hif.c),
>> where scan_result is set to NULL.
>>
>> I use NetworkManager to manage networks and I have experienced the same
>> failure multiple times when switching from one WiFi network to another.
>> Keep in mind that switching between networks calls the wilc_disconnect()
>> and wilc_deinit() functions, and it is not yet clear which one is causing a
>> core dump. I think it's worth at least taking a look at these areas of
>> the code. What do you think?
> If possible, please share the sequence (commands) for the WiFi network
> switching scenario. It looks like both functions (mac_close & disconnect)
> are getting called from user context. mac_close() is a netdevice
> callback whereas wilc_disconnect() is a cfg80211 callback. Generally,
> wilc_disconnect() should be enough to disconnect from the current WiFi
> network without bringing the complete interface down. Is NetworkManager
> closing the interface (mac_close()) before switching the WiFi network?
>
>
> Regards,
> Ajay

The commands are as follows:
while true; do nmcli c up wlan0-client; nmcli c up wlan0-client-2; done

It takes about 5 minutes until I see the core dump.
I see the following message after every command:
...
wilc1000_sdio mmc0:0001:1 wlan0: Deinitializing wilc1000...
...
The message above comes from the wilc_wlan_deinitialize() function, which is
called from wilc_mac_close(). It seems that the interface is closed between
connections.

Best regards,
Kirill Buksha.