Re: wilc1000 kernel crash

<Ajay.Kathat@xxxxxxxxxxxxx> · Tue, 4 Apr 2023 01:30:09 +0000

On 4/3/23 07:24, Kirill Buksha wrote:
> 
> On 16.12.22. 11:18, Michael Walle wrote:
>> Hi,
>>
>> On 22/12/09 02:14, Ajay.Kathat@xxxxxxxxxxxxx wrote:
>>> No progress yet. I tried to simulate the condition a few times but was
>>> unable to see the exact failure in my setup so I need to try more.
>> Shouldn't it also be possible to see the issue by code reading? I've
>> provided the call tree in my previous mail and my concerns regarding
>> the locking. Either I'm missing something there or there is no
>> locking between these threads which could cause this issue.
>>
>>> For the other "FW not responding" continuous logs, I got some clue.
>>> Probably, will try to send that patch first.
>> Ok, let me know if you have some patches, I'm happy to test them.
>>
>> -michael
>>
>>
> 
> Hello,
> 
> I faced the same kernel oops issue. After analyzing my logs and brief
> debugging, I agree with Mikhail: the problem seems to be accessing the
> scan_result pointer after it has been nulled.

I have submitted a patch [1] that has a fix for the scan_result NULL
pointer exception issue. The patch handles the synchronization between
mac_close() and asynchronous interrupts from the firmware. Basically, it
blocks the execution of mac_close() until all pending work has
completed, and afterward no new work can be added because the close is
in progress. It is worth trying that patch once and checking its
behavior.

[1] https://lore.kernel.org/linux-wireless/20230404012010.15261-1-ajay.kathat@xxxxxxxxxxxxx/T/#u
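
To illustrate the idea described above (close waits for pending work and
refuses new work once it has started), here is a minimal userspace
sketch using pthreads. It is not the submitted patch and not wilc1000
code: struct work_gate and the gate_*() helpers are hypothetical names
made up for this example, and the actual driver may well use different
kernel primitives.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical model only; none of these names exist in wilc1000. */
struct work_gate {
	pthread_mutex_t lock;
	pthread_cond_t idle;	/* signalled when pending drops to zero */
	int pending;		/* works that are queued or running */
	bool closing;		/* set once the close path has started */
};

static void gate_init(struct work_gate *g)
{
	pthread_mutex_init(&g->lock, NULL);
	pthread_cond_init(&g->idle, NULL);
	g->pending = 0;
	g->closing = false;
}

/* "Interrupt" side: returns false if close is already in progress. */
static bool gate_try_add_work(struct work_gate *g)
{
	bool accepted;

	pthread_mutex_lock(&g->lock);
	accepted = !g->closing;
	if (accepted)
		g->pending++;
	pthread_mutex_unlock(&g->lock);
	return accepted;
}

/* Called when a previously accepted work item has finished. */
static void gate_work_done(struct work_gate *g)
{
	pthread_mutex_lock(&g->lock);
	assert(g->pending > 0);
	if (--g->pending == 0)
		pthread_cond_broadcast(&g->idle);
	pthread_mutex_unlock(&g->lock);
}

/* Close path: reject new work, then block until pending work drains. */
static void gate_close(struct work_gate *g)
{
	pthread_mutex_lock(&g->lock);
	g->closing = true;
	while (g->pending > 0)
		pthread_cond_wait(&g->idle, &g->lock);
	pthread_mutex_unlock(&g->lock);
}
```

In this model, shared state such as a scan result pointer would only be
cleared after gate_close() returns, so a late work item can neither be
running nor be queued at that point.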

> 
> Regarding the solution: if there is a race between two threads (as
> Michael described earlier), then I think that the locking mechanism will
> be the most reliable solution. We ran into problems during
> deinitialization, but driver contains two more places
> (handle_scan_done() and wilc_disconnect() functions in wilc1000/hif.c),
> where scan_result is set to NULL.
> 
> I use NetworkManager to manage networks and I have experienced the same
> failure multiple times when switching from one WiFi network to another.
> Keep in mind that switching between networks calls wilc_disconnect() and
> wilc_deinit() functions and it is not yet clear which one is causing a
> core dump. I think it's worth at least taking a look at these areas of
> the code. What do you think?

If possible, please share the sequence (commands) for the WiFi network
switching scenario. It looks like both functions (mac_close() and
wilc_disconnect()) are called from user context. mac_close() is a
netdevice callback, whereas wilc_disconnect() is a cfg80211 callback.
Generally, wilc_disconnect() should be enough to disconnect from the
current WiFi network without bringing the complete interface down. Is
NetworkManager closing the interface (mac_close()) before switching the
WiFi network?

Regards,
Ajay