Badness at net/mac80211/offchannel.c:264

Felix Liao <Felix.Liao@xxxxxxxxxxxxxx> · Fri, 23 Nov 2012 10:14:12 +0000

Hi Johannes,

I'm Felix Liao from WatchGuard. We hit a warning after updating the compat-wireless driver to the latest stable version, 3.6.6-1: it always dumps the following call trace at line 264 of offchannel.c:
[   82.205559] ------------[ cut here ]------------
[   82.210188] Badness at /opt/compat-wireless-3.6.6-1/net/mac80211/offchannel.c:264
[   82.210203] NIP: e69df0c0 LR: e69df298 CTR: 00000001
[   82.210215] REGS: d9761c30 TRAP: 0700   Tainted: P             (2.6.35.12)
[   82.210225] MSR: 00029000 <EE,ME,CE>  CR: 20222484  XER: 20000000
[   82.210248] TASK = d972cdb0[1578] 'hostapd' THREAD: d9760000
[   82.210257] GPR00: 00000001 d9761ce0 d972cdb0 d95b0a80 00007b2a ffffffff d9761c3e 00007b29 
[   82.210280] GPR08: 00007af0 d95b1358 00000039 00000000 80222484 100885c4 00000000 10059568 
[   82.210304] GPR16: 00000000 00000000 00000000 10081334 df9f7500 e6a12ad0 e6a12a20 00000001 
[   82.210327] GPR24: d95b1358 d95b1358 dfb83300 d95b1148 d95b1358 d9737480 d95b0a80 d9737480 
[   82.210429] NIP [e69df0c0] ieee80211_start_next_roc+0x38/0x198 [mac80211]
[   82.210462] LR [e69df298] ieee80211_roc_purge+0x78/0x248 [mac80211]
[   82.210471] Call Trace:
[   82.210591] [d9761ce0] [00000001] 0x1 (unreliable)
[   82.210632] [d9761d00] [e69df298] ieee80211_roc_purge+0x78/0x248 [mac80211]
[   82.210672] [d9761d50] [e69e5cd0] ieee80211_do_stop+0xb8/0x60c [mac80211]
[   82.210708] [d9761d80] [e69e623c] ieee80211_stop+0x18/0x2c [mac80211]
[   82.210734] [d9761d90] [c02d8ec4] __dev_close+0x8c/0xe0
[   82.210752] [d9761db0] [c02d5b24] __dev_change_flags+0x138/0x190
[   82.210768] [d9761dd0] [c02d8ca4] dev_change_flags+0x24/0x74
[   82.210785] [d9761df0] [c0338480] devinet_ioctl+0x7a0/0x8e8
[   82.210801] [d9761e60] [c0339d58] inet_ioctl+0xcc/0xf8
[   82.210818] [d9761e70] [c02c39f8] sock_ioctl+0x7c/0x330
[   82.210842] [d9761e90] [c00c3bb8] vfs_ioctl+0x34/0x8c
[   82.210858] [d9761ea0] [c00c3ddc] do_vfs_ioctl+0x94/0x7e0
[   82.210874] [d9761f10] [c00c4578] sys_ioctl+0x50/0x94
[   82.210895] [d9761f40] [c0010e78] ret_from_syscall+0x0/0x3c
[   82.210933] --- Exception: c01 at 0xfbfaaf4
[   82.210938]     LR = 0xfc85bcc
[   82.210945] Instruction dump:
[   82.210954] 7c691b78 93c10018 7c7e1b78 90010024 9361000c 93810010 93a10014 93e1001c 
[   82.210980] 87e908d8 7f9f4800 419e0060 881f0048 <0f000000> 2f800000 409e002c 81230058

The device is running as an AP and the driver is ath9k, so local->ops->remain_on_channel is NULL and mac80211 uses the software remain-on-channel path. One important point: I use the ACS patch (http://linuxwireless.org/en/users/Documentation/acs) in hostapd.

This issue has two notable properties. First, it only happens when hostapd starts; it does not happen again while hostapd is running (I replaced the WARN_ON_ONCE at line 264 with WARN_ON to confirm this). Second, after I disable the ACS function, the issue no longer occurs.

Here is the sequence as I understand it. When I start hostapd from the shell, it begins running the ACS functions, which send the REMAIN_ON_CHANNEL command (see hostapd_drv_remain_on_channel in the hostapd source) to cfg80211. cfg80211 then calls ieee80211_remain_on_channel() in mac80211, which calls ieee80211_start_roc_work(), which queues the work ieee80211_sw_roc_work() on local->workqueue. The work duration of 60 is set by hostapd_config_acs_defaults(), which the ACS patch provides.

Meanwhile, hostapd also sends the SIOCSIFFLAGS command (see linux_set_iface_flags in the hostapd source) to the net core, which calls dev_change_flags() to close the device. The call chain is dev_change_flags->ieee80211_stop->ieee80211_do_stop->ieee80211_roc_purge.

I verified that roc->started was set to true in ieee80211_sw_roc_work (line 368). The value of roc->started changes as follows (hostapd and phy0 are given by current->comm):

hostapd->ieee80211_start_roc_work[2146] local->roc_list: <empty>
hostapd->ieee80211_start_roc_work[2300] local->roc_list: <1>
    roc d90e7480 started 0 abrt 0 hw_beg 0 notified 0 hw_start 0 dur 60(60)
phy0->ieee80211_sw_roc_work[333] local->roc_list: <1>
    roc d90e7480 started 0 abrt 0 hw_beg 0 notified 0 hw_start 0 dur 60(60)
phy0->ieee80211_sw_roc_work[392] local->roc_list: <1>
    roc d90e7480 started 1 abrt 0 hw_beg 0 notified 1 hw_start 0 dur 60(60)
hostapd->ieee80211_roc_purge[465] local->roc_list: <1>
    roc d90e7480 started 1 abrt 0 hw_beg 0 notified 1 hw_start 0 dur 60(60)
hostapd->ieee80211_start_next_roc[264] roc d90e7480 cause call trace!
hostapd->ieee80211_roc_purge[480] local->roc_list: <1>
    roc d90e7480 started 1 abrt 0 hw_beg 0 notified 1 hw_start 0 dur 60(60)
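
The sequence in the trace above can be replayed with a small userspace model. This is only a sketch under stated assumptions: struct roc_model and the model_* functions are hypothetical names, not mac80211 code, and the check at offchannel.c:264 is assumed to fire when the entry at the head of the list already has started set, which is what the trace suggests.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for one ROC entry; not the kernel struct. */
struct roc_model {
    bool started;             /* set by the first pass of the work function */
    bool abort;               /* stays false here, matching "abrt 0" in the trace */
    unsigned int duration_ms; /* requested remain-on-channel time */
};

/*
 * First pass of the software ROC work: mark the entry started and
 * return the delay after which the second (finishing) pass would run.
 */
static unsigned int model_work_first_pass(struct roc_model *roc)
{
    roc->started = true;
    return roc->duration_ms;
}

/*
 * Assumed form of the check at offchannel.c:264: the warning fires when
 * the entry at the head of the list has already been started.
 */
static bool model_start_next_warns(const struct roc_model *head)
{
    return head != NULL && head->started;
}

/*
 * Replays the traced sequence: the interface goes down elapsed_ms after
 * the first pass. If the finishing pass has not run yet, the started
 * entry is still at the head of the list when the check runs.
 */
static bool model_stop_iface_warns(unsigned int elapsed_ms)
{
    struct roc_model roc = { false, false, 60 };
    unsigned int delay_ms = model_work_first_pass(&roc);

    if (elapsed_ms >= delay_ms)
        return model_start_next_warns(NULL); /* entry already finished and removed */
    return model_start_next_warns(&roc);     /* entry still listed, started == 1 */
}
```

In this model, stopping the interface at any point inside the 60 ms window reports the warning, and stopping it after the window does not. The model is deterministic and does not capture the scheduling asynchrony between the two kernel paths.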

I suspect the root cause is that hostapd calls linux_set_iface_flags right after ieee80211_sw_roc_work sets roc->started to true and queues the work again. When I increased the delay time to widen this window, the issue became much easier to reproduce. So I think we should schedule the work as soon as possible by setting the delay time to zero, that is,

--- a/net/mac80211/offchannel.c
+++ b/net/mac80211/offchannel.c
@@ -367,7 +367,7 @@ void ieee80211_sw_roc_work(struct work_struct *work)
 
                roc->started = true;
-               ieee80211_queue_delayed_work(&local->hw, &roc->work,
-                                            msecs_to_jiffies(roc->duration));
+               ieee80211_queue_delayed_work(&local->hw, &roc->work, 0);
        } else {
                /* finish this ROC */
  finish:
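
To give a feel for the size of the window that this change removes, here is a userspace approximation of the millisecond-to-tick conversion. It is a sketch only: model_msecs_to_jiffies is a hypothetical name, the rounding-up formula is a simplification of the kernel's msecs_to_jiffies() for tick rates that divide 1000, and the HZ values are assumptions because the tick rate of the 2.6.35.12 kernel above is not stated.

```c
/*
 * Userspace approximation of the kernel's msecs_to_jiffies(): converts
 * milliseconds to timer ticks, rounding up. hz is a parameter here;
 * in the kernel, HZ is a build-time constant.
 */
static unsigned long model_msecs_to_jiffies(unsigned int ms, unsigned int hz)
{
    return ((unsigned long)ms * hz + 999UL) / 1000UL;
}
```

With a duration of 60, the second pass of the work is delayed by 6 ticks at HZ=100, 15 ticks at HZ=250, or 60 ticks at HZ=1000. A delay of 0 converts to 0 ticks at any of these rates, so the work becomes runnable immediately.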

With this fix the issue is hard to reproduce, but since ieee80211_roc_purge() and ieee80211_sw_roc_work() are asynchronous, it still happens sometimes. So I tried deleting the roc from the list after queueing it, so that nothing else can reach it through the list. This change requires some further fixes.

Finally, to be safe, I think we should use a static or global tmp_list to hold the aborted rocs in ieee80211_roc_purge.

--------------
The complete patch to fix this issue is below:

$ git diff

diff --git a/net/mac80211/offchannel.c b/net/mac80211/offchannel.c
index 83608ac..0a9bb58 100644
--- a/net/mac80211/offchannel.c
+++ b/net/mac80211/offchannel.c
@@ -336,14 +336,6 @@ void ieee80211_sw_roc_work(struct work_struct *work)
        if (roc->abort)
                goto finish;
 
-       if (WARN_ON(list_empty(&local->roc_list)))
-               goto out_unlock;
-
-       if (WARN_ON(roc != list_first_entry(&local->roc_list,
-                                           struct ieee80211_roc_work,
-                                           list)))
-               goto out_unlock;
-
        if (!roc->started) {
                struct ieee80211_roc_work *dep;
 
@@ -361,17 +353,18 @@ void ieee80211_sw_roc_work(struct work_struct *work)
                list_for_each_entry(dep, &roc->dependents, list)
                        ieee80211_handle_roc_started(dep);
 
+               list_del(&roc->list); /* delete from the local->roc_list */
                /* if it was pure TX, just finish right away */
                if (!roc->duration)
                        goto finish;
 
                roc->started = true;
-               ieee80211_queue_delayed_work(&local->hw, &roc->work,
-                                            msecs_to_jiffies(roc->duration));
+               ieee80211_queue_delayed_work(&local->hw, &roc->work, 0);
        } else {
                /* finish this ROC */
  finish:
-               list_del(&roc->list);
+               if (roc->abort)
+                       list_del(&roc->list); /* delete from the tmp_list */
                started = roc->started;
                ieee80211_roc_notify_destroy(roc);
 
@@ -443,7 +436,7 @@ void ieee80211_roc_purge(struct ieee80211_sub_if_data *sdata)
 {
        struct ieee80211_local *local = sdata->local;
        struct ieee80211_roc_work *roc, *tmp;
-       LIST_HEAD(tmp_list);
+       static LIST_HEAD(tmp_list);
 
        mutex_lock(&local->mtx);
        list_for_each_entry_safe(roc, tmp, &local->roc_list, list) {

--------------
I have tested this patch 10 times using the same test case as above; it runs well and never dumps the warning call trace. Could you review this patch? If it is OK, how can I submit it to the repository?

Thanks,
- Felix
