Hi Johannes, Thanks for your help, I just make some tests with the patch you provided. Yes, there is none warning calltrace again, and none other calltrace too. But for the tmp_list, when we abort a roc and add to the tail of tmp_list, We will delete it from tmp_list in the work async, we need to modify the prev and next pointer of the head of tmp_list, I think they are freed from the stack of function ieee80211_roc_purge(), it is dangerous. Felix Liao -----Original Message----- From: Johannes Berg [mailto:johannes@xxxxxxxxxxxxxxxx] Sent: Saturday, November 24, 2012 7:20 AM To: Felix Liao Cc: linux-wireless@xxxxxxxxxxxxxxx; Ocean Su; Joe Qiao; Bryan Phillippe Subject: Re: Badness at net/mac80211/offchannel.c:264 Hi Felix, Thanks for the report and the detailed description and analysis. > hostapd->ieee80211_start_roc_work[2146] local->roc_list: <empty> > hostapd->ieee80211_start_roc_work[2300] local->roc_list: <1> > roc d90e7480 started 0 abrt 0 hw_beg 0 notified 0 hw_start 0 dur > 60(60) > phy0->ieee80211_sw_roc_work[333] local->roc_list: <1> > roc d90e7480 started 0 abrt 0 hw_beg 0 notified 0 hw_start 0 dur > 60(60) > phy0->ieee80211_sw_roc_work[392] local->roc_list: <1> > roc d90e7480 started 1 abrt 0 hw_beg 0 notified 1 hw_start 0 dur > 60(60) > hostapd->ieee80211_roc_purge[465] local->roc_list: <1> > roc d90e7480 started 1 abrt 0 hw_beg 0 notified 1 hw_start 0 dur > 60(60) > hostapd->ieee80211_start_next_roc[264] roc d90e7480 cause call trace! > hostapd->ieee80211_roc_purge[480] local->roc_list: <1> > roc d90e7480 started 1 abrt 0 hw_beg 0 notified 1 hw_start 0 dur > 60(60) I think it would have been instrumental here to print out the different interfaces. It seems to me that the roc_purge() is called for a different interface, and doesn't actually remove any ROC work item at all. > I suspect the root cause of this issue is after ieee80211_sw_roc_work > set roc->start to true and queue it again, the hostapd just call > linux_set_iface_flags right now. Yes and no, hostapd is changing an interface to be down, but it would appear to be some other interface, not the one that's executing the ROC. > I just increase the delay time to magnify the probability of this > issue, then it is more easier to reproduce this issue. Makes sense. > So I think we should schedule the work as soon as possible, and should > set the delay time to zero, that is, > @@ -367,7 +367,7 @@ void ieee80211_sw_roc_work(struct work_struct > *work) > > roc->started = true; > - ieee80211_queue_delayed_work(&local->hw, &roc->work, > - msecs_to_jiffies(roc->duration)); > + ieee80211_queue_delayed_work(&local->hw, &roc->work, > + 0); definitely not, no, this would defeat the purpose of the ROC, to stay on the given channel for the given time -- now you've cut the time to zero which isn't what we want at all. > At last, to be save I think we should use a static tmp_list or a > global tmp_list to save the abort roc in ieee80211_roc_purge. No, that totally doesn't seem right, a static list here would completely break if the function was called on two CPUs at the same time (this can't happen right now, but still) > I had test this patch using the same testcase above 10 times, it runs > very well and never dump the warning calltrace, I can believe that, but it totally defeats the purpose of the ACS patch you're using :-) I think the root cause of the issue is that in ieee80211_roc_purge() we unconditionally call ieee80211_start_next_roc(), even if nothing was removed. So I think what happens overall is this: * roc is added on, e.g., wlan0 * roc starts running (roc->started becomes true) * hostapd decides to turn off some other interface (say moni.wlan0?) * therefore, ieee80211_roc_purge() is called _for that other interface_ * the loop in ieee80211_roc_purge() does nothing since there's no roc for the other interface and the original roc stays at the front of the list * we call ieee80211_start_next_roc() and it (correctly) warns So the first thing that comes to mind is that we must not call ieee80211_start_next_roc() when there was no ROC item removed from the list. However, it turns out we can completely remove the call, since the second loop (over the tmp_list) will actually clear all the ROC items, and run ieee80211_sw_roc_work() for each item. If the ROC was started, then this function will call ieee80211_start_next_roc(). So this simple patch should fix your issue: diff --git a/net/mac80211/offchannel.c b/net/mac80211/offchannel.c index 635c325..2138dc3 100644 --- a/net/mac80211/offchannel.c +++ b/net/mac80211/offchannel.c @@ -453,8 +453,6 @@ void ieee80211_roc_purge(struct ieee80211_sub_if_data *sdata) list_move_tail(&roc->list, &tmp_list); roc->abort = true; } - - ieee80211_start_next_roc(local); mutex_unlock(&local->mtx); list_for_each_entry_safe(roc, tmp, &tmp_list, list) { Please let me know if it doesn't, that would mean my analysis is completely wrong. johannes ��.n��������+%������w��{.n�����{���zW����ܨ}���Ơz�j:+v�����w����ޙ��&�)ߡ�a����z�ޗ���ݢj��w�f