On Wed, Apr 26, 2023 at 07:07:08AM +0800, Hillf Danton wrote:
> Given similar wait timeout[1], just taking lock on the waiter side is not
> enough wrt fixing the race, because in case job done on the waker side,
> waiter needs to wait again after timeout.
>

If I understand you correctly, you mean the case when a timeout occurs
during ath9k_wmi_ctrl_rx() callback execution. I suppose that if a timeout
has occurred on the waiter's side, the waiter should return immediately and
need not care what state the callback was in at that moment.

AFAICS, this is handled properly by taking wmi_lock on both the waiter and
waker sides, and there is no data corruption.

If a callback has not managed to do its work entirely (performing the
completion and subsequently waking the waiting thread is included here),
then, well, it is considered a timeout, in my opinion.

Your suggestion makes the wmi_cmd call give the belated callback a little
more chance to complete (although the timeout has actually expired). That
is probably good, but increasing the timeout value does that job, too. I
don't think it makes any sense on real hardware.

Or do you mean there is data corruption that is properly fixed with your
patch?

That is, I agree there can be a situation when a callback has done all the
logical work it should and just hasn't had enough time to perform the
completion before the timeout occurs on the waiter's side. And this
behaviour can be called "racy". But, technically, this seems to be a rather
valid timeout.

> [1] https://lore.kernel.org/lkml/9d9b9652-c1ac-58e9-2eab-9256c17b1da2@xxxxxxxxxxxxxxxxxxx/
>

I don't think it's a similar case because wait_for_completion_state() is
interruptible while wait_for_completion_timeout() is not.

> A correct fix looks like after putting pieces together
>
> +++ b/drivers/net/wireless/ath/ath9k/wmi.c
> @@ -238,6 +238,7 @@ static void ath9k_wmi_ctrl_rx(void *priv
>  		spin_unlock_irqrestore(&wmi->wmi_lock, flags);
>  		goto free_skb;
>  	}
> +	wmi->last_seq_id = 0;
>  	spin_unlock_irqrestore(&wmi->wmi_lock, flags);
>  
>  	/* WMI command response */
> @@ -339,9 +340,20 @@ int ath9k_wmi_cmd(struct wmi *wmi, enum
>  
>  	time_left = wait_for_completion_timeout(&wmi->cmd_wait, timeout);
>  	if (!time_left) {
> +		unsigned long flags;
> +		int wait = 0;
> +
>  		ath_dbg(common, WMI, "Timeout waiting for WMI command: %s\n",
>  			wmi_cmd_to_name(cmd_id));
> -		wmi->last_seq_id = 0;
> +
> +		spin_lock_irqsave(&wmi->wmi_lock, flags);
> +		if (wmi->last_seq_id == 0) /* job done on the waker side? */
> +			wait = 1;
> +		else
> +			wmi->last_seq_id = 0;
> +		spin_unlock_irqrestore(&wmi->wmi_lock, flags);
> +		if (wait)
> +			wait_for_completion(&wmi->cmd_wait);
>  		mutex_unlock(&wmi->op_mutex);
>  		return -ETIMEDOUT;
>  	}
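
To make the two timeout paths easier to compare, here is a rough userspace
model of the state handling only. It is a sketch, not driver code: struct
wmi_model and the model_*() helpers are hypothetical names, a pthread mutex
stands in for wmi_lock, the completion itself is not modelled, and real
sequence ids are assumed to be nonzero. model_timeout_lock_only() is my
assumed shape of the "just take the lock on the waiter side" variant.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two struct wmi fields involved. */
struct wmi_model {
	pthread_mutex_t lock;	/* plays the role of wmi->wmi_lock */
	uint16_t last_seq_id;	/* 0 means "no command outstanding" */
};

/*
 * Waker side with the quoted change applied: a response is accepted only
 * if its sequence number matches, and accepting it clears last_seq_id
 * under the lock. Returns true if the response was accepted.
 */
static bool model_ctrl_rx(struct wmi_model *w, uint16_t seq_no)
{
	bool accepted;

	pthread_mutex_lock(&w->lock);
	accepted = (seq_no == w->last_seq_id);
	if (accepted)
		w->last_seq_id = 0;
	pthread_mutex_unlock(&w->lock);
	return accepted;
}

/*
 * Timeout path as in the quoted diff. Returns true if the callback has
 * already claimed the command, i.e. the waiter would wait once more.
 */
static bool model_timeout_quoted(struct wmi_model *w)
{
	bool wait = false;

	pthread_mutex_lock(&w->lock);
	if (w->last_seq_id == 0)
		wait = true;
	else
		w->last_seq_id = 0;
	pthread_mutex_unlock(&w->lock);
	return wait;
}

/* Assumed lock-only timeout path: give up on the command unconditionally. */
static void model_timeout_lock_only(struct wmi_model *w)
{
	pthread_mutex_lock(&w->lock);
	w->last_seq_id = 0;
	pthread_mutex_unlock(&w->lock);
}
```

In both variants a response that arrives after the timeout path has run no
longer matches last_seq_id and is dropped. The variants differ only when
the callback got in first: the quoted diff then waits for the completion
before unlocking op_mutex, although it still returns -ETIMEDOUT.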