> From: Sergey Shtylyov, Sent: Friday, November 17, 2023 3:22 AM
>
> On 11/16/23 5:43 AM, Yoshihiro Shimoda wrote:
> [...]
>
> >>> Fix races between ravb_tx_timeout_work() and functions of net_device_ops
> >>> and ethtool_ops by using rtnl_trylock() and rtnl_unlock(). Note that
> >>> since ravb_close() is under the rtnl lock and calls cancel_work_sync(),
> >>> ravb_tx_timeout_work() should calls rtnl_trylock(). Otherwise, a deadlock
> >>> may happen in ravb_tx_timeout_work() like below:
> >>>
> >>> CPU0                    CPU1
> >>> ravb_tx_timeout()
> >>> schedule_work()
> >>> ...
> >>>                         __dev_close_many()
> >>>                         // Under rtnl lock
> >>>                         ravb_close()
> >>>                         cancel_work_sync()
> >>>                         // Waiting
> >>> ravb_tx_timeout_work()
> >>> rtnl_lock()
> >>> // This is possible to cause a deadlock
> >>>
> >>> And, if rtnl_trylock() fails and the netif is still running,
> >>> rescheduling the work with 1 msec delayed. So, using
>
> Ah, you say 1 ms here but 10 ms in the code! Not good... :-)

Indeed...

> >>> schedule_delayed_work() instead of schedule_work().
> >>>
> >>> Fixes: c156633f1353 ("Renesas Ethernet AVB driver proper")
> >>> Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@xxxxxxxxxxx>
> >>> Reviewed-by: Sergey Shtylyov <s.shtylyov@xxxxxx>
> >>
> >> Hm, I haven't reviewed this version... :-)
> >
> > Oops, I should have dropped the tag...
> >
> >>> Reviewed-by: Simon Horman <horms@xxxxxxxxxx>
> >> [...]
> >>> diff --git a/drivers/net/ethernet/renesas/ravb.h b/drivers/net/ethernet/renesas/ravb.h
> >>> index e0f8276cffed..e9bb8ee3ba2d 100644
> >>> --- a/drivers/net/ethernet/renesas/ravb.h
> >>> +++ b/drivers/net/ethernet/renesas/ravb.h
> >>> @@ -1081,7 +1081,7 @@ struct ravb_private {
> >>>  	u32 cur_tx[NUM_TX_QUEUE];
> >>>  	u32 dirty_tx[NUM_TX_QUEUE];
> >>>  	struct napi_struct napi[NUM_RX_QUEUE];
> >>> -	struct work_struct work;
> >>> +	struct delayed_work work;
> >>
> >> Not sure this is justified...
> >> Then again, what do I know about workqueues? Not much... :-)
> >
> > I thought that the schedule_work() called the work function immediately.
> > So, I thought calling the schedule_work() from the work function caused
> > endless loop. However, it is not true. The schedule_work() just inserts
> > a work queue, and then the kernel calls the work function later.
> >
> > So, changing from work_struct to delayed_work is not needed for fixing
> > this issue, I think now. However, I have another concern about rescheduling
> > this work by schedule_work() here because it's possible to cause high CPU load
> > while the rtnl_lock() is held. So, I think we should call a sleep function
> > like usleep_range(1000, 2000) for instance before schedule_work().
> > But, what do you think?
>
> I think that a sleep before requeuing is pretty much the same as using
> a delayed work...

I think so. Since this is a fix patch, I think using a sleep function is
better than converting to delayed_work (a rough sketch follows below).
But, what do you think?

> >> [...]
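To make the suggestion concrete, here is a rough, untested sketch of what I
have in mind: keep struct work_struct as-is and, when rtnl_trylock() fails,
sleep briefly before requeueing (the 1-2 ms range is only an example):

static void ravb_tx_timeout_work(struct work_struct *work)
{
        struct ravb_private *priv = container_of(work, struct ravb_private,
                                                 work);
        struct net_device *ndev = priv->ndev;

        if (!rtnl_trylock()) {
                /* ravb_close() holds RTNL and waits for us in
                 * cancel_work_sync(), so we must not block on rtnl_lock().
                 * Sleep briefly so we do not spin while RTNL is held
                 * elsewhere, and requeue only while the interface is
                 * still up.
                 */
                if (netif_running(ndev)) {
                        usleep_range(1000, 2000);
                        schedule_work(&priv->work);
                }
                return;
        }

        /* ... existing recovery code, now running under RTNL ... */

        rtnl_unlock();
}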
> >>> diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c
> >>> index c70cff80cc99..ca7db8a5b412 100644
> >>> --- a/drivers/net/ethernet/renesas/ravb_main.c
> >>> +++ b/drivers/net/ethernet/renesas/ravb_main.c
> >>> @@ -1863,17 +1863,24 @@ static void ravb_tx_timeout(struct net_device *ndev, unsigned int txqueue)
> >>>  	/* tx_errors count up */
> >>>  	ndev->stats.tx_errors++;
> >>>
> >>> -	schedule_work(&priv->work);
> >>> +	schedule_delayed_work(&priv->work, 0);
> >>>  }
> >>>
> >>>  static void ravb_tx_timeout_work(struct work_struct *work)
> >>>  {
> >>> -	struct ravb_private *priv = container_of(work, struct ravb_private,
> >>> +	struct delayed_work *dwork = to_delayed_work(work);
> >>> +	struct ravb_private *priv = container_of(dwork, struct ravb_private,
> >>>  						 work);
> >>>  	const struct ravb_hw_info *info = priv->info;
> >>>  	struct net_device *ndev = priv->ndev;
> >>>  	int error;
> >>>
> >>> +	if (!rtnl_trylock()) {
> >>> +		if (netif_running(ndev))
> >>> +			schedule_delayed_work(&priv->work, msecs_to_jiffies(10));
>
> You could reuse dwork instead of &priv->work here...

I think so.

> >> The delay is rather arbitrary. Why not e.g. 1 ms?
> >
> > I think that 1 ms is enough.
>
> Seeing now that 1 ms was intended...

Yes...

Best regards,
Yoshihiro Shimoda

> > Best regards,
> > Yoshihiro Shimoda
> > [...]
>
> MBR, Sergey
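P.S.: If we keep the delayed_work approach from this version, the trylock
path with both comments applied would look roughly like this (untested
sketch, reusing dwork and using the intended 1 ms delay):

        if (!rtnl_trylock()) {
                /* ravb_close() holds RTNL and waits in cancel_work_sync(),
                 * so never block on rtnl_lock() here; retry a bit later
                 * instead, but only while the interface is still running.
                 */
                if (netif_running(ndev))
                        schedule_delayed_work(dwork, msecs_to_jiffies(1));
                return;
        }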