On 11/16/23 5:43 AM, Yoshihiro Shimoda wrote: [...] >>> Fix races between ravb_tx_timeout_work() and functions of net_device_ops >>> and ethtool_ops by using rtnl_trylock() and rtnl_unlock(). Note that >>> since ravb_close() is under the rtnl lock and calls cancel_work_sync(), >>> ravb_tx_timeout_work() should calls rtnl_trylock(). Otherwise, a deadlock >>> may happen in ravb_tx_timeout_work() like below: >>> >>> CPU0 CPU1 >>> ravb_tx_timeout() >>> schedule_work() >>> ... >>> __dev_close_many() >>> // Under rtnl lock >>> ravb_close() >>> cancel_work_sync() >>> // Waiting >>> ravb_tx_timeout_work() >>> rtnl_lock() >>> // This is possible to cause a deadlock >>> >>> And, if rtnl_trylock() fails and the netif is still running, >>> rescheduling the work with 1 msec delayed. So, using Ah, you say 1 ms here but 10 ms in the code! Not good... :-) >>> schedule_delayed_work() instead of schedule_work(). >>> >>> Fixes: c156633f1353 ("Renesas Ethernet AVB driver proper") >>> Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@xxxxxxxxxxx> >>> Reviewed-by: Sergey Shtylyov <s.shtylyov@xxxxxx> >> >> Hm, I haven't reviewed this version... :-) > > Oops, I should have dropped the tag... > >>> Reviewed-by: Simon Horman <horms@xxxxxxxxxx> >> [...] >>> diff --git a/drivers/net/ethernet/renesas/ravb.h b/drivers/net/ethernet/renesas/ravb.h >>> index e0f8276cffed..e9bb8ee3ba2d 100644 >>> --- a/drivers/net/ethernet/renesas/ravb.h >>> +++ b/drivers/net/ethernet/renesas/ravb.h >>> @@ -1081,7 +1081,7 @@ struct ravb_private { >>> u32 cur_tx[NUM_TX_QUEUE]; >>> u32 dirty_tx[NUM_TX_QUEUE]; >>> struct napi_struct napi[NUM_RX_QUEUE]; >>> - struct work_struct work; >>> + struct delayed_work work; >> >> Not sure this is justified... >> Then again, what do I know about workqueues? Not much... :-) > > I thought that the schedule_work() called the work function immediately. > So, I thought call*ing the schedule_work() from the work function caused > endless loop. However, it is not true. The schedule_work() just inserts > a work queue, and then the kernel calls the work function later. > > So, changing from work_struct to delayed_work is not needed for fixing > this issue, I think now. However, I have another concern about rescheduling > this work by schedule_work() here because it's possible to cause high CPU load > while the rtnl_lock() is held. So, I think we should call a sleep function > like usleep_range(1000, 2000) for instance before schedule_work(). > But, what do you think? I think that a sleep before requeuing is pretty much the same as using a delayed work... >> [...] >>> diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c >>> index c70cff80cc99..ca7db8a5b412 100644 >>> --- a/drivers/net/ethernet/renesas/ravb_main.c >>> +++ b/drivers/net/ethernet/renesas/ravb_main.c >>> @@ -1863,17 +1863,24 @@ static void ravb_tx_timeout(struct net_device *ndev, unsigned int txqueue) >>> /* tx_errors count up */ >>> ndev->stats.tx_errors++; >>> >>> - schedule_work(&priv->work); >>> + schedule_delayed_work(&priv->work, 0); >>> } >>> >>> static void ravb_tx_timeout_work(struct work_struct *work) >>> { >>> - struct ravb_private *priv = container_of(work, struct ravb_private, >>> + struct delayed_work *dwork = to_delayed_work(work); >>> + struct ravb_private *priv = container_of(dwork, struct ravb_private, >>> work); >>> const struct ravb_hw_info *info = priv->info; >>> struct net_device *ndev = priv->ndev; >>> int error; >>> >>> + if (!rtnl_trylock()) { >>> + if (netif_running(ndev)) >>> + schedule_delayed_work(&priv->work, msecs_to_jiffies(10)); You could reuse dwork instead of &priv->work here... >> The delay is rather arbitrary. Why not e.g. 1 ms? > > I think that 1 ms is enough. Seeing now that 1 ms was intended... > Best regards, > Yoshihiro Shimoda [...] MBR, Sergey