On Wed, Feb 26, 2025 at 03:55:47PM +0000, Jon Hunter wrote: > > On 26/02/2025 10:59, Russell King (Oracle) wrote: > > On Wed, Feb 26, 2025 at 10:11:58AM +0000, Jon Hunter wrote: > > > > > > On 26/02/2025 10:02, Russell King (Oracle) wrote: > > > > On Tue, Feb 25, 2025 at 02:21:01PM +0000, Jon Hunter wrote: > > > > > Hi Russell, > > > > > > > > > > On 19/02/2025 20:57, Russell King (Oracle) wrote: > > > > > > So, let's try something (I haven't tested this, and its likely you > > > > > > will need to work it in to your other change.) > > > > > > > > > > > > Essentially, this disables the receive clock stop around the reset, > > > > > > something the stmmac driver has never done in the past. > > > > > > > > > > > > diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c > > > > > > index 1cbea627b216..8e975863a2e3 100644 > > > > > > --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c > > > > > > +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c > > > > > > @@ -7926,6 +7926,8 @@ int stmmac_resume(struct device *dev) > > > > > > rtnl_lock(); > > > > > > mutex_lock(&priv->lock); > > > > > > + phy_eee_rx_clock_stop(priv->dev->phydev, false); > > > > > > + > > > > > > stmmac_reset_queues_param(priv); > > > > > > stmmac_free_tx_skbufs(priv); > > > > > > @@ -7937,6 +7939,9 @@ int stmmac_resume(struct device *dev) > > > > > > stmmac_restore_hw_vlan_rx_fltr(priv, ndev, priv->hw); > > > > > > + phy_eee_rx_clock_stop(priv->dev->phydev, > > > > > > + priv->phylink_config.eee_rx_clk_stop_enable); > > > > > > + > > > > > > stmmac_enable_all_queues(priv); > > > > > > stmmac_enable_all_dma_irq(priv); > > > > > > > > > > > > > > > Sorry for the delay, I have been testing various issues recently and needed > > > > > a bit more time to test this. > > > > > > > > > > It turns out that what I had proposed last week does not work. I believe > > > > > that with all the various debug/instrumentation I had added, I was again > > > > > getting lucky. So when I tested again this week on top of vanilla v6.14-rc2, > > > > > it did not work :-( > > > > > > > > > > However, what you are suggesting above, all by itself, is working. I have > > > > > tested this on top of vanilla v6.14-rc2 and v6.14-rc4 and it is working > > > > > reliably. I have also tested on some other boards that use the same stmmac > > > > > driver (but use the Aquantia PHY) and I have not seen any issues. So this > > > > > does fix the issue I am seeing. > > > > > > > > > > I know we are getting quite late in the rc for v6.14, but not sure if we > > > > > could add this as a fix? > > > > > > > > The patch above was something of a hack, bypassing the layering, so I > > > > would like to consider how this should be done properly. > > > > > > > > I'm still wondering whether the early call to phylink_resume() is > > > > symptomatic of this same issue, or whether there is a PHY that needs > > > > phy_start() to be called to output its clock even with link down that > > > > we don't know about. > > > > > > > > The phylink_resume() call is relevant to this because I'd like to put: > > > > > > > > phy_eee_rx_clock_stop(priv->dev->phydev, > > > > priv->phylink_config.eee_rx_clk_stop_enable); > > > > > > > > in there to ensure that the PHY is correctly configured for clock-stop, > > > > but given stmmac's placement that wouldn't work. > > > > > > > > I'm then thinking of phylink_pre_resume() to disable the EEE clock-stop > > > > at the PHY. > > > > > > > > I think the only thing we could do is try solving this problem as per > > > > above and see what the fall-out from it is. I don't get the impression > > > > that stmmac users are particularly active at testing patches though, so > > > > it may take months to get breakage reports. > > > > > > > > > We can ask Furong to test as he seems to active and making changes, but > > > otherwise I am not sure how well it is being tested across various devices. > > > On the other hand, it feels like there are still lingering issues like this > > > with the driver and so I would hope this is moving in the right direction. > > > > > > Let me know if you have a patch you want me to test and I will run in on our > > > Tegra186, Tegra194 and Tegra234 devices that all use this. > > > > Do we think this needs to be a patch for the net tree or the net-next > > tree? I think we've established that it's been a long-standing bug, > > so maybe if we target net-next to give it more time to be tested? > > > > Yes I agree there is a long-standing issue here. What is unfortunate for > Linux v6.14 is that failure rate is much higher. However, I don't see what I > can really do about that. I can mark suspend as broken for Linux v6.14 for > this device and then hopefully we will get this resolved properly. If we put the patches in net-next, it can have longer to be tested - it won't go straight into 6.14, but will wait until after net-next gets merged, and it'll then be backported to 6.14 stable trees. I think the fix that I've outlined is too big and too risky to go straight into 6.14, but the smaller fix may be better, but would then need to be rewritten into the larger fix. -- RMK's Patch system: https://www.armlinux.org.uk/developer/patches/ FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!