On 26/02/2025 16:00, Russell King (Oracle) wrote:
On Wed, Feb 26, 2025 at 03:55:47PM +0000, Jon Hunter wrote:
On 26/02/2025 10:59, Russell King (Oracle) wrote:
On Wed, Feb 26, 2025 at 10:11:58AM +0000, Jon Hunter wrote:
On 26/02/2025 10:02, Russell King (Oracle) wrote:
On Tue, Feb 25, 2025 at 02:21:01PM +0000, Jon Hunter wrote:
Hi Russell,
On 19/02/2025 20:57, Russell King (Oracle) wrote:
So, let's try something (I haven't tested this, and its likely you
will need to work it in to your other change.)
Essentially, this disables the receive clock stop around the reset,
something the stmmac driver has never done in the past.
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index 1cbea627b216..8e975863a2e3 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -7926,6 +7926,8 @@ int stmmac_resume(struct device *dev)
rtnl_lock();
mutex_lock(&priv->lock);
+ phy_eee_rx_clock_stop(priv->dev->phydev, false);
+
stmmac_reset_queues_param(priv);
stmmac_free_tx_skbufs(priv);
@@ -7937,6 +7939,9 @@ int stmmac_resume(struct device *dev)
stmmac_restore_hw_vlan_rx_fltr(priv, ndev, priv->hw);
+ phy_eee_rx_clock_stop(priv->dev->phydev,
+ priv->phylink_config.eee_rx_clk_stop_enable);
+
stmmac_enable_all_queues(priv);
stmmac_enable_all_dma_irq(priv);
Sorry for the delay, I have been testing various issues recently and needed
a bit more time to test this.
It turns out that what I had proposed last week does not work. I believe
that with all the various debug/instrumentation I had added, I was again
getting lucky. So when I tested again this week on top of vanilla v6.14-rc2,
it did not work :-(
However, what you are suggesting above, all by itself, is working. I have
tested this on top of vanilla v6.14-rc2 and v6.14-rc4 and it is working
reliably. I have also tested on some other boards that use the same stmmac
driver (but use the Aquantia PHY) and I have not seen any issues. So this
does fix the issue I am seeing.
I know we are getting quite late in the rc for v6.14, but not sure if we
could add this as a fix?
The patch above was something of a hack, bypassing the layering, so I
would like to consider how this should be done properly.
I'm still wondering whether the early call to phylink_resume() is
symptomatic of this same issue, or whether there is a PHY that needs
phy_start() to be called to output its clock even with link down that
we don't know about.
The phylink_resume() call is relevant to this because I'd like to put:
phy_eee_rx_clock_stop(priv->dev->phydev,
priv->phylink_config.eee_rx_clk_stop_enable);
in there to ensure that the PHY is correctly configured for clock-stop,
but given stmmac's placement that wouldn't work.
I'm then thinking of phylink_pre_resume() to disable the EEE clock-stop
at the PHY.
I think the only thing we could do is try solving this problem as per
above and see what the fall-out from it is. I don't get the impression
that stmmac users are particularly active at testing patches though, so
it may take months to get breakage reports.
We can ask Furong to test as he seems to active and making changes, but
otherwise I am not sure how well it is being tested across various devices.
On the other hand, it feels like there are still lingering issues like this
with the driver and so I would hope this is moving in the right direction.
Let me know if you have a patch you want me to test and I will run in on our
Tegra186, Tegra194 and Tegra234 devices that all use this.
Do we think this needs to be a patch for the net tree or the net-next
tree? I think we've established that it's been a long-standing bug,
so maybe if we target net-next to give it more time to be tested?
Yes I agree there is a long-standing issue here. What is unfortunate for
Linux v6.14 is that failure rate is much higher. However, I don't see what I
can really do about that. I can mark suspend as broken for Linux v6.14 for
this device and then hopefully we will get this resolved properly.
If we put the patches in net-next, it can have longer to be tested - it
won't go straight into 6.14, but will wait until after net-next gets
merged, and it'll then be backported to 6.14 stable trees.
Yes that would be great.
I think the fix that I've outlined is too big and too risky to go
straight into 6.14, but the smaller fix may be better, but would then
need to be rewritten into the larger fix.
I think it is fine and better to get it fixed for the long term.
Thanks
Jon
--
nvpublic