Am 11.03.25 um 14:22 schrieb Andrew Lunn:
On Tue, Mar 11, 2025 at 01:30:54PM +0100, Fiona Klute wrote:
Am 10.03.25 um 22:27 schrieb Andrew Lunn:
On Mon, Mar 10, 2025 at 05:59:31PM +0100, Fiona Klute wrote:
If a new reset event appears before the previous one has been
processed, the device can get stuck into a reset loop. This happens
rarely, but blocks the device when it does, and floods the log with
messages like the following:
lan78xx 2-3:1.0 enp1s0u3: kevent 4 may have been dropped
The only bit that the driver pays attention to in the interrupt data
is "link was reset". If there's a flapping status bit in that endpoint
data (such as if PHY negotiation needs a few tries to get a stable
link), polling at a slower rate allows the state to settle.
Could you expand on this a little bit more. What is the issue you are
seeing?
What happens is that *sometimes* when the interface is activated (up, im
my case via NetworkManager) during boot, the "kevent 4 may have been
dropped" message starts to be emitted about every 6 or 7 ms.
This sounding a bit like an interrupt storm. The PHY interrupt is not
being cleared correctly. PHY interrupts are level interrupts, so if
you don't clear the interrupt at the source, it will fire again as
soon as you re-enable it.
So which PHY driver is being used? If you look for the first kernel
message about the lan78xx it probably tells you.
[ 27.918335] Call trace:
[ 27.918338] console_flush_all+0x2b0/0x4f8 (P)
[ 27.918346] console_unlock+0x8c/0x170
[ 27.918352] vprintk_emit+0x238/0x3b8
[ 27.918357] dev_vprintk_emit+0xe4/0x1b8
[ 27.918364] dev_printk_emit+0x64/0x98
[ 27.918368] __netdev_printk+0xc8/0x228
[ 27.918376] netdev_info+0x70/0xa8
[ 27.918382] phy_print_status+0xcc/0x138
[ 27.918386] lan78xx_link_status_change+0x78/0xb0
[ 27.918392] phy_link_change+0x38/0x70
[ 27.918398] phy_check_link_status+0xa8/0x110
[ 27.918405] _phy_start_aneg+0x5c/0xb8
[ 27.918409] lan88xx_link_change_notify+0x5c/0x128
[ 27.918416] _phy_state_machine+0x12c/0x2b0
[ 27.918420] phy_state_machine+0x34/0x80
[ 27.918425] process_one_work+0x150/0x3b8
[ 27.918432] worker_thread+0x2a4/0x4b8
[ 27.918438] kthread+0xec/0xf8
[ 27.918442] ret_from_fork+0x10/0x20
[ 27.918534] lan78xx 2-3:1.0 enp1s0u3: kevent 4 may have been dropped
[ 27.924985] lan78xx 2-3:1.0 enp1s0u3: kevent 4 may have been dropped
Ah, O.K. This tells me the PHY is a lan88xx. And there is a workaround
involved for an issue in this PHY. Often PHYs are driven by polling
for status changes once per second. Not all PHYs/boards support
interrupts. It could be this workaround has only been tested with
polling, not interrupts, and so is broken when interrupts are used.
As a quick hack test, in lan78xx_phy_init()
/* if phyirq is not set, use polling mode in phylib */
if (dev->domain_data.phyirq > 0)
phydev->irq = dev->domain_data.phyirq;
else
phydev->irq = PHY_POLL;
Hard code phydev->irq to PHY_POLL, so interrupts are not used.
See if you can reproduce the issue when interrupts are not used.
Thank you, I'll test that. Given the issue appears rarely it'll
unfortunately take a while to be (mostly) sure.
Best regards,
Fiona