Hi, I saw the following warning [1] twice when testing 1000Base-T SFP modules: [ 1481.682501] cdns-i2c ff030000.i2c: timeout waiting on completion [ 1481.692010] Marvell 88E1111 i2c:sfp-ge3:16: Master/Slave resolution failed [ 1481.699910] ------------[ cut here ]------------ [ 1481.705459] phy_check_link_status+0x0/0xe8: returned: -67 [ 1481.711448] WARNING: CPU: 2 PID: 67 at drivers/net/phy/phy.c:1233 phy_state_machine+0xac/0x2ec <snip> [ 1481.904544] macb ff0c0000.ethernet net1: Link is Down and a second time with some other errors too: [ 64.972751] cdns-i2c ff030000.i2c: xfer_size reg rollover. xfer aborted! [ 64.979478] cdns-i2c ff030000.i2c: xfer_size reg rollover. xfer aborted! [ 65.998108] cdns-i2c ff030000.i2c: timeout waiting on completion [ 66.010558] Marvell 88E1111 i2c:sfp-ge3:16: Master/Slave resolution failed [ 66.017856] ------------[ cut here ]------------ [ 66.022786] phy_check_link_status+0x0/0xcc: returned: -67 [ 66.028255] WARNING: CPU: 0 PID: 70 at drivers/net/phy/phy.c:1233 phy_state_machine+0xa4/0x2b8 <snip> [ 66.339533] macb ff0c0000.ethernet net1: Link is Down The chain of events is: - The I2C transaction times out for some reason (in the latter case due to a known hardware bug). - mdio-i2c converts the error response to a 0xffff return value - genphy_read_lpa sees that LPA_1000MSFAIL is set in MII_STAT1000 and returns -ENOLINK. This propagates up the calls stack. - phy_check_link_status returns -ENOLINK - phy_error_precise forces the link down with state = PHY_ERROR. The problem with this is that although the register read fails due to a temporary condition, the link goes down permanently (or at least until the admin cycles the interface state). I think some part of the stack should implement a retry mechanism, but I'm not sure which part. One idea could be to have mdio-i2c propagate negative errors instead of converting them to successful reads of 0xffff. But we would still need to handle that in the phy driver or in phy_state_machine. - Are I2C bus drivers supposed to be flaky like this? That is, are callers of i2c_transfer expected to handle the occasional spurious error? - Similarly, are MDIO bus drivers allowed to be flaky? - Is ETIMEDOUT even supposed to be recoverable? Maybe we should have cdns-i2c return EAGAIN instead so it gets retried by the bus arbitration logic in __i2c_transfer. - ENOLINK really seems like something which we could recover from by resetting the phy (or even just waiting a bit). Maybe we should have the phy state machine just switch to PHY_NOLINK? Of course, the best option would be to fix cdns-i2c to not be buggy, but the hardware itself is buggy in at least one of the above cases so that may not be practical. --Sean