Re: [PATCH] PHY: Fix no autoneg corner case

Heiner Kallweit <hkallweit1@xxxxxxxxx> · Thu, 28 Nov 2024 20:35:34 +0100



On 28.11.2024 15:54, Andrew Lunn wrote:
> On Thu, Nov 28, 2024 at 07:31:48AM +0100, Krzysztof Hałasa wrote:
>> Andrew,
>>
>> Andrew Lunn <andrew@xxxxxxx> writes:
>>
>>>> Unfortunately it's initially set based on the supported capability
>>>> rather than the actual hw setting.
>>>
>>> We need a clear definition of 'initially', and when does it actually
>>> matter.
>>>
>>> Initially, things like speed, duplex and set to UNKNOWN. They don't
>>> make any sense until the link is up. phydev->advertise is set to
>>> phydev->supported, so that we advertise all the capabilities of the
>>> PHY. However, at probe, this does not really matter, it is only when
>>> phy_start() is called is the hardware actually configured with what it
>>> should advertise, or even if it should do auto-neg or not.
>>>
>>> In the end, this might not matter.
>>
>> Nevertheless, it seems it does matter.
>>
>>>> While in most cases there is no
>>>> difference (i.e., autoneg is supported and on by default), certain
>>>> adapters (e.g. fiber optics) use fixed settings, configured in hardware.
>>>
>>> If the hardware is not capable of supporting autoneg, why is autoneg
>>> in phydev->supported? To me, that is the real issue here.
>>
>> Well, autoneg *IS* supported by the PHY in this case.
>> No autoneg in phydev->supported would mean I can't enable it if needed,
>> wouldn't it?
>>
>> It is supported but initially disabled.
>>
>> With current code, PHY correctly connects to the other side, all the
>> registers are valid etc., the PHY indicates, for example, a valid link
>> with 100BASE-FX full duplex etc.
>>
>> Yet the Linux netdev, ethtool etc. indicate no valid link, autoneg on,
>> and speed/duplex unknown. It's just completely inconsistent with the
>> real hardware state.
>>
>> It seems the phy/phylink code assumes the PHY starts with autoneg
>> enabled (if supported). This is simply an incorrect assumption.
> 
> This is sounding like a driver bug. When phy_start() is called it
> kicks off the PHY state machine. That should result in
> phy_config_aneg() being called. That function is badly named, since it
> is used both for autoneg and forced setting. The purpose of that call
> is to configure the PHY to the configuration stored in
> phydev->advertise, etc. So if the PHY by hardware defaults has autoneg
> disabled, but the configuration in phydev says it should be enabled,
> calling phy_config_aneg() should actually enabled autoneg. It is
> possible there is a phylib bug here, because we try to not to kick off
> autoneg if it is not needed, because it is slow. I've not looked at
> the code, but it could be we see there is link, and skip calling
> phy_config_aneg()? Maybe try booting with the cable disconnected so
> there is no link?
> 
If the PHY driver has no config_aneg() callback, then genphy_config_aneg()
-> genphy_check_and_restart_aneg() would set BMCR_ANENABLE.
Not sure about which PHY driver we're talking here, but if it has a
custom config_aneg(), then setting BMCR_ANENABLE may be missing there.

>> BTW if the code meant to enable autoneg, it would do exactly that -
>> enable it by writing to PHY command register.
> 
> Assuming bug free code.
> 
>> Then the hw and sw state
>> would be consistent again (though initial configuration would be
>> ignored, not very nice). Now the code doesn't enable autoneg, it only
>> *indicates* it's enabled and in reality it's not.
> 
> I would say there are two different issues here.
> 
> 1) It seems like we are not configuring the hardware to match phydev.
> 2) We are overwriting how the bootloader etc configured the hardware.
> 
> 2) is always hard, because how do we know the PHY is not messed up
> from a previous boot/crash cycle etc. In general, a driver should try
> to put the hardware into a well known state. If we have a clear use
> case for this, we can consider how to implement it.
> 
> 	Andrew
>