On Wed, Oct 2, 2024 at 12:34 PM Andrew Halaney <ahalaney@xxxxxxxxxx> wrote: > > On Tue, Oct 01, 2024 at 02:22:23PM GMT, Andrew Halaney wrote: > > On Mon, Sep 30, 2024 at 05:12:42PM GMT, Saravana Kannan wrote: > > > On Mon, Sep 30, 2024 at 2:28 PM Andrew Halaney <ahalaney@xxxxxxxxxx> wrote: > > > > > > > > Add support for parsing the phy-handle binding so that fw_devlink can > > > > enforce the dependency. This prevents MACs (that use this binding to > > > > claim they're using the corresponding phy) from probing prior to the > > > > phy, unless the phy is a child of the MAC (which results in a > > > > dependency cycle) or similar. > > > > > > > > For some motivation, imagine a device topology like so: > > > > > > > > ðernet0 { > > > > phy-mode = "sgmii"; > > > > phy-handle = <&sgmii_phy0>; > > > > > > > > mdio { > > > > compatible = "snps,dwmac-mdio"; > > > > sgmii_phy0: phy@8 { > > > > compatible = "ethernet-phy-id0141.0dd4"; > > > > reg = <0x8>; > > > > device_type = "ethernet-phy"; > > > > }; > > > > > > > > sgmii_phy1: phy@a { > > > > compatible = "ethernet-phy-id0141.0dd4"; > > > > reg = <0xa>; > > > > device_type = "ethernet-phy"; > > > > }; > > > > }; > > > > }; > > > > > > > > ðernet1 { > > > > phy-mode = "sgmii"; > > > > phy-handle = <&sgmii_phy1>; > > > > }; > > > > > > > > Here ethernet1 depends on sgmii_phy1 to function properly. In the below > > > > link an issue is reported where ethernet1 is probed and used prior to > > > > sgmii_phy1, resulting in a failure to get things working for ethernet1. > > > > With this change in place ethernet1 doesn't probe until sgmii_phy1 is > > > > ready, resulting in ethernet1 functioning properly. > > > > > > > > ethernet0 consumes sgmii_phy0, but this dependency isn't enforced > > > > via the device_links backing fw_devlink since ethernet0 is the parent of > > > > sgmii_phy0. Here's a log showing that in action: > > > > > > > > [ 7.000432] qcom-ethqos 23040000.ethernet: Fixed dependency cycle(s) with /soc@0/ethernet@23040000/mdio/phy@8 > > > > > > > > With this change in place ethernet1's dependency is properly described, > > > > and it doesn't probe prior to sgmii_phy1 being available. > > > > > > > > Link: https://lore.kernel.org/netdev/7723d4l2kqgrez3yfauvp2ueu6awbizkrq4otqpsqpytzp45q2@rju2nxmqu4ew/ > > > > Suggested-by: Serge Semin <fancer.lancer@xxxxxxxxx> > > > > Signed-off-by: Andrew Halaney <ahalaney@xxxxxxxxxx> > > > > --- > > > > I've marked this as an RFT because when looking through old mailing > > > > list discusssions and kernel tech talks on this subject, I was unable > > > > to really understand why in the past phy-handle had been left out. There > > > > were some loose references to circular dependencies (which seem more or > > > > less handled by fw_devlink to me), and the fact that a lot of behavior > > > > happens in ndo_open() (but I couldn't quite grok the concern there). > > > > > > > > I'd appreciate more testing by others and some feedback from those who > > > > know this a bit better to indicate whether fw_devlink is ready to handle > > > > this or not. > > > > > > > > At least in my narrow point of view, it's working well for me. > > > > > > I do want this to land and I'm fairly certain it'll break something. > > > But it's been so long that I don't remember what it was. I think it > > > has to do with the generic phy driver not working well with fw_devlink > > > because it doesn't go through the device driver model. > > > > Let me see if I can hack something up on this board (which has a decent > > dependency tree for testing this stuff) to use the generic phy driver > > instead of the marvell one that it needs and see how that goes. It won't > > *actually* work from a phy perspective, but it will at least test out > > the driver core bits here I think. > > > > > > > > But like you said, it's been a while and fw_devlink has improved since > > > then (I think). So please go ahead and give this a shot. If you can > > > help fix any issues this highlights, I'd really appreciate it and I'd > > > be happy to guide you through what I think needs to happen. But I > > > don't think I have the time to fix it myself. > > > > Sure, I tend to agree. Let me check the generic phy driver path for any > > issues and if that test seems to go okay I too am of the opinion that > > without any solid reasoning against this we enable it and battle through > > (revert and fix after the fact if necessary) any newly identified issues > > that prevent phy-handle and fw_devlink have with each other. > > > > Hmmm, yes the generic phy driver path for this > doesn't seem to work well. Its fine and dandy if there's > no device_link (current situation), but if there is one > (say with my patch and in my example above between ethernet1 and phy@a, > you can ignore the ethernet0 relationship since its a cycle > and therefore no device_link is created as mentioned in the patch) > you run into problems with the generic phy driver. > > In my original test you can see I use the marvell driver > for the phy. In that case things work well. In the generic phy > case however, ethernet1's probe is actually delayed far past > phy@a. Here's some logs that show that the device_link getting > "relaxed" due to no driver being bound, which has fw_devlink > thinking this supplier phy isn't going to get a driver ever, > so it finally tries to unblock (probe) the consumer (ethernet1): > > [ 40.695570] platform 23000000.ethernet: Relaxing link with stmmac-0:0a > [ 40.702274] CPU: 4 UID: 0 PID: 111 Comm: kworker/u34:1 Not tainted 6.12.0-rc1-next-20240930-00004-gb766c5527800-dirty #155 > [ 40.713605] Hardware name: Qualcomm SA8775P Ride (DT) > [ 40.718789] Workqueue: events_unbound deferred_probe_work_func > [ 40.724774] Call trace: > [ 40.727295] dump_backtrace+0x108/0x190 > [ 40.731233] show_stack+0x24/0x38 > [ 40.734638] dump_stack_lvl+0x40/0x88 > [ 40.738406] dump_stack+0x18/0x28 > [ 40.741811] fw_devlink_unblock_consumers+0x78/0xe8 > [ 40.746824] device_add+0x290/0x3f8 > [ 40.750411] phy_device_register+0x6c/0xd0 > [ 40.754615] fwnode_mdiobus_phy_device_register+0xe8/0x178 > [ 40.760246] fwnode_mdiobus_register_phy+0x214/0x268 > [ 40.765344] __of_mdiobus_parse_phys+0x80/0x280 > [ 40.769995] __of_mdiobus_register+0xd0/0x230 > [ 40.774465] stmmac_mdio_register+0x220/0x3c8 [stmmac] > [ 40.779755] stmmac_dvr_probe+0x91c/0xd70 [stmmac] > [ 40.784682] devm_stmmac_pltfr_probe+0x54/0xe0 [stmmac_platform] > [ 40.790846] qcom_ethqos_probe+0x404/0x438 [dwmac_qcom_ethqos] > [ 40.796830] platform_probe+0x94/0xd8 > > If I understand correctly that's because the generic phy driver > is bound during a MAC's (like ethernet1 here) phylink_fwnode_phy_connect() call > in ndo_open() currently.. here's another dump_stack() (yes I abuse that alot) > showing when that happens: > > [ 42.980611] net end1: Before phylink_fwnode_phy_connect > [ 42.986011] CPU: 4 UID: 0 PID: 310 Comm: NetworkManager Not tainted 6.12.0-rc1-next-20240930-00004-gb766c5527800-dirty #156 > [ 42.997436] Hardware name: Qualcomm SA8775P Ride (DT) > [ 43.002632] Call trace: > [ 43.005152] dump_backtrace+0x108/0x190 > [ 43.009106] show_stack+0x24/0x38 > [ 43.012518] dump_stack_lvl+0x40/0x88 > [ 43.016290] dump_stack+0x18/0x28 > [ 43.019701] phy_attach_direct+0x2d4/0x3e0 > [ 43.023918] phylink_fwnode_phy_connect+0xc4/0x178 > [ 43.028848] __stmmac_open+0x698/0x6e0 [stmmac] > [ 43.033534] stmmac_open+0x54/0xe0 [stmmac] > [ 43.037850] __dev_open+0x110/0x228 > [ 43.041442] __dev_change_flags+0xbc/0x1d0 > > > And here's the code for the binding of the generic phy driver: > > /** > * phy_attach_direct - attach a network device to a given PHY device pointer > * @dev: network device to attach > * @phydev: Pointer to phy_device to attach > * @flags: PHY device's dev_flags > * @interface: PHY device's interface > * > * Description: Called by drivers to attach to a particular PHY > * device. The phy_device is found, and properly hooked up > * to the phy_driver. If no driver is attached, then a > * generic driver is used. The phy_device is given a ptr to > * the attaching device, and given a callback for link status > * change. The phy_device is returned to the attaching driver. > * This function takes a reference on the phy device. > */ > int phy_attach_direct(struct net_device *dev, struct phy_device *phydev, > u32 flags, phy_interface_t interface) > { > struct mii_bus *bus = phydev->mdio.bus; > struct device *d = &phydev->mdio.dev; > struct module *ndev_owner = NULL; > bool using_genphy = false; > int err; > > /* For Ethernet device drivers that register their own MDIO bus, we > * will have bus->owner match ndev_mod, so we do not want to increment > * our own module->refcnt here, otherwise we would not be able to > * unload later on. > */ > if (dev) > ndev_owner = dev->dev.parent->driver->owner; > if (ndev_owner != bus->owner && !try_module_get(bus->owner)) { > phydev_err(phydev, "failed to get the bus module\n"); > return -EIO; > } > > get_device(d); > > /* Assume that if there is no driver, that it doesn't > * exist, and we should use the genphy driver. > */ > if (!d->driver) { > if (phydev->is_c45) > d->driver = &genphy_c45_driver.mdiodrv.driver; > else > d->driver = &genphy_driver.mdiodrv.driver; > > using_genphy = true; > dump_stack(); > } > > if (!try_module_get(d->driver->owner)) { > phydev_err(phydev, "failed to get the device driver module\n"); > err = -EIO; > goto error_put_device; > } > > if (using_genphy) { > err = d->driver->probe(d); > if (err >= 0) > err = device_bind_driver(d); > > if (err) > goto error_module_put; > } > > ... > } > > Something will need to be done for the generic phy driver case before > this patch could be considered acceptable as this would slow the boot time > for the topology I described in the patch description if the generic phy > driver was used. Right. And the way to do that is to move the generic phy driver matching to go through the normal probe model instead of doing a direct driver attach or directly calling the probe function. It's possible and I had a mental model a while ago, but didn't have the time to get around to it. Basically, if we find that none of the drivers match, we need to trigger something like -EPROBE_DEFER again and then match it with the generic phy driver. Or figure out some other way for the generic phy driver to NOT match if a better driver is available. Once we do that, I think the rest should be easy to fix. Let me know if there's anything I can do to help you move this forward. -Saravana