On 4/11/2023 3:51 AM, Niklas Schnelle wrote:
> After an error on the PCI link, the driver does not need to wait
> for the link to become functional again as a reset is required. Stop
> the wait loop in this case to accelerate the recovery flow.
>

Ok, so if the PCI link is completely offline (pci_channel_offline), we
bail out immediately and fail to recover, reporting that to the user
right away rather than only after the timeout completes. A system
administrator can then step in and perform the appropriate reset?

Essentially, we know that the device will never recover at this point,
so stop wasting time.

Makes sense.

> Co-developed-by: Alexander Schmidt <alexs@xxxxxxxxxxxxx>
> Signed-off-by: Alexander Schmidt <alexs@xxxxxxxxxxxxx>
> Reviewed-by: Leon Romanovsky <leonro@xxxxxxxxxx>
> Link: https://lore.kernel.org/r/20230403075657.168294-1-schnelle@xxxxxxxxxxxxx
> Signed-off-by: Niklas Schnelle <schnelle@xxxxxxxxxxxxx>
> ---

Reviewed-by: Jacob Keller <jacob.e.keller@xxxxxxxxx>

>  drivers/net/ethernet/mellanox/mlx5/core/health.c | 12 ++++++++++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/health.c b/drivers/net/ethernet/mellanox/mlx5/core/health.c
> index f9438d4e43ca..81ca44e0705a 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/health.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/health.c
> @@ -325,6 +325,8 @@ int mlx5_health_wait_pci_up(struct mlx5_core_dev *dev)
>  	while (sensor_pci_not_working(dev)) {
>  		if (time_after(jiffies, end))
>  			return -ETIMEDOUT;
> +		if (pci_channel_offline(dev->pdev))
> +			return -EIO;
>  		msleep(100);
>  	}
>  	return 0;
> @@ -332,10 +334,16 @@ int mlx5_health_wait_pci_up(struct mlx5_core_dev *dev)
>
>  static int mlx5_health_try_recover(struct mlx5_core_dev *dev)
>  {
> +	int rc;
> +
>  	mlx5_core_warn(dev, "handling bad device here\n");
>  	mlx5_handle_bad_state(dev);
> -	if (mlx5_health_wait_pci_up(dev)) {
> -		mlx5_core_err(dev, "health recovery flow aborted, PCI reads still not working\n");
> +	rc = mlx5_health_wait_pci_up(dev);
> +	if (rc) {
> +		if (rc == -ETIMEDOUT)
> +			mlx5_core_err(dev, "health recovery flow aborted, PCI reads still not working\n");
> +		else
> +			mlx5_core_err(dev, "health recovery flow aborted, PCI channel offline\n");
>  		return -EIO;
>  	}
>  	mlx5_core_err(dev, "starting health recovery flow\n");
>
> base-commit: 09a9639e56c01c7a00d6c0ca63f4c7c41abe075d
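
For anyone following along, the behaviour discussed above can be
sketched in plain userspace C. This is only an illustration, not the
driver code: struct fake_dev, wait_pci_up(), abort_reason() and the
max_polls budget are hypothetical stand-ins for mlx5_core_dev,
mlx5_health_wait_pci_up(), the two mlx5_core_err() messages and the
jiffies-based timeout. The check order mirrors the patch: timeout
first, then the offline test.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the device state the driver polls. */
struct fake_dev {
	bool pci_reads_working;	/* stands in for !sensor_pci_not_working() */
	bool channel_offline;	/* stands in for pci_channel_offline() */
	int polls;		/* poll iterations that actually waited */
};

/*
 * Wait until PCI reads work again. Give up with -ETIMEDOUT once the
 * poll budget is used up, or with -EIO as soon as the channel is
 * seen offline, since waiting longer cannot help in that case.
 */
int wait_pci_up(struct fake_dev *dev, int max_polls)
{
	while (!dev->pci_reads_working) {
		if (dev->polls >= max_polls)
			return -ETIMEDOUT;
		if (dev->channel_offline)
			return -EIO;
		dev->polls++;	/* the driver sleeps 100 ms here */
	}
	return 0;
}

/* Map the wait result to the message the recovery path would log. */
const char *abort_reason(int rc)
{
	if (rc == 0)
		return "none";
	if (rc == -ETIMEDOUT)
		return "PCI reads still not working";
	return "PCI channel offline";
}
```

With an offline channel, wait_pci_up() returns -EIO without consuming
any of the poll budget, while a device that merely stays unreadable
still runs through the whole budget before returning -ETIMEDOUT.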