On 7/9/2024 3:52 AM, Ilpo Järvinen wrote:
On Tue, 25 Jun 2024, Smita Koralahalli wrote:
Sorry for the delay here. Took some time to find a system to run experiments.
Comments inline.
On 6/19/2024 12:47 AM, Lukas Wunner wrote:
On Tue, Jun 18, 2024 at 02:23:21PM -0700, Smita Koralahalli wrote:
On 6/18/2024 11:51 AM, Smita Koralahalli wrote:
But IIUC LBMS is set by hardware but never cleared by hardware, so
if
we remove a device and power off the slot, it doesn't seem like
LBMS
could be telling us anything useful (what could we do in response
to
LBMS when the slot is empty?), so it makes sense to me to clear
it.
It seems like pciehp_unconfigure_device() does sort of PCI core
and
driver-related things and possibly could be something shared by
all
hotplug drivers, while remove_board() does things more specific to
the
hotplug model (pciehp, shpchp, etc).
From that perspective, clearing LBMS might fit better in
remove_board(). In that case, I wonder whether it should be done
after turning off slot power? This patch clears is *before*
turning
off the power, so I wonder if hardware could possibly set it again
before the poweroff?
While clearing LBMS in remove_board() here:
if (POWER_CTRL(ctrl)) {
pciehp_power_off_slot(ctrl);
+ pcie_capability_write_word(ctrl->pcie->port, PCI_EXP_LNKSTA,
PCI_EXP_LNKSTA_LBMS);
/*
* After turning power off, we must wait for at least 1 second
* before taking any action that relies on power having been
* removed from the slot/adapter.
*/
msleep(1000);
/* Ignore link or presence changes caused by power off */
atomic_and(~(PCI_EXP_SLTSTA_DLLSC | PCI_EXP_SLTSTA_PDC),
&ctrl->pending_events);
}
This can happen too right? I.e Just after the slot poweroff and before
LBMS
clearing the PDC/PDSC could be fired. Then
pciehp_handle_presence_or_link_change() would hit case "OFF_STATE" and
proceed with pciehp_enable_slot() ....pcie_failed_link_retrain() and
ultimately link speed drops..
So, I added clearing just before turning off the slot.. Let me know if I'm
thinking it right.
I guess I should have experimented before putting this comment out.
After talking to the HW/FW teams, I understood that, none of our CRBs support
power controller for NVMe devices, which means the "Power Controller Present"
in Slot_Cap is always false. That's what makes it a "surprise removal." If the
OS was notified beforehand and there was a power controller attached, the OS
would turn off the power with SLOT_CNTL. That's an "orderly" removal. So
essentially, the entire block from "if (POWER_CTRL(ctrl))" will never be
executed for surprise removal for us.
There could be board designs outside of us, with power controllers for the
NVME devices, which I'm not aware of.
This was added by 3943af9d01e9 ("PCI: pciehp: Ignore Link State Changes
after powering off a slot"). You can try reproducing it by writing "0"
to the slot's "power" file in sysfs, but your hardware needs to support
slot power.
Basically the idea is that after waiting for 1 sec, chances are very low
that any DLLSC or PDSC events caused by removing slot power may still
occur.
PDSC events occurring in our case aren't by removing slot power. It
should/will always happen on a surprise removal along with DLLSC for us. But
this PDSC is been delayed and happens after DLLSC is invoked and ctrl->state =
OFF_STATE in pciehp_disable_slot(). So the PDSC is mistook to enable slot in
pciehp_enable_slot() inside pciehp_handle_presence_or_link_change().
Arguably the same applies to LBMS changes, so I'd recommend to likewise
clear stale LBMS after the msleep(1000).
pciehp_ctrl.c only contains the state machine and higher-level logic of
the hotplug controller and all the actual register accesses are in helpers
in pciehp_hpc.c. So if you want to do it picture-perfectly, add a helper
in pciehp_hpc.c to clear LBMS and call that from remove_board().
That all being said, I'm wondering how this plays together with Ilpo's
bandwidth control driver?
https://lore.kernel.org/all/20240516093222.1684-1-ilpo.jarvinen@xxxxxxxxxxxxxxx/
I need to yet do a thorough reading of Ilpo's bandwidth control driver. Ilpo
please correct me if I misspeak something as I don't have a thorough
understanding.
Ilpo's bandwidth controller also checks for lbms count to be greater than zero
to bring down link speeds if CONFIG_PCIE_BWCTRL is true. If false, it follows
the default path to check LBMS bit in link status register. So if,
CONFIG_PCIE_BWCTRL is disabled by default we continue to see link speed drops.
Even, if BWCTRL is enabled, LBMS count is incremented to 1 in
pcie_bwnotif_enable() so likely pcie_lbms_seen() might return true thereby
bringing down speeds here as well if DLLLA is clear?
I did add code to clear the LBMS count in pciehp_unconfigure_device() in
part thanks to this patch of yours. Do you think it wouldn't work?
It works with BWCTRL enabled. Just my concern would be to keep the
clearing in pciehp_unconfigure_device() and not do it inside
POWER_CTRL(ctrl), in remove_board() as per the suggestions given above.
But I agree there would still be problem if BWCTRL is not enabled. I
already have to keep part of it enabled due to the Target Speed quirk
and now this is another case where just having it always on would be
beneficial.
Correct, it should always be on to not see the problem.
Would like to have this "LBMS clearing fix" accepted in sooner since its
breaking things on our systems! :)
Thanks
Smita.
IIUC, the bandwidth control driver will be in charge of handling LBMS
changes. So clearing LBMS behind the bandwidth control driver's back
might be problematic. Ilpo?
Yes, BW controller will take control of LBMS and other code should not
touch it directly (and LBMS will be kept cleared by the BW controller).
However, in this case I'll just need to adapt the code to replace the
LBMS clearing with resetting the LBMS count (if this patch is accepted
before BW controller), the resetting is already there anyway.
Also, since you've confirmed that this issue is fallout from
a89c82249c37 ("PCI: Work around PCIe link training failures"),
I'm wondering if the logic introduced by that commit can be
changed so that the quirk is applied more narrowly, i.e. *not*
applied to unaffected hardware, such as AMD's hotplug ports.
That would avoid the need to undo the effect of the quirk and
work around the downtraining you're seeing.
Maciej, any ideas?
Yeah I'm okay to go down to that approach as well. Any ideas would be helpful
here.
One thing I don't like in the Target Speed quirk is that it leaves the
Link Speed into the lower value if the quirk fails to bring the link up,
the quirk could restore the original Link Speed on failure to avoid these
problems. I even suggested that earlier, however, the downside of
restoring the original Link Speed is that it will require triggering yet
another retraining (perhaps we could avoid waiting for its completion
though since we expect it to fail).
It might be possible to eventually trigger the Target Speed quirk from the
BW controller but it would require writing some state machine so that the
quirk is not repeatedly attempted. It seemed to complicate things too much
to add such a state machine at this point.