RE: [bug report] nvme not re-recognize when changed M.2 SSD in S3 mode

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Lukas,

We use SDEX card replace M.2 nvme card because we don't have enough M.2 nvme card 
We tried this patch as below situation: 
1.keep the SDEX card enter S3 then resume ->PASS 
the video can continue playing, not see the msg "device replace during system sleep" 

2. on S3 mode insert the SDEX card then resume -> PASS
Can identify the card and can see the msg "device replace during system sleep"

3.on S3 mode remove the SDEX card then resume -> PASS
No card show on system and can see the msg "device replace during system sleep"

4.replace card on S3 mode (different brand) ->PASS
Can identify the second card and can see the msg "device replace during system sleep"

5.replace card on S3 mode (same brand and same capacity) ->FAIL
Can not see the msg "device replace during system sleep"
Against 5 we found a issue, most card not declare capability "PCI_EXT_CAP_ID_DSN",
even if there is the capability the config space value are 0, cause pci_get_dsn() return 0 normally.
However these cards can show the SN on CrystalDiskInfo. 

below is the card pci config space log and lsblk disk info.
Sandisk Corp Device 5007 (prog-if 02 [NVM Express]) this card can see the "PCI_EXT_CAP_ID_DSN" capability
But value is 0
--------------------------------------------------------------------------------------------------------------------
cr@cr-desktop:~$ sudo lspci -s 02:00.0 -vvv
02:00.0 Non-Volatile memory controller: Sandisk Corp Device 5007 (prog-if 02 [NVM Express])
	Subsystem: Sandisk Corp Device 5007
	Physical Slot: 8
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 16
	Region 0: Memory at a3b00000 (64-bit, non-prefetchable) [size=16K]
	Region 4: Memory at a3b04000 (64-bit, non-prefetchable) [size=256]
	Capabilities: [80] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [90] MSI: Enable- Count=1/32 Maskable- 64bit+
		Address: 0000000000000000  Data: 0000
	Capabilities: [b0] MSI-X: Enable+ Count=17 Masked-
		Vector table: BAR=0 offset=00002000
		PBA: BAR=4 offset=00000000
	Capabilities: [c0] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 512 bytes, PhantFunc 0, Latency L0s <1us, L1 unlimited
			ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
		DevCtl:	CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
			RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
			MaxPayload 256 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 8GT/s, Width x1, ASPM L1, Exit Latency L1 <8us
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM L1 Enabled; RCB 64 bytes Disabled- CommClk+
			ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 8GT/s (ok), Width x1 (ok)
			TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range B, TimeoutDis+, NROPrPrP-, LTR+
			 10BitTagComp-, 10BitTagReq-, OBFF Not Supported, ExtFmt+, EETLPPrefix-
			 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
			 FRS-, TPHComp-, ExtTPHComp-
			 AtomicOpsCap: 32bit- 64bit- 128bitCAS-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Disabled
			 AtomicOpsCtl: ReqEn-
		LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
			 EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
	Capabilities: [100 v2] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
		AERCap:	First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
			MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
		HeaderLog: 00000000 00000000 00000000 00000000
	Capabilities: [150 v1] Device Serial Number 00-00-00-00-00-00-00-00
	Capabilities: [1b8 v1] Latency Tolerance Reporting
		Max snoop latency: 3145728ns
		Max no snoop latency: 3145728ns
	Capabilities: [300 v1] Secondary PCI Express
		LnkCtl3: LnkEquIntrruptEn-, PerformEqu-
		LaneErrStat: 0
	Capabilities: [900 v1] L1 PM Substates
		L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1- L1_PM_Substates+
			  PortCommonModeRestoreTime=32us PortTPowerOnTime=10us
		L1SubCtl1: PCI-PM_L1.2+ PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
			   T_CommonMode=0us LTR1.2_Threshold=90112ns
		L1SubCtl2: T_PwrOn=44us
	Kernel driver in use: nvme
	Kernel modules: nvme
--------------------------------------------------------------------------------------------------------------------

Here can see the Serial Number is "03969d74164000000000"
--------------------------------------------------------------------------------------------------------------------
cr@cr-desktop:~$ lsblk --nodeps -o name,size,type,mountpoint,serial
NAME      SIZE TYPE MOUNTPOINT                   SERIAL
loop0       4K loop /snap/bare/5                 
loop1    55.7M loop /snap/core18/2785            
loop2    55.7M loop /snap/core18/2790            
loop3    63.5M loop /snap/core20/1974            
loop4    91.7M loop /snap/gtk-common-themes/1535 
loop5    53.3M loop /snap/snapd/19457            
loop6   485.5M loop /snap/gnome-42-2204/120      
loop7    65.1M loop /snap/gtk-common-themes/1515 
loop8    73.9M loop /snap/core22/858             
loop9    12.3M loop /snap/snap-store/959         
loop10  485.5M loop /snap/gnome-42-2204/126      
loop11    219M loop /snap/gnome-3-34-1804/72     
loop12     51M loop /snap/snap-store/547         
loop13  218.4M loop /snap/gnome-3-34-1804/93     
loop14   63.9M loop /snap/sublime-text/122       
loop15  321.1M loop /snap/vlc/3721               
loop16   65.2M loop /snap/sublime-text/118       
sda     232.9G disk                              210350450505
nvme0n1 236.1G disk                              03969d74164000000000 
--------------------------------------------------------------------------------------------------------------------

we tried the 4 SDEX cards and 2 M.2 Nvme, 
either it is not declared or the value is 0 all the cards I have. 

RIcky

> Below please find a reworked patch which checks the Device Serial Number in
> addition to various other device identity information in config space.
> 
> I've moved the check for a replaced device to the ->resume_noirq phase, i.e.
> before the device driver's first access to the device.
> I'm also marking the device removed to prevent the driver from accessing it.
> 
> Does this fix the issue for you?
> 
> If it does reliably detect a replaced device, could you also double-check the
> opposite case, i.e. if the device is *not* replaced during system sleep?
> 
> I think this is about as much as we can do at the PCI layer to detect a replaced
> device.  Drivers may have additional ways to identify a device (such as
> reading a WWID from some register) and we could consider providing a library
> function which drivers could call if they detect a replaced device on resume.
> 
> If you set CONFIG_DYNAMIC_DEBUG=y and add the following to the command
> line...
> 
> pciehp.pciehp_debug=1 dyndbg="file pciehp* +p" log_buf_len=10M
> ignore_loglevel
> 
> ...you should see "device replaced during system sleep" messages on resume if
> a replaced device is detected.
> 
> -- >8 --
> 
> From: Lukas Wunner <lukas@xxxxxxxxx>
> Subject: [PATCH] PCI: pciehp: Detect device replacement during system sleep
> 
> Ricky reports that replacing a device in a hotplug slot during ACPI sleep state
> S3 does not cause re-enumeration on resume, as one would expect.  Instead,
> the new device is treated as if it was the old one.
> 
> There is no bulletproof way to detect device replacement, but as a heuristic,
> check whether the device identity in config space matches cached data in
> struct pci_dev (Vendor ID, Device ID, Class Code, Revision ID, Subsystem
> Vendor ID, Subsystem ID).  Additionally, cache and compare the Device Serial
> Number (PCIe r6.2 sec 7.9.3).  If a mismatch is detected, mark the old device
> disconnected (to prevent its driver from accessing the new device) and
> synthesize a Presence Detect Changed event.
> 
> Reported-by: Ricky Wu <ricky_wu@xxxxxxxxxxx>
> Closes:
> https://lore.kernel.org/r/a608b5930d0a48f092f717c0e137454b@xxxxxxxxxxx
> Signed-off-by: Lukas Wunner <lukas@xxxxxxxxx>
> ---
>  drivers/pci/hotplug/pciehp.h      |  4 ++++
>  drivers/pci/hotplug/pciehp_core.c | 42
> ++++++++++++++++++++++++++++++++++++++-
>  drivers/pci/hotplug/pciehp_hpc.c  |  5 +++++
> drivers/pci/hotplug/pciehp_pci.c  |  4 ++++
>  4 files changed, 54 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/hotplug/pciehp.h b/drivers/pci/hotplug/pciehp.h index
> e0a614a..273dd8c 100644
> --- a/drivers/pci/hotplug/pciehp.h
> +++ b/drivers/pci/hotplug/pciehp.h
> @@ -46,6 +46,9 @@
>  /**
>   * struct controller - PCIe hotplug controller
>   * @pcie: pointer to the controller's PCIe port service device
> + * @dsn: cached copy of Device Serial Number of Function 0 in the hotplug
> slot
> + *     (PCIe r6.2 sec 7.9.3); used to determine whether a hotplugged device
> + *     was replaced with a different one during system sleep
>   * @slot_cap: cached copy of the Slot Capabilities register
>   * @inband_presence_disabled: In-Band Presence Detect Disable supported
> by
>   *     controller and disabled per spec recommendation (PCIe r5.0,
> appendix I
> @@ -87,6 +90,7 @@
>   */
>  struct controller {
>         struct pcie_device *pcie;
> +       u64 dsn;
> 
>         u32 slot_cap;                           /* capabilities and
> quirks */
>         unsigned int inband_presence_disabled:1; diff --git
> a/drivers/pci/hotplug/pciehp_core.c b/drivers/pci/hotplug/pciehp_core.c
> index ddd55ad..ff458e6 100644
> --- a/drivers/pci/hotplug/pciehp_core.c
> +++ b/drivers/pci/hotplug/pciehp_core.c
> @@ -284,6 +284,32 @@ static int pciehp_suspend(struct pcie_device *dev)
>         return 0;
>  }
> 
> +static bool pciehp_device_replaced(struct controller *ctrl) {
> +       struct pci_dev *pdev __free(pci_dev_put);
> +       u32 reg;
> +
> +       pdev = pci_get_slot(ctrl->pcie->port->subordinate, PCI_DEVFN(0, 0));
> +       if (!pdev)
> +               return true;
> +
> +       if (pci_read_config_dword(pdev, PCI_VENDOR_ID, &reg) ||
> +           reg != (pdev->vendor | (pdev->device << 16)) ||
> +           pci_read_config_dword(pdev, PCI_CLASS_REVISION, &reg) ||
> +           reg != (pdev->revision | (pdev->class << 8)))
> +               return true;
> +
> +       if (pdev->hdr_type == PCI_HEADER_TYPE_NORMAL &&
> +           (pci_read_config_dword(pdev, PCI_SUBSYSTEM_VENDOR_ID,
> &reg) ||
> +            reg != (pdev->subsystem_vendor | (pdev->subsystem_device
> << 16))))
> +               return true;
> +
> +       if (pci_get_dsn(pdev) != ctrl->dsn)
> +               return true;
> +
> +       return false;
> +}
> +
>  static int pciehp_resume_noirq(struct pcie_device *dev)  {
>         struct controller *ctrl = get_service_data(dev); @@ -293,9 +319,23
> @@ static int pciehp_resume_noirq(struct pcie_device *dev)
>         ctrl->cmd_busy = true;
> 
>         /* clear spurious events from rediscovery of inserted card */
> -       if (ctrl->state == ON_STATE || ctrl->state == BLINKINGOFF_STATE)
> +       if (ctrl->state == ON_STATE || ctrl->state == BLINKINGOFF_STATE)
> + {
>                 pcie_clear_hotplug_events(ctrl);
> 
> +               /*
> +                * If hotplugged device was replaced with a different one
> +                * during system sleep, mark the old device disconnected
> +                * (to prevent its driver from accessing the new device)
> +                * and synthesize a Presence Detect Changed event.
> +                */
> +               if (pciehp_device_replaced(ctrl)) {
> +                       ctrl_dbg(ctrl, "device replaced during system
> sleep\n");
> +                       pci_walk_bus(ctrl->pcie->port->subordinate,
> +                                    pci_dev_set_disconnected,
> NULL);
> +                       pciehp_request(ctrl, PCI_EXP_SLTSTA_PDC);
> +               }
> +       }
> +
>         return 0;
>  }
>  #endif
> diff --git a/drivers/pci/hotplug/pciehp_hpc.c
> b/drivers/pci/hotplug/pciehp_hpc.c
> index b1d0a1b3..061f01f 100644
> --- a/drivers/pci/hotplug/pciehp_hpc.c
> +++ b/drivers/pci/hotplug/pciehp_hpc.c
> @@ -1055,6 +1055,11 @@ struct controller *pcie_init(struct pcie_device
> *dev)
>                 }
>         }
> 
> +       pdev = pci_get_slot(subordinate, PCI_DEVFN(0, 0));
> +       if (pdev)
> +               ctrl->dsn = pci_get_dsn(pdev);
> +       pci_dev_put(pdev);
> +
>         return ctrl;
>  }
> 
> diff --git a/drivers/pci/hotplug/pciehp_pci.c b/drivers/pci/hotplug/pciehp_pci.c
> index ad12515..65e50be 100644
> --- a/drivers/pci/hotplug/pciehp_pci.c
> +++ b/drivers/pci/hotplug/pciehp_pci.c
> @@ -72,6 +72,10 @@ int pciehp_configure_device(struct controller *ctrl)
>         pci_bus_add_devices(parent);
>         down_read_nested(&ctrl->reset_lock, ctrl->depth);
> 
> +       dev = pci_get_slot(parent, PCI_DEVFN(0, 0));
> +       ctrl->dsn = pci_get_dsn(dev);
> +       pci_dev_put(dev);
> +
>   out:
>         pci_unlock_rescan_remove();
>         return ret;
> --
> 2.43.0






[Index of Archives]     [DMA Engine]     [Linux Coverity]     [Linux USB]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [Greybus]

  Powered by Linux