RE: Issue about PCI physical slot fetch incorrect number


 



Hi Bjorn,
Sorry for the late response, and thank you for answering my question.
There are a few things I would like to clarify with you.
1. Is the physical slot number associated with the configuration of the device itself or with the configuration of the device's parent?
2. As I understand it, another team is also using the AMD MI300 GPU, and I have found that the lspci -xxx output differs between our team (team 1)
  and theirs (team 2). Our lspci -xxx dump only lists content up to 0xff, whereas their dump lists content up to
  0xfff, which means they have additional content from 0x100 to 0xfff.
  ->Is there any OS setting we can enable in order to see the whole content?
  ->Is this additional content related to the physical slot number, or does it have any impact on how the physical slot number is shown?
3. Based on the response you gave:
 Slot numbering is messy because there are several sources of information, e.g., the Physical Slot Number in the Slot Capabilities register, SMBIOS table, ACPI _DSM     
 methods, etc., and they are not all coordinated.  So the kernel goes to some trouble to come up with a unique "slot number" for each slot.
 ->All of these end up under the path /sys/bus/pci/slots/. May I know how they are organized there? Is there specific code in lspci that we can trace?
4. The attached files contain partial lspci -tv, lspci -vvv, and lspci -xxx output from our team (team 1) and from the other team (team 2). Please find more details
  in these attachments. In particular, I would like to focus on the GPU region. As you can see in team 2's -vvv output for 3d:00.0, the MI300 GPU does not show a physical
  slot number, whereas in our team's (team 1) -vvv output for 33:00.0, the GPU shows physical slot number "0".
5. A screenshot of the slots under /sys/bus/pci/slots is also attached. I can't find the path you gave, "/sys/bus/pci/slots/*/address".
  ->Judging from the screenshot, I think the slot numbers listed under /sys/bus/pci/slots are already incorrect. Could you help me understand how these entries are constructed,
  so that we can make some modifications?
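
On question 2, one possible explanation (an assumption, not confirmed from the attachments) is that the two dumps were taken with different lspci options: lspci -xxx dumps the 256-byte configuration space, while lspci -xxxx dumps the 4096-byte extended configuration space of PCI Express devices, and both need root. The Python sketch below classifies how much of a device's config space is readable through sysfs. The helper names and the example address are hypothetical.

```python
from pathlib import Path


def describe_config_length(n_bytes: int) -> str:
    """Classify a config-space dump by its length in bytes."""
    if n_bytes >= 4096:
        return "extended (0x000-0xfff)"
    if n_bytes >= 256:
        return "conventional (0x00-0xff)"
    if n_bytes >= 64:
        return "standard header only (0x00-0x3f)"
    return "truncated"


def readable_config_bytes(bdf: str, sysfs_root: str = "/sys/bus/pci/devices") -> int:
    """Return how many config-space bytes the current user can read.

    `bdf` is a full address such as "0000:33:00.0" (example value only).
    Unprivileged reads are normally limited to the first 64 bytes.
    """
    return len((Path(sysfs_root) / bdf / "config").read_bytes())


# Example use on a live system (run as root to see the full space):
#   n = readable_config_bytes("0000:33:00.0")
#   print(n, describe_config_length(n))
```

If both teams see 4096 readable bytes as root, the difference in the dumps is more likely the lspci option than an OS setting.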

Really appreciate your help and hope to hear from you soon.
Thanks in advance.

BR,
Erin

-----Original Message-----
From: Bjorn Helgaas <helgaas@xxxxxxxxxx> 
Sent: Saturday, August 24, 2024 5:03 AM
To: Erin Tsao/WHQ/Wistron <Erin_Tsao@xxxxxxxxxxx>
Cc: Linux-PCI Mailing List <linux-pci@xxxxxxxxxxxxxxx>; Martin Mareš <mj@xxxxxx>
Subject: Re: Issue about PCI physical slot fetch incorrect number

Hi Erin, thanks for your question.

On Fri, Aug 23, 2024 at 08:51:58PM +0200, Martin Mareš wrote:
> Hi!
> 
> > This is Erin from Taiwan. I have a question about physical slot 
> > number.  Currently we are working on the PCIE slot number assigning 
> > by PCIE switch. In the PCIe slot assignment process, the slot 
> > numbers are assigned to bridges first, and then the end devices 
> > fetch the slot ID from the bridge in the upper layer.
> > 
> > I have observed that under our PCIE switch, GPUs will create a 
> > bridge before reaching the end device. If GPUs also fetch the slot 
> > ID from the upper bridge layer, they may retrieve incorrect values.
> > 
> > Our GPU will get the physical slot number with number “0”, and show 
> > the slot number “0”、”0-1” , etc.
> > May I ask
> > 
> >   1.  Why GPU will fetch the slot number “0”? Is the slot number
> >   assigned to GPU related to any register? Or can we set any bit
> >   to fetch the right number?
> >
> >   2.  Is there any possible for us not to show the physical slot
> >   number of GPU?

Can you supply logs showing what you see and what's incorrect?

For example, if lspci is showing the wrong thing, can you provide the complete output of "sudo lspci -vv" and indicate which things are wrong?

If the kernel dmesg log is wrong, can you supply that output and point out what's wrong?

Also, I think slots are exposed in /sys, so please include the output of "grep . /sys/bus/pci/slots/*/address".
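
The shell expands the * in that path to every slot directory, so the command prints one line per slot. A rough Python equivalent is sketched below, assuming each address file holds domain:bus:device in hex (possibly without the device part). The helper names are hypothetical.

```python
from pathlib import Path


def parse_slot_address(text):
    """Parse the contents of a slot 'address' file.

    Assumed format: "dddd:bb:dd" in hex, with the device part optional.
    Returns (domain, bus, device), where device may be None.
    """
    parts = [int(p, 16) for p in text.strip().split(":")]
    domain, bus = parts[0], parts[1]
    device = parts[2] if len(parts) > 2 else None
    return domain, bus, device


def list_slots(root="/sys/bus/pci/slots"):
    """Map each slot directory name to its parsed address."""
    root_path = Path(root)
    if not root_path.is_dir():
        return {}
    slots = {}
    for addr_file in sorted(root_path.glob("*/address")):
        slots[addr_file.parent.name] = parse_slot_address(addr_file.read_text())
    return slots


# Example use on a live system:
#   for name, (domain, bus, device) in list_slots().items():
#       print(name, domain, bus, device)
```

The directory name is the slot name the kernel chose; the address only identifies which bus and device position the slot covers.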

Slot numbering is messy because there are several sources of information, e.g., the Physical Slot Number in the Slot Capabilities register, SMBIOS table, ACPI _DSM methods, etc., and they are not all coordinated.  So the kernel goes to some trouble to come up with a unique "slot number" for each slot.
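
To illustrate just one of those sources, the sketch below pulls the Physical Slot Number out of the Slot Capabilities register in a raw config-space dump. It assumes the register layout from the PCI Express specification (bits 31:19 of Slot Capabilities, valid only when the Slot Implemented bit is set) and borrows constant names from the Linux pci_regs.h header; the function names are hypothetical. The other sources (SMBIOS, ACPI) are not covered, so the result may differ from the name shown under /sys/bus/pci/slots.

```python
import struct

PCI_STATUS = 0x06
PCI_STATUS_CAP_LIST = 0x10     # device has a capability list
PCI_CAPABILITY_LIST = 0x34     # offset of the first capability pointer
PCI_CAP_ID_EXP = 0x10          # PCI Express capability ID
PCI_EXP_FLAGS = 0x02           # PCI Express Capabilities register
PCI_EXP_FLAGS_SLOT = 0x0100    # Slot Implemented
PCI_EXP_SLTCAP = 0x14          # Slot Capabilities register


def find_capability(cfg: bytes, cap_id: int):
    """Walk the capability list and return the offset of cap_id, or None."""
    (status,) = struct.unpack_from("<H", cfg, PCI_STATUS)
    if not status & PCI_STATUS_CAP_LIST:
        return None
    pos = cfg[PCI_CAPABILITY_LIST] & 0xFC
    hops = 0
    while pos >= 0x40 and hops < 48:   # hop limit guards against loops
        if cfg[pos] == cap_id:
            return pos
        pos = cfg[pos + 1] & 0xFC
        hops += 1
    return None


def physical_slot_number(cfg: bytes):
    """Return the Physical Slot Number, or None if no slot is implemented."""
    if len(cfg) < 256:
        return None                    # need the full 256-byte dump
    pos = find_capability(cfg, PCI_CAP_ID_EXP)
    if pos is None or pos + PCI_EXP_SLTCAP + 4 > len(cfg):
        return None
    (flags,) = struct.unpack_from("<H", cfg, pos + PCI_EXP_FLAGS)
    if not flags & PCI_EXP_FLAGS_SLOT:
        return None
    (sltcap,) = struct.unpack_from("<I", cfg, pos + PCI_EXP_SLTCAP)
    return (sltcap >> 19) & 0x1FFF
```

Because the Slot Implemented bit applies to Root Ports and switch Downstream Ports, this would be run against the config space of the bridge above a device rather than against the endpoint itself.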

Bjorn


<<attachment: lspci_info_team1.zip>>

<<attachment: lspci_info_team2.zip>>

Attachment: slot_num.png
Description: slot_num.png

