Re: Potential regression/bug in net/mlx5 driver

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, Mar 29, 2023 at 6:20 PM Saeed Mahameed <saeed@xxxxxxxxxx> wrote:
> On 28 Mar 19:08, Paul Moore wrote:
> >Hello all,
> >
> >Starting with the v6.3-rcX kernel releases I noticed that my
> >InfiniBand devices were no longer present under /sys/class/infiniband,
> >causing some of my automated testing to fail.  It took me a while to
> >find the time to bisect the issue, but I eventually identified the
> >problematic commit:
> >
> >  commit fe998a3c77b9f989a30a2a01fb00d3729a6d53a4
> >  Author: Shay Drory <shayd@xxxxxxxxxx>
> >  Date:   Wed Jun 29 11:38:21 2022 +0300
> >
> >   net/mlx5: Enable management PF initialization
> >
> >   Enable initialization of DPU Management PF, which is a new loopback PF
> >   designed for communication with BMC.
> >   For now Management PF doesn't support nor require most upper layer
> >   protocols so avoid them.
> >
> >   Signed-off-by: Shay Drory <shayd@xxxxxxxxxx>
> >   Reviewed-by: Eran Ben Elisha <eranbe@xxxxxxxxxx>
> >   Reviewed-by: Moshe Shemesh <moshe@xxxxxxxxxx>
> >   Signed-off-by: Saeed Mahameed <saeedm@xxxxxxxxxx>
> >
> >I'm not a mlx5 driver expert so I can't really offer much in the way
> >of a fix, but as a quick test I did remove the
> >'mlx5_core_is_management_pf(...)' calls in mlx5/core/dev.c and
> >everything seemed to work okay on my test system (or rather the tests
> >ran without problem).
> >
> >If you need any additional information, or would like me to test a
> >patch, please let me know.
>
> Hi Paul,
>
> Our team is looking into this, the current theory is that you have an old
> FW that doesn't have the correct capabilities set.

That's very possible; I installed this card many years ago and haven't
updated the FW once.  I'm happy to update the FW (do you have a
pointer/how-to?), but it might be good to identify a fix first as I'm
guessing there will be others like me ...

> Can you please provide the FW version and the ConnectX device you are
> testing ?
>
> $ devlink dev info

% devlink dev info; echo $?
0

No output and no error code.  However, I do see the following in dmesg:

[  255.251124] mlx5_core 0000:00:08.0: mlx5_fw_version_query:823:(pid
959): fw query isn't supported by the FW

... which appears to support your theory about ancient hardware.

> $ lspci -s <pci_dev> -vv

While there is only one physical card, there are two PCI devices (it's
a dual port card).  I'm only copying the first device since I'm
guessing that's really all you need:

% lspci -s 00:07.0 -vv
00:07.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
       Subsystem: Mellanox Technologies Device 0010
       Physical Slot: 7
       Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
                Stepping- SERR+ FastB2B- DisINTx+
       Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
               <TAbort- <MAbort- >SERR- <PERR- INTx-
       Latency: 0, Cache Line Size: 64 bytes
       Interrupt: pin A routed to IRQ 11
       Region 0: Memory at fa000000 (64-bit, prefetchable) [size=32M]
       Expansion ROM at fe900000 [disabled] [size=1M]
       Capabilities: [60] Express (v2) Endpoint, MSI 00
               DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s
                       unlimited, L1 unlimited
                       ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
                       SlotPowerLimit 25W
               DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
                       RlxdOrd- ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
                       MaxPayload 256 bytes, MaxReadReq 512 bytes
               DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr-
                       TransPend-
               LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM not supported
                       ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
               LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
                       ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
               LnkSta: Speed 8GT/s, Width x8
                       TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
               DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP-
                        LTR- 10BitTagComp+ 10BitTagReq- OBFF Not Supported,
                        ExtFmt- EETLPPrefix- EmergencyPowerReduction
                        Not Supported, EmergencyPowerReductionInit-
                        FRS- TPHComp- ExtTPHComp-
               AtomicOpsCap: 32bit- 64bit- 128bitCAS-
               DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR-
                        10BitTagReq- OBFF Disabled,
               AtomicOpsCtl: ReqEn-
               LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+
                        EqualizationPhase1+ EqualizationPhase2+
                        EqualizationPhase3+ LinkEqualizationRequest-
                        Retimer- 2Retimers- CrosslinkRes: unsupported
       Capabilities: [48] Vital Product Data
               Product Name: CX454A - ConnectX-4 QSFP28
               Read-only fields:
                       [PN] Part number: MCX454A-FCAT
                       [EC] Engineering changes: AB
                       [SN] Serial number: MT1730X05081
                       [V0] Vendor specific: PCIeGen3 x8
                       [RV] Reserved: checksum good, 0 byte(s) reserved
               End
       Capabilities: [9c] MSI-X: Enable+ Count=64 Masked-
               Vector table: BAR=0 offset=00002000
               PBA: BAR=0 offset=00003000
       Capabilities: [c0] Vendor Specific Information: Len=18 <?>
       Capabilities: [40] Power Management version 3
               Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA
                      PME(D0-,D1-,D2-,D3hot-,D3cold+)
               Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
       Kernel driver in use: mlx5_core
       Kernel modules: mlx5_core

> since boot:
> $ dmesg

% devlink dev info
% dmesg | grep mlx5
[    4.739691] mlx5_core 0000:00:07.0: firmware version: 12.18.1000
[    4.740134] mlx5_core 0000:00:07.0: 63.008 Gb/s available PCIe
bandwidth (8.0GT/s PCIe x8 link)
[    7.048567] mlx5_core 0000:00:07.0: Port module event: module 0,
Cable plugged
[    7.211879] mlx5_core 0000:00:08.0: firmware version: 12.18.1000
[    7.212309] mlx5_core 0000:00:08.0: 63.008 Gb/s available PCIe
bandwidth (8.0GT/s PCIe x8 link)
[    7.897218] mlx5_core 0000:00:08.0: Port module event: module 1,
Cable plugged
[   10.875388] mlx5_core 0000:00:07.0 ibs7: renamed from ib0
[   10.995115] mlx5_core 0000:00:08.0 ibs8: renamed from ib0
[  181.471663] mlx5_core 0000:00:07.0: mlx5_fw_version_query:823:(pid
918): fw query isn't supported by the FW
[  181.472286] mlx5_core 0000:00:08.0: mlx5_fw_version_query:823:(pid
918): fw query isn't supported by the FW

-- 
paul-moore.com




[Index of Archives]     [Selinux Refpolicy]     [Linux SGX]     [Fedora Users]     [Fedora Desktop]     [Yosemite Photos]     [Yosemite Camping]     [Yosemite Campsites]     [KDE Users]     [Gnome Users]

  Powered by Linux