Re: NFS/RDMA RoCE with mlx4_en

Hi Jack-

Thanks for your help!


> On Jun 27, 2017, at 6:33 AM, jackm <jackm@xxxxxxxxxxxxxxxxxx> wrote:
> 
> On Mon, 26 Jun 2017 13:24:11 -0400
> Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
> 
>> Running various I/O stress workloads with iozone on an
>> NFSv3 mount using RDMA on RoCEv1 (FRWR).
>> 
> Hi Chuck, I have some questions to help us understand what is happening:
> 
> 1. What kernel are you running here?

v4.12-rc2


> 2. What is the underlying Linux distribution?

Oracle Linux 7.3


> 3. What FW is installed on the ConnectX-3 HCA?

2.40.7000


> 4. Is SRIOV enabled? (i.e., is there a line in a modprobe conf file:
>    options mlx4_core num_vfs=<integer greater than zero>

It was enabled in the BIOS, but all lines with "num_vfs=" in these
files are commented out. I disabled the BIOS setting, but saw no
change in behavior.
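
For anyone repeating this check, here is a minimal sketch of one way to
confirm that no uncommented "num_vfs=" option remains. It assumes the
options live in /etc/modprobe.d/*.conf; the helper name active_num_vfs
is made up for illustration and is not part of any tool.

```python
import glob


def active_num_vfs(conf_text):
    """Return num_vfs values found on uncommented 'options mlx4_core' lines."""
    values = []
    for line in conf_text.splitlines():
        stripped = line.strip()
        # Skip blank lines and lines that are commented out.
        if not stripped or stripped.startswith("#"):
            continue
        fields = stripped.split()
        # Only 'options mlx4_core ...' lines are of interest here.
        if len(fields) < 3 or fields[0] != "options" or fields[1] != "mlx4_core":
            continue
        for field in fields[2:]:
            if field.startswith("num_vfs="):
                values.append(field[len("num_vfs="):])
    return values


if __name__ == "__main__":
    # Assumed location of the modprobe configuration files.
    for path in sorted(glob.glob("/etc/modprobe.d/*.conf")):
        with open(path) as conf:
            found = active_num_vfs(conf.read())
        if found:
            print(path, "sets num_vfs =", ", ".join(found))
```

An empty result for every file means SR-IOV is not being requested
through these module options.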


> 5. Could you dump the card's .ini file and send it to us?
>   (flint -d <pci bus-dev-fn> dc connectx3.ini)

Attached. Let me know if it doesn't make it.


> 6. Is this a dual-port HCA? Are both ports connected?

Single port.


> 7. Could you try disabling automatic start of the mlx4 driver at
> boot time?
> 
> 8. After disabling automatic start at boot time, could you reboot the
> host to see if it has problems without the mlx4 driver stack?

I unset CONFIG_CMA. The cma_alloc errors go away, but the mlx4
timeout / reset is unchanged.


> 9. The mlx4 device was reset because a timeout was detected for the
>   DUMP_ETH_STATS command (0x49). The timeout for this command is 60
>   seconds.  Did the message log show anything at around 1 minute before
>   the timeout occurred?

Nothing probative. Lots of "NFS server: not responding".


> 10. Do you know which app is calling cma_alloc?  If you are willing to
> modify your kernel code temporarily for this, you might put a
> dump_stack() in file mm/cma.c at line 454 (where the cma_alloc failure
> line is output).

In the process of collecting data for you, I noticed that
the CX3's maximum Ethernet link speed is 40Gbps, and I
had set the switch port speed to 56Gbps. I've set the
port speed back to 40Gbps, and now neither the device
reset nor the cma_alloc failures are reproducing.

If you'd like to pursue this further, I can switch back to
the higher speed and try to reproduce to collect this
information.
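
A minimal sketch of how the host side of such a mismatch might be
spotted: it reads the negotiated speed (in Mb/s) from
/sys/class/net/<iface>/speed and compares it with the adapter's rated
maximum. The function names and the interface name "eth2" are
hypothetical, and I have not verified what the host reported while the
switch port was set to 56Gbps.

```python
from pathlib import Path


def link_speed_mbps(iface, sysfs_root="/sys/class/net"):
    """Return the negotiated link speed in Mb/s, or None if unknown."""
    try:
        text = (Path(sysfs_root) / iface / "speed").read_text().strip()
        speed = int(text)
    except (OSError, ValueError):
        # The file is missing, unreadable (e.g. link down), or not numeric.
        return None
    # The kernel reports -1 when the speed is unknown.
    return speed if speed > 0 else None


def exceeds_rated_speed(iface, rated_mbps, sysfs_root="/sys/class/net"):
    """True if the negotiated speed is above the adapter's rated maximum."""
    speed = link_speed_mbps(iface, sysfs_root)
    return speed is not None and speed > rated_mbps


if __name__ == "__main__":
    # "eth2" is a placeholder; 40000 Mb/s is the 40Gbps limit noted above.
    if exceeds_rated_speed("eth2", 40000):
        print("link speed is above the adapter's rated Ethernet maximum")
```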


> Thanks, Chuck -- any help you can give us here will be greatly
> appreciated.
> 
> -Jack

--
Chuck Lever


Attachment: connectx3.ini
Description: Binary data

