Hi Jack-

Thanks for your help!


> On Jun 27, 2017, at 6:33 AM, jackm <jackm@xxxxxxxxxxxxxxxxxx> wrote:
>
> On Mon, 26 Jun 2017 13:24:11 -0400
> Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
>
>> Running various I/O stress workloads with iozone on an
>> NFSv3 mount using RDMA on RoCEv1 (FRWR).
>>
> Hi Chuck, I have some questions to help us understand what is happening:
>
> 1. What kernel are you running here?

v4.12-rc2

> 2. What is the underlying Linux distribution?

Oracle Linux 7.3

> 3. What FW is installed on the ConnectX-3 HCA?

2.40.7000

> 4. Is SRIOV enabled? (i.e., is there a line in a modprobe conf file:
> options mlx4_core num_vfs=<integer greater than zero>

It was enabled in the BIOS, but all lines with "num_vfs=" in
these files are commented out. I disabled the BIOS setting,
but no change in behavior.

> 5. Could you dump the card's .ini file and sent it to us?
> (flint dc -d <pci bus-dev-fn> dc connectx3.ini)

Attached. Let me know if it doesn't make it.

> 6. Is this a dual-port HCA, Are both ports connected?

Single port.

> 7. Could you try disabling the mlx4 driver automatic driver start at
> boot time?
>
> 8. After disabling automatic start at boot time, could you reboot the
> host to see if it has problems without the mlx4 driver stack?

I unset CONFIG_CMA. The cma_alloc errors go away, but the
mlx4 timeout / reset is unchanged.

> 9. The mlx4 device was reset because a timeout was detected for the
> DUMP_ETH_STATS command (0x49). The timeout for this command is 60
> seconds. Did the message log show anything at around 1 minute before
> the timeout occurred?

Nothing probative. Lots of "NFS server: not responding".

> 10. Do you know which app is calling cma_alloc? If you are willing to
> modify your kernel code temporarily for this, you might put a
> stack_dump() in file mm/cma.c at line 454 (where the cma_alloc failure
> line is output).

In the process of collecting data for you, I noticed that the
CX3's maximum Ethernet link speed is 40Gbps, and I had set the
switch port speed to 56Gbps. I've set the port speed back to
40Gbps, and now neither the device reset nor the cma_alloc
failures are reproducing.

If you'd like to pursue this further, I can switch back to the
higher speed and try to reproduce to collect this information.

> Thanks, Chuck -- any help you can give us here will be greatly
> appreciated.
>
> -Jack

--
Chuck Lever
Attachment:
connectx3.ini
Description: Binary data