NFS/RDMA RoCE with mlx4_en

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Running various I/O stress workloads with iozone on an
NFSv3 mount using RDMA on RoCEv1 (FRWR).

Jun 26 12:50:21 morisot kernel: mlx4_core 0000:01:00.0: command 0x49 timed out (go bit not cleared)
Jun 26 12:50:21 morisot kernel: mlx4_core 0000:01:00.0: device is going to be reset
Jun 26 12:50:22 morisot kernel: mlx4_core 0000:01:00.0: device was reset successfully
Jun 26 12:50:22 morisot kernel: mlx4_en 0000:01:00.0: Internal error detected, restarting device
Jun 26 12:50:22 morisot kernel: <mlx4_ib> mlx4_ib_handle_catas_error: mlx4_ib_handle_catas_error was started
Jun 26 12:50:22 morisot kernel: <mlx4_ib> mlx4_ib_handle_catas_error: mlx4_ib_handle_catas_error ended
Jun 26 12:50:22 morisot kernel: ib_srpt received unrecognized IB event 8
Jun 26 12:50:22 morisot kernel: mlx4_core 0000:01:00.0: command 0x1e failed: fw status = 0x1

Jun 26 12:50:32 morisot kernel: cma: cma_alloc: alloc failed, req-size: 17 pages, ret: -12
Jun 26 12:50:32 morisot kernel: cma: cma_alloc: alloc failed, req-size: 1 pages, ret: -12
Jun 26 12:50:32 morisot kernel: mlx4_en: 0000:01:00.0: Port 1: Using 32 TX rings
Jun 26 12:50:32 morisot kernel: mlx4_en: 0000:01:00.0: Port 1: Using 4 RX rings
Jun 26 12:50:32 morisot kernel: mlx4_en: 0000:01:00.0: Port 1: Initializing port
Jun 26 12:50:32 morisot kernel: mlx4_en 0000:01:00.0: registered PHC clock
Jun 26 12:50:32 morisot kernel: <mlx4_ib> mlx4_ib_add: counter index 1 for port 1 allocated 1
Jun 26 12:50:32 morisot NetworkManager[810]: <info>  [1498495832.7797] manager: (eth0): new Ethernet device (/org/freedesktop/NetworkManager/Devices/3)
Jun 26 12:50:32 morisot kernel: mlx4_core 0000:01:00.0 enp1s0: renamed from eth0
Jun 26 12:50:32 morisot kernel: cma: cma_alloc: alloc failed, req-size: 1 pages, ret: -12
Jun 26 12:50:32 morisot kernel: cma: cma_alloc: alloc failed, req-size: 1 pages, ret: -12
Jun 26 12:50:32 morisot kernel: cma: cma_alloc: alloc failed, req-size: 1 pages, ret: -12
Jun 26 12:50:32 morisot kernel: cma: cma_alloc: alloc failed, req-size: 1 pages, ret: -12
Jun 26 12:50:32 morisot kernel: cma: cma_alloc: alloc failed, req-size: 1 pages, ret: -12
Jun 26 12:50:32 morisot kernel: cma: cma_alloc: alloc failed, req-size: 1 pages, ret: -12
Jun 26 12:50:32 morisot kernel: cma: cma_alloc: alloc failed, req-size: 1 pages, ret: -12
Jun 26 12:50:32 morisot kernel: cma: cma_alloc: alloc failed, req-size: 1 pages, ret: -12
Jun 26 12:50:32 morisot kernel: cma: cma_alloc: alloc failed, req-size: 1 pages, ret: -12
Jun 26 12:50:32 morisot kernel: cma: cma_alloc: alloc failed, req-size: 1 pages, ret: -12
Jun 26 12:50:32 morisot kernel: cma: cma_alloc: alloc failed, req-size: 1 pages, ret: -12
Jun 26 12:50:32 morisot kernel: cma: cma_alloc: alloc failed, req-size: 1 pages, ret: -12
Jun 26 12:50:32 morisot kernel: cma: cma_alloc: alloc failed, req-size: 1 pages, ret: -12
Jun 26 12:50:32 morisot kernel: cma: cma_alloc: alloc failed, req-size: 1 pages, ret: -12
Jun 26 12:50:32 morisot kernel: cma: cma_alloc: alloc failed, req-size: 1 pages, ret: -12
Jun 26 12:50:32 morisot kernel: cma: cma_alloc: alloc failed, req-size: 1 pages, ret: -12
Jun 26 12:50:32 morisot NetworkManager[810]: <info>  [1498495832.7962] device (eth0): interface index 4 renamed iface from 'eth0' to 'enp1s0'
Jun 26 12:50:32 morisot NetworkManager[810]: <info>  [1498495832.7971] device (enp1s0): state change: unmanaged -> unavailable (reason 'managed') [10 20 2]
Jun 26 12:50:32 morisot kernel: cma: cma_alloc: alloc failed, req-size: 20 pages, ret: -12
Jun 26 12:50:32 morisot kernel: cma: cma_alloc: alloc failed, req-size: 24 pages, ret: -12
Jun 26 12:50:32 morisot kernel: IPv6: ADDRCONF(NETDEV_UP): enp1s0: link is not ready
Jun 26 12:50:32 morisot kernel: cma: cma_alloc: alloc failed, req-size: 1 pages, ret: -12
Jun 26 12:50:32 morisot kernel: cma: cma_alloc: alloc failed, req-size: 1 pages, ret: -12
Jun 26 12:50:32 morisot kernel: cma: cma_alloc: alloc failed, req-size: 1 pages, ret: -12

Within a few more moments, the system became unreachable and
had to be restarted via remote power-on reset.

The cma_alloc failures start at system boot time and continue
whenever the mlx4_en device is used. Those don't seem like
unreasonably sized requests.

I tried reproducing this immediately. Similar result:

Jun 26 13:16:47 morisot kernel: nfs: server klimt-roce not responding, still trying
Jun 26 13:16:47 morisot kernel: nfs: server klimt-roce not responding, still trying
Jun 26 13:16:47 morisot kernel: nfs: server klimt-roce not responding, still trying
Jun 26 13:16:47 morisot kernel: nfs: server klimt-roce not responding, still trying
Jun 26 13:16:51 morisot kernel: mlx4_core 0000:01:00.0: command 0x49 timed out (go bit not cleared)
Jun 26 13:16:51 morisot kernel: mlx4_core 0000:01:00.0: device is going to be reset
Jun 26 13:16:52 morisot kernel: mlx4_core 0000:01:00.0: device was reset successfully
Jun 26 13:16:52 morisot kernel: mlx4_en 0000:01:00.0: Internal error detected, restarting device
Jun 26 13:16:52 morisot kernel: <mlx4_ib> mlx4_ib_handle_catas_error: mlx4_ib_handle_catas_error was started
Jun 26 13:16:52 morisot kernel: <mlx4_ib> mlx4_ib_handle_catas_error: mlx4_ib_handle_catas_error ended
Jun 26 13:16:52 morisot kernel: ib_srpt received unrecognized IB event 8
Jun 26 13:16:52 morisot kernel: mlx4_core 0000:01:00.0: command 0x1e failed: fw status = 0x1
Jun 26 13:16:52 morisot kernel: cma: cma_alloc: alloc failed, req-size: 129 pages, ret: -12
Jun 26 13:16:55 morisot kernel: mlx4_core 0000:01:00.0: Internal error mark was detected on device

Jun 26 13:16:55 morisot nm-dispatcher: req:1 'down' [enp1s0]: new request (4 scripts)
Jun 26 13:16:55 morisot nm-dispatcher: req:1 'down' [enp1s0]: start running ordered scripts...
Jun 26 13:16:55 morisot kernel: mlx4_en 0000:01:00.0: removed PHC
Jun 26 13:16:55 morisot kernel: rpcrdma: removing device mlx4_0 for 192.168.3.5:20049
Jun 26 13:16:55 morisot kernel: mlx4_core 0000:01:00.0: Fail to set mac in port 1 during unregister
Jun 26 13:16:58 morisot kernel: alloc_contig_range: [41e417, 41e418) PFNs busy
Jun 26 13:16:58 morisot kernel: alloc_contig_range: [41e418, 41e419) PFNs busy
Jun 26 13:17:00 morisot kernel: rpcrdma_ep_recreate_xprt: r_xprt = ffff8804090f8000
Jun 26 13:17:01 morisot kernel: alloc_contig_range: [41e417, 41e418) PFNs busy
Jun 26 13:17:01 morisot kernel: alloc_contig_range: [41e418, 41e419) PFNs busy
Jun 26 13:17:01 morisot kernel: alloc_contig_range: [41e4c0, 41e500) PFNs busy
Jun 26 13:17:01 morisot kernel: mlx4_core 0000:01:00.0: PCIe link speed is 8.0GT/s, device supports 8.0GT/s
Jun 26 13:17:01 morisot kernel: mlx4_core 0000:01:00.0: PCIe link width is x8, device supports x8
Jun 26 13:17:01 morisot kernel: alloc_contig_range: [41e417, 41e418) PFNs busy
Jun 26 13:17:01 morisot kernel: alloc_contig_range: [41e418, 41e419) PFNs busy
Jun 26 13:17:01 morisot kernel: alloc_contig_range: [41e417, 41e418) PFNs busy
Jun 26 13:17:01 morisot kernel: alloc_contig_range: [41e418, 41e419) PFNs busy

Jun 26 13:17:02 morisot kernel: cma: cma_alloc: alloc failed, req-size: 1 pages, ret: -12
Jun 26 13:17:02 morisot kernel: cma: cma_alloc: alloc failed, req-size: 1 pages, ret: -12
Jun 26 13:17:02 morisot kernel: cma: cma_alloc: alloc failed, req-size: 1 pages, ret: -12
Jun 26 13:17:02 morisot kernel: cma: cma_alloc: alloc failed, req-size: 1 pages, ret: -12
Jun 26 13:17:02 morisot kernel: mlx4_en 0000:01:00.0: Activating port:1
Jun 26 13:17:02 morisot kernel: cma: cma_alloc: alloc failed, req-size: 1 pages, ret: -12
Jun 26 13:17:02 morisot kernel: cma: cma_alloc: alloc failed, req-size: 8 pages, ret: -12
Jun 26 13:17:02 morisot kernel: cma: cma_alloc: alloc failed, req-size: 8 pages, ret: -12
Jun 26 13:17:02 morisot kernel: cma: cma_alloc: alloc failed, req-size: 8 pages, ret: -12
Jun 26 13:17:02 morisot kernel: cma: cma_alloc: alloc failed, req-size: 8 pages, ret: -12


--
Chuck Lever



--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



[Index of Archives]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Photo]     [Yosemite News]     [Yosemite Photos]     [Linux Kernel]     [Linux SCSI]     [XFree86]
  Powered by Linux