If the rxe driver generates packets faster than the IP layer can
process them, the IP layer starts dropping packets and ip_local_out()
returns NET_XMIT_DROP. The requester side of the driver detects this
and retries the packet. The responder side does not, so when a read
response packet is dropped the requester recovers only after a retry
timer delay, by resubmitting the read operation starting from the last
received packet. This can and does occur for the large read responses
generated by multi-MB RDMA reads, and it causes a steep drop-off in
performance.

This patch modifies read_reply() in rxe_resp.c to retry the send if
err == -EAGAIN. When IP does drop a packet it needs more time to
recover than a simple retry takes, so a subroutine read_retry_delay()
is added that dynamically estimates the time required for this
recovery and inserts a delay before the retry. (A rough sketch of the
mechanism follows the benchmark results below.)

With this patch applied the performance of large reads is very
stable. For example, with a 1 Gb/sec (112.5 MB/sec) Ethernet link
between two systems, without this patch ib_read_bw shows the
following performance:

                        RDMA_Read BW Test
 Dual-port       : OFF          Device          : rxe0
 Number of qps   : 1            Transport type  : IB
 Connection type : RC           Using SRQ       : OFF
 PCIe relax order: ON           ibv_wr* API     : OFF
 TX depth        : 128
 CQ Moderation   : 100
 Mtu             : 1024[B]
 Link type       : Ethernet
 GID index       : 2
 Outstand reads  : 128
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
 <snip>
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]    MsgRate[Mpps]
 2          1000           0.56               0.56                  0.294363
 4          1000           0.66               0.66                  0.173862
 8          1000           1.32               1.20                  0.157406
 16         1000           2.66               2.40                  0.157357
 32         1000           5.54               5.46                  0.179006
 64         1000           18.22              16.94                 0.277533
 128        1000           21.61              20.91                 0.171322
 256        1000           44.02              38.90                 0.159316
 512        1000           70.39              64.86                 0.132843
 1024       1000           106.50             100.49                0.102904
 2048       1000           106.46             105.29                0.053908
 4096       1000           107.85             107.85                0.027609
 8192       1000           109.09             109.09                0.013963
 16384      1000           110.17             110.17                0.007051
 32768      1000           110.27             110.27                0.003529
 65536      1000           110.33             110.33                0.001765
 131072     1000           110.35             110.35                0.000883
 262144     1000           110.36             110.36                0.000441
 524288     1000           110.37             110.36                0.000221
 1048576    1000           110.37             110.37                0.000110
 2097152    1000           24.19              24.10                 0.000012
 4194304    1000           18.70              18.65                 0.000005
 8388608    1000           18.09              17.82                 0.000002

No NET_XMIT_DROP returns are seen for message sizes up to 1 MiB, but
at 2 MiB and above they occur constantly.

With the patch applied ib_read_bw shows the following performance:

                        RDMA_Read BW Test
 Dual-port       : OFF          Device          : rxe0
 Number of qps   : 1            Transport type  : IB
 Connection type : RC           Using SRQ       : OFF
 PCIe relax order: ON           ibv_wr* API     : OFF
 TX depth        : 128
 CQ Moderation   : 100
 Mtu             : 1024[B]
 Link type       : Ethernet
 GID index       : 2
 Outstand reads  : 128
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
 <snip>
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]    MsgRate[Mpps]
 2          1000           0.34               0.33                  0.175541
 4          1000           0.69               0.68                  0.179279
 8          1000           2.02               1.75                  0.229972
 16         1000           2.72               2.63                  0.172632
 32         1000           5.42               4.94                  0.161824
 64         1000           10.63              9.67                  0.158487
 128        1000           31.06              28.11                 0.230288
 256        1000           40.48              36.75                 0.150543
 512        1000           70.00              66.00                 0.135164
 1024       1000           94.43              89.26                 0.091402
 2048       1000           106.38             104.34                0.053424
 4096       1000           109.48             109.16                0.027946
 8192       1000           108.96             108.96                0.013946
 16384      1000           110.18             110.18                0.007052
 32768      1000           110.28             110.28                0.003529
 65536      1000           110.33             110.33                0.001765
 131072     1000           110.35             110.35                0.000883
 262144     1000           110.36             110.35                0.000441
 524288     1000           110.35             110.31                0.000221
 1048576    1000           110.37             110.37                0.000110
 2097152    1000           110.37             110.37                0.000055
 4194304    1000           110.37             110.36                0.000028
 8388608    1000           110.37             110.37                0.000014

The delay algorithm computes approximately 50 usecs as the correct
delay to insert before retrying a read_reply() send.
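For reference, a minimal sketch of the shape of the change is shown
below. This is illustrative only, not the patch itself:
rxe_xmit_packet() is the existing rxe send path (it returns -EAGAIN
when ip_local_out() reports NET_XMIT_DROP), but the helper names, the
read_retry_delay_us field on struct rxe_qp, and the grow/decay
estimator are assumptions made for this example, and the sketch
glosses over rebuilding the response packet for each attempt.

#include <linux/delay.h>	/* udelay() */
#include <linux/minmax.h>	/* min()/max() */
#include "rxe.h"
#include "rxe_loc.h"		/* rxe_xmit_packet() */

/* Hypothetical estimator: qp->read_retry_delay_us (an imagined new
 * field, added in rxe_verbs.h) holds a running estimate, in
 * microseconds, of how long the IP layer needs to recover after
 * dropping a packet. Grow it when a delayed retry still fails and
 * decay it slightly when a delayed retry succeeds, so the delay
 * tracks the link's actual recovery time (about 50 usecs on the
 * 1 Gb/sec setup measured above).
 */
static void update_read_retry_delay(struct rxe_qp *qp, bool still_failing)
{
	unsigned int d = max(qp->read_retry_delay_us, 1U);

	if (still_failing)
		d = min(2 * d, 1000U);	/* back off, capped at 1 ms */
	else
		d -= d >> 3;		/* decay by ~12% on success */

	qp->read_retry_delay_us = d;
}

/* Retry sending one read response packet instead of letting the drop
 * escalate to a full requester-side retry. In the real driver this
 * logic belongs in read_reply() in rxe_resp.c, which would rebuild
 * the skb for each attempt; the rebuild is omitted here for brevity.
 */
static int send_read_reply_pkt(struct rxe_qp *qp, struct rxe_pkt_info *pkt,
			       struct sk_buff *skb)
{
	bool delayed = false;
	int err;

	for (;;) {
		err = rxe_xmit_packet(qp, pkt, skb);
		if (err != -EAGAIN)
			break;		/* sent, or a hard error */

		/* The IP layer dropped the packet. An immediate
		 * retry usually hits the same congested queue, so
		 * wait the estimated recovery time first. udelay()
		 * is used because the responder may run in a
		 * context that cannot sleep.
		 */
		if (delayed)
			update_read_retry_delay(qp, true);
		udelay(max(qp->read_retry_delay_us, 1U));
		delayed = true;
	}

	if (delayed && !err)
		update_read_retry_delay(qp, false);

	return err;
}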
Bob Pearson (1):
  RDMA/rxe: Allow retry sends for rdma read responses

 drivers/infiniband/sw/rxe/rxe_resp.c  | 62 +++++++++++++++++++++++++--
 drivers/infiniband/sw/rxe/rxe_verbs.h |  9 ++++
 2 files changed, 68 insertions(+), 3 deletions(-)

base-commit: 91d088a0304941b88c915cc800617ff4068cdd39
--
2.37.2