Hello, below is a bug we hit when transferring CUDA data over InfiniBand.
- Linux distribution and version: 2 workers with 80Gi memory and 2 GPUs
- Linux kernel and version: Linux n176-081-094 5.4.143.bsk.7-amd64 #5.4.143.bsk.7 SMP Debian 5.4.143.bsk.7 Mon Jul 4 02:44:16 UTC 2 x86_64 GNU/Linux
- InfiniBand hardware and firmware version: 22.36.1010
How to reproduce the bug:
- Prepare two nodes
- Download the code and run `mkdir build && cd build && cmake .. -G Ninja`
- Set the receiver's IP address at `cuda_sender.cpp:51`
- Run `ninja`
- Run `./cuda_receiver` on the receiving machine
- Run `./cuda_sender` on the sending machine
The problem arises when I set the opcode of `ibv_send_wr` to `IBV_WR_SEND` and the payload is small (fewer than 9 float32 values, i.e. at most 32 bytes). The sender appears to send the data successfully, but the receiver hits a segmentation fault when calling `ibv_poll_cq`.
The problem goes away when `IBV_WR_SEND` is replaced by `IBV_WR_RDMA_WRITE_WITH_IMM` in ibv.h (with the remote address and remote key provided).
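To make the difference between the two opcodes concrete, here is a minimal sketch of how the two work requests differ. It is not the code from the attached ibv.h: the helper names `make_send_wr` and `make_write_imm_wr`, the `wr_id` value, and the signaled flag are assumptions chosen for illustration. The stand-in definitions in the `#else` branch only exist so the sketch compiles on a machine without libibverbs headers.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <arpa/inet.h>  // htonl / ntohl

#if __has_include(<infiniband/verbs.h>)
#include <infiniband/verbs.h>
#else
// Stand-ins mirroring only the libibverbs fields used below, so the sketch
// compiles without rdma-core. Real code must include <infiniband/verbs.h>.
enum ibv_wr_opcode { IBV_WR_RDMA_WRITE, IBV_WR_RDMA_WRITE_WITH_IMM, IBV_WR_SEND };
enum ibv_send_flags { IBV_SEND_SIGNALED = 1 << 1 };
struct ibv_sge { uint64_t addr; uint32_t length; uint32_t lkey; };
struct ibv_send_wr {
    uint64_t wr_id;
    ibv_send_wr *next;
    ibv_sge *sg_list;
    int num_sge;
    ibv_wr_opcode opcode;
    unsigned int send_flags;
    uint32_t imm_data;
    struct { struct { uint64_t remote_addr; uint32_t rkey; } rdma; } wr;
};
#endif

// Hypothetical helper: two-sided send. The payload lands in whatever
// receive buffer the receiver posted beforehand.
ibv_send_wr make_send_wr(ibv_sge *sge) {
    assert(sge != nullptr && sge->length > 0);
    ibv_send_wr wr;
    std::memset(&wr, 0, sizeof(wr));
    wr.wr_id = 1;                       // arbitrary id, assumption
    wr.sg_list = sge;
    wr.num_sge = 1;
    wr.opcode = IBV_WR_SEND;
    wr.send_flags = IBV_SEND_SIGNALED;  // request a completion on the sender
    return wr;
}

// Hypothetical helper: RDMA write with immediate. The sender names the
// destination explicitly (remote_addr + rkey); the receiver still needs a
// posted receive to get the completion that carries the immediate value.
ibv_send_wr make_write_imm_wr(ibv_sge *sge, uint64_t remote_addr,
                              uint32_t rkey, uint32_t imm) {
    ibv_send_wr wr = make_send_wr(sge);
    wr.opcode = IBV_WR_RDMA_WRITE_WITH_IMM;
    wr.imm_data = htonl(imm);           // immediate data is in network byte order
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey = rkey;
    return wr;
}
```

In both cases the resulting request would be handed to `ibv_post_send` on the sender's queue pair. The only differences are the opcode and the extra `imm_data`, `remote_addr`, and `rkey` fields.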
Since I'm not familiar with the behavior of each opcode in `ibv_send_wr`, could you tell me whether the problem described above is a bug or expected behavior?
The code to reproduce the problem is attached.
Thank you.
Attachments:
- cuda_sender.cpp
- ibv.h
- cuda_receiver.cpp
- CMakeLists.txt