On 2/9/22 1:41 AM, Chaitanya Kulkarni wrote:
> On 2/8/22 6:50 PM, Martin Oliveira wrote:
> > Hello,
> >
> > We have been hitting an error when running IO over our nvme-of setup, using the mlx5 driver and we are wondering if anyone has seen anything similar/has any suggestions.
> >
> > Both initiator and target are AMD EPYC 7502 machines connected over RDMA using a Mellanox MT28908. Target has 12 NVMe SSDs which are exposed as a single NVMe fabrics device, one physical SSD per namespace.
> >
>
> Thanks for reporting this, if you can bisect the problem on your setup
> it will help others to help you better.
>
> -ck

Hi Chaitanya,

I went back to a kernel as old as 4.15 and the problem was still there, so I don't know of a good commit to start from.

I also learned that I can reproduce this with as few as 3 cards, and I updated the firmware on the Mellanox cards to the latest version.

I'd be happy to try any tests if someone has any suggestions.

Thanks,
Martin