On Thu, Jul 06, 2023 at 12:48:46PM +0000, Olaf.Krzikalla@xxxxxx wrote:
> Hi @all,
>
> creating connections via create_qp fails on our cluster for rather small
> numbers of processes (128 is working, 256 not) due to an out-of-memory
> error. I've tracked down the issue to an mlx5_alloc_buf call, which
> allocates ~500kB per call, which seems to be a lot.
>
> heaptrack tells me the following:
>
> 34.47M peak memory consumed over 92 calls from
> mlx5_alloc_buf
> in /usr/lib64/libibverbs/libmlx5-rdmav34.so
> 8.65M consumed over 16 calls from:
> create_qp
> in /usr/lib64/libibverbs/libmlx5-rdmav34.so
> mlx5_create_qp
> in /usr/lib64/libibverbs/libmlx5-rdmav34.so
> .
>
> Can anyone help me understand what causes a 500kB allocation in
> create_qp? Maybe it is some sort of configuration issue, which I can
> handle somehow.
>
> Thanks for help and best regards
> Olaf Krzikalla
>
>
> System information:
> CentOS Linux 7 (Core)
> Linux 3.10.0-1160.88.1.el7.x86_64

Please contact your Nvidia support representative; you are asking about
a distro kernel, not upstream Linux.

Thanks

> CA 'mlx5_0'
> CA type: MT4123
> Number of ports: 1
> Firmware version: 20.33.1048
> Hardware version: 0