Re: Large ib_srp configuration cannot allocate FMR pool space and fails to map SCSI hosts

On 07/03/2017 03:00 PM, Laurence Oberman wrote:


On 07/02/2017 04:31 AM, Sagi Grimberg wrote:
Hello

This issue is apparent on RHEL and on upstream 4.12-rc5 (that is what was tested).

The customer has a large configuration that I cannot reproduce, so I am asking whether anybody else is aware of this.

We fail FMR pool allocation and then keep incrementing the SCSI host numbers:

Jun 20 14:01:42 xxxxxx kernel: fmr_pool: fmr_create failed for FMR 3809
Jun 20 14:01:42 xxxxxx kernel: fmr_pool: fmr_create failed for FMR 2005
Jun 20 14:01:42 xxxxxx kernel: scsi host7: ib_srp: FMR pool allocation failed (-12)
Jun 20 14:01:42 xxxxxx kernel: scsi host8: ib_srp: FMR pool allocation failed (-12)

This repeats over and over.

* 7 pairs of enclosures, each with two controllers
* Each with an expansion tray
* Each side of the controller goes to its own IB switch
* Each node is connected to both switches, each switch running its own subnet manager

Prior versions of RHEL (i.e. 7.2) that don't have the newer code that also exists upstream are working.
Both RHEL7.3 and upstream are affected.

The configuration in place is:

For srp.conf:
a max_sect=65535,max_cmd_per_lun=254,queue_size=254

For ib_srp:
options ib_srp cmd_sg_entries=255 indirect_sg_entries=2048 prefer_fr=N
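
These land in two places; the exact file paths below are my assumption for illustration, not taken from the customer's setup:

# Hypothetical /etc/modprobe.d/ib_srp.conf (path is an assumption):
options ib_srp cmd_sg_entries=255 indirect_sg_entries=2048 prefer_fr=N

# Hypothetical srp_daemon config, e.g. /etc/srp_daemon.conf (path is an
# assumption). "a" is the allow rule; with no match criteria it applies
# to every discovered target:
a max_sect=65535,max_cmd_per_lun=254,queue_size=254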

How many FMRs did you end up allocating? What device is this?

Hello Sagi

I had them change srp.conf from:
a max_sect=65535,max_cmd_per_lun=254,queue_size=254

to:
a max_sect=4096,max_cmd_per_lun=64,queue_size=128

Now we can probe all rports, and we no longer see the disconnects and reconnects or the FMR allocation failures.

I suspect the changes made to the FMR code upstream, which we of course pulled into RHEL 7.3, exceed the available allocations with the large rport count this customer has: a larger max_sect means more pages mapped per command, and multiplied across the queue depth and 14 remote ports that exhausts the HCA's memory-registration resources.
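
A rough back-of-envelope, assuming 512-byte sectors and 4 KiB MR pages (the exact per-command FMR usage depends on driver internals, so treat the numbers only as an order-of-magnitude sketch):

# Pages one maximally-sized command needs mapped:
echo $(( 65535 * 512 / 4096 ))   # max_sect=65535 -> ~8191 pages per command
echo $((  4096 * 512 / 4096 ))   # max_sect=4096  ->   512 pages per command
# Multiply by queue_size per channel, the channel count, and 14+ rports,
# and the larger setting runs the HCA out of registration resources first.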

My test bed is back-to-back, with two mlx5 ports on the server and two on the client, and I have been running with
a max_sect=65535,max_cmd_per_lun=254,queue_size=254
for a long time now.

So is this simply a tuning issue? I am sure Red Hat, Bart, and Mellanox have never tested the settings above with a 14-port remote array configuration.

Quite honestly,
a max_sect=4096,max_cmd_per_lun=64,queue_size=128
is very reasonable.

I think I will have them increase max_sect to 16384; we should still have enough space for allocations.
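
Something along these lines is what I have in mind (keeping the other two parameters at the reduced values is my assumption, not something we have tested yet):

a max_sect=16384,max_cmd_per_lun=64,queue_size=128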

If you need more specifics, such as the actual FMR count as it is currently configured, let me know.
Simply let me have a list of sysfs variables you want captured.
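
For example, I could run something like the following and send the output (attribute names are from memory, so treat this as a sketch and correct me if you want different ones):

# Per-SCSI-host SRP parameters as currently negotiated:
grep -H . /sys/class/scsi_host/host*/cmd_sg_entries 2>/dev/null
# HCA memory-registration limits reported by the verbs layer:
ibv_devinfo -v | grep -iE 'max_mr|max_fmr|max_map_per_fmr'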

Regards
Laurence


I did not mention the device in my last update, but I had given it before.
This is an mlx4 configuration (ConnectX-3).

I expect the same would happen with mlx5 (ConnectX-4) at this rport count, and it would need to be tuned appropriately.

Thanks
Laurence


