On Sun, 2019-04-28 at 15:39 +0800, Ming Lei wrote:
> Now scsi_mq_setup_tags() pre-allocates a big buffer for the IO sg list,
> and the buffer size is scsi_mq_sgl_size(), which depends on the smaller
> value between shost->sg_tablesize and SG_CHUNK_SIZE.
>
> Modern HBA's DMA is often capable of dealing with a very big segment
> number, so scsi_mq_sgl_size() is often big. Suppose the max sg number
> of SG_CHUNK_SIZE is taken, scsi_mq_sgl_size() will be 4KB.
>
> Then if one HBA has lots of queues, and each hw queue's depth is
> high, pre-allocation for the sg list can consume huge memory.
> For example, with lpfc, nr_hw_queues can be 70 and each queue's depth
> can be 3781, so the pre-allocation for the data sg list is 70*3781*2k
> = 517MB for a single HBA.
>
> There is a Red Hat internal report that scsi_debug based tests can't
> be run any more since the legacy io path was killed, because of the
> too big pre-allocation.
>
> So switch to runtime allocation for the sg list, and meanwhile
> pre-allocate 2 inline sg entries. This approach has been applied to
> NVMe PCI for a while, so it should be fine for SCSI too. Also, runtime
> sg entries allocation has been verified and was always used in the
> original legacy io path.
>
> No performance effect was seen in my big BS test on scsi_debug.

Reviewed-by: Bart Van Assche <bvanassche@acm.org>
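
For reference, the 517MB figure quoted above can be checked with a short
sketch. The helper names below are hypothetical and are not kernel
functions; the inputs (70 queues, depth 3781, 2k per command) are the ones
given in the commit message.

```c
#include <stdint.h>

/* Bytes in one MiB, so results use the same unit as the commit message. */
#define BYTES_PER_MIB (1024ULL * 1024ULL)

/*
 * Hypothetical helper: total bytes reserved up front when every command
 * of every hardware queue gets its own sg buffer of sgl_bytes.
 */
static uint64_t sg_prealloc_bytes(uint64_t nr_hw_queues,
				  uint64_t queue_depth,
				  uint64_t sgl_bytes)
{
	return nr_hw_queues * queue_depth * sgl_bytes;
}

/* Hypothetical helper: round a byte count to the nearest MiB. */
static uint64_t bytes_to_mib_rounded(uint64_t bytes)
{
	return (bytes + BYTES_PER_MIB / 2) / BYTES_PER_MIB;
}
```

Calling sg_prealloc_bytes(70, 3781, 2048) gives 542044160 bytes, which
rounds to 517 MiB. Assuming a 32-byte struct scatterlist (an assumption
about a 64-bit build, not stated in the message), two inline entries cost
64 bytes per command, and the same call with 64 instead of 2048 gives
roughly 16 MiB.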