Re: [PATCH V4 3/3] scsi: core: avoid to pre-allocate big chunk for sg list

Bart Van Assche <bvanassche@xxxxxxx> · Mon, 29 Apr 2019 11:16:38 -0700



On Sun, 2019-04-28 at 15:39 +0800, Ming Lei wrote:
> Now scsi_mq_setup_tags() pre-allocates a big buffer for the IO sg list,
> and the buffer size is scsi_mq_sgl_size(), which depends on the smaller
> of shost->sg_tablesize and SG_CHUNK_SIZE.
>
> Modern HBA's DMA is often capable of dealing with a very large number
> of segments, so scsi_mq_sgl_size() is often big. Supposing the max sg
> number of SG_CHUNK_SIZE is taken, scsi_mq_sgl_size() will be 4KB.
>
> Then if one HBA has lots of queues, and each hw queue's depth is
> high, pre-allocation for the sg list can consume a huge amount of memory.
> For example, with lpfc, nr_hw_queues can be 70 and each queue's depth
> can be 3781, so the pre-allocation for the data sg list is 70*3781*2k
> = 517MB for a single HBA.
>
> There is a Red Hat internal report that scsi_debug based tests can't
> be run any more since the legacy io path was killed, because the
> pre-allocation is too big.
>
> So switch to runtime allocation for the sg list, and meanwhile
> pre-allocate 2 inline sg entries. This approach has been used by NVMe PCI
> for a while, so it should be fine for SCSI too. Also, runtime allocation
> of sg entries has been verified and has always been used in the original
> legacy io path.
>
> No performance effect is seen in my big BS test on scsi_debug.
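
As a quick check of the figure quoted above, here is a minimal C sketch of the arithmetic. The helper names are hypothetical and are not kernel functions. The 70 queues, the depth of 3781 and the 2 KiB per command come from the commit message; the 32-byte entry size used for the two inline sg entries is an assumption, since sizeof(struct scatterlist) varies with architecture and config.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not kernel code): total bytes pre-allocated when
 * every tag in every hw queue gets its own sg list buffer.
 */
static uint64_t sgl_prealloc_bytes(uint64_t nr_hw_queues,
                                   uint64_t queue_depth,
                                   uint64_t bytes_per_cmd)
{
    return nr_hw_queues * queue_depth * bytes_per_cmd;
}

/*
 * Hypothetical helper: the same total when only nr_inline sg entries are
 * pre-allocated per command. entry_size stands in for
 * sizeof(struct scatterlist) and is an assumption supplied by the caller.
 */
static uint64_t inline_prealloc_bytes(uint64_t nr_hw_queues,
                                      uint64_t queue_depth,
                                      uint64_t nr_inline,
                                      uint64_t entry_size)
{
    return sgl_prealloc_bytes(nr_hw_queues, queue_depth,
                              nr_inline * entry_size);
}
```

With the lpfc numbers, sgl_prealloc_bytes(70, 3781, 2048) is 542044160 bytes, which is about 517MB. Under the assumed 32-byte entry size, inline_prealloc_bytes(70, 3781, 2, 32) is 16938880 bytes, roughly 16MB.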

Reviewed-by: Bart Van Assche <bvanassche@acm.org>