On 3/20/25 7:18 AM, Christoph Hellwig wrote:
> On Thu, Mar 20, 2025 at 04:41:11AM -0700, Luis Chamberlain wrote:
>> We've been constrained to a max single 512 KiB IO for a while now on x86_64.
>
> No, we absolutely haven't. I'm regularly seeing multi-MB I/O on both
> SCSI and NVMe setup.

Is NVME_MAX_KB_SZ the current maximum I/O size for PCIe NVMe
controllers? From drivers/nvme/host/pci.c:
/*
 * These can be higher, but we need to ensure that any command doesn't
 * require an sg allocation that needs more than a page of data.
 */
#define NVME_MAX_KB_SZ          8192
#define NVME_MAX_SEGS           128
#define NVME_MAX_META_SEGS      15
#define NVME_MAX_NR_ALLOCATIONS 5
>> This is due to the number of DMA segments and the segment size.
>
> In nvme the max_segment_size is UINT_MAX, and for most SCSI HBAs it is
> fairly large as well.
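
If I read those constants correctly, they bound a single command in two
different ways: NVME_MAX_SEGS caps the number of DMA segments, and
NVME_MAX_KB_SZ caps the total payload. A minimal user-space sketch of
that arithmetic (my own model, not driver code; it assumes x86_64 with
4 KiB pages, and that the worst case maps one non-contiguous page per
segment):

#include <stdio.h>

#define NVME_MAX_KB_SZ 8192   /* payload cap, from drivers/nvme/host/pci.c */
#define NVME_MAX_SEGS  128    /* DMA segment cap, same file */
#define PAGE_KB        4      /* assumption: 4 KiB pages on x86_64 */

int main(void)
{
	/* Worst case: every segment covers a single non-contiguous page. */
	unsigned int worst_kb = NVME_MAX_SEGS * PAGE_KB;      /* 512 KiB */

	/*
	 * Best case: segments are physically contiguous, so only the
	 * payload cap applies (max_segment_size being UINT_MAX).
	 */
	unsigned int best_kb = NVME_MAX_KB_SZ;                /* 8192 KiB */

	printf("page-per-segment: %u KiB\n", worst_kb);
	printf("contiguous segments: %u KiB\n", best_kb);
	return 0;
}

That would reconcile the two statements above: 512 KiB when every
segment is a single page, multi-MB once segments are physically
contiguous.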
I have a question for NVMe device manufacturers. It has been known for
a long time that submitting large I/Os with the NVMe SGL format takes
less CPU time than submitting them with the NVMe PRP format. Is that
enough to motivate NVMe device manufacturers to implement SGL support?
All SCSI controllers I know of, including UFS controllers, support
something much closer to the NVMe SGL format than to the NVMe PRP
format.
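
To make the CPU-time difference concrete, here is an illustrative
sketch. The descriptor layout follows the NVMe specification (and, as
far as I can tell, matches struct nvme_sgl_desc in include/linux/nvme.h);
the 8 MiB transfer split into 16 contiguous runs is an assumption I
picked for the example, and the counting logic is mine, not driver code:

#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE 4096u

struct nvme_sgl_desc {          /* 16 bytes per descriptor */
	uint64_t addr;
	uint32_t length;        /* byte count of the whole segment */
	uint8_t  rsvd[3];
	uint8_t  type;
};

int main(void)
{
	uint32_t io_bytes = 8192u * 1024u;  /* one 8 MiB I/O */
	uint32_t contig_runs = 16;          /* assumed contiguous ranges */

	/* PRP: one 8-byte page pointer per page, contiguous or not. */
	uint32_t prp_entries = io_bytes / PAGE_SIZE;

	/* SGL: one descriptor per physically contiguous range. */
	uint32_t sgl_descs = contig_runs;

	printf("PRP entries:     %u (%u bytes to build)\n",
	       prp_entries, prp_entries * 8u);
	printf("SGL descriptors: %u (%u bytes to build)\n",
	       sgl_descs, sgl_descs * (unsigned)sizeof(struct nvme_sgl_desc));
	return 0;
}

Building and walking 2048 PRP entries per 8 MiB command, versus 16 SGL
descriptors, is where the CPU time goes, and the gap only widens as
I/O sizes grow.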
Thanks,
Bart.