Hi all,
having recently played around with CXL I started to wonder what
implications that would have for NVMe-over-Fabrics, and how path
selection would play out on such a system.
Thing is, on heavily NUMA systems we really should have a look at
the inter-node latencies, especially as the HW latencies are getting
closer to the NUMA latencies: on an Intel two-socket node I'm seeing
inter-node latencies of around 200ns, and it's not unheard of to get
around 5M IOPS from the device, which works out to roughly 200ns per
command.
And that's on PCIe 4.0. With PCIe 5.0 or CXL one expects the latency
to decrease even further.
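To spell out the arithmetic (treating the IOPS figure as a
per-command rate, i.e. queue depth 1):

    per-command time at 5M IOPS:  1s / 5,000,000 = 200ns
    one cross-socket hop:                         ~200ns

so routing an I/O via the wrong node can roughly double the
effective completion latency.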
So I think we need to factor in the NUMA topology for PCI devices,
too. We do have a NUMA I/O policy, but that only looks at the latency
between nodes.
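For reference, the current 'numa' iopolicy essentially boils down to
picking the path whose controller has the smallest node_distance() to
the node issuing the I/O. A stripped-down sketch (not the literal
multipath.c code; 'struct nvme_path' is just a stand-in here):

    #include <linux/list.h>
    #include <linux/limits.h>
    #include <linux/topology.h>

    struct nvme_path {
            struct list_head entry;
            int ctrl_node;          /* NUMA node of the controller */
    };

    /* Pick the path with the smallest NUMA distance to 'node'. */
    static struct nvme_path *select_path(struct list_head *paths, int node)
    {
            struct nvme_path *p, *best = NULL;
            int best_dist = INT_MAX;

            list_for_each_entry(p, paths, entry) {
                    int dist = node_distance(node, p->ctrl_node);

                    if (dist < best_dist) {
                            best_dist = dist;
                            best = p;
                    }
            }
            return best;
    }

Note that only the node-to-node distance enters the decision; the
latency of the device itself never shows up.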
What we're missing is a NUMA latency for the PCI devices themselves.
So this discussion would be about how we could model (or even
measure) the PCI latency, and how we could modify the NVMe-oF
iopolicies to take the NUMA latencies into account when selecting
the 'best' path.
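As a strawman for the discussion (everything here is made up: the
field name, the scaling factor, and where the number would come from
-- measurement, ACPI HMAT, or the CXL CDAT tables -- is exactly the
open question), the iopolicy side could be as simple as folding a
per-path device latency into the cost that gets minimized:

    /* Hypothetical: nvme_path grows a dev_latency_ns field, filled
     * in by measurement or from platform tables (HMAT/CDAT). The
     * scaling between node_distance() units and nanoseconds is
     * hand-waved.
     */
    #define LATENCY_NS_PER_DISTANCE 10      /* made-up scale factor */

    static int path_cost(struct nvme_path *p, int node)
    {
            return node_distance(node, p->ctrl_node) +
                   p->dev_latency_ns / LATENCY_NS_PER_DISTANCE;
    }

Path selection would then minimize path_cost() instead of the raw
node_distance(), so a measured device latency falls out of the same
comparison as the NUMA distance.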
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@xxxxxxx +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich