On Fri, Jan 24, 2025 at 05:58:09AM +0000, Matthew Wilcox wrote:
> On Wed, Jan 15, 2025 at 10:58:54AM -0800, Joshua Hahn wrote:
> > On machines with multiple memory nodes, interleaving page allocations
> > across nodes allows for better utilization of each node's bandwidth.
> > Previous work by Gregory Price [1] introduced weighted interleave, which
> > allowed for pages to be allocated across NUMA nodes according to
> > user-set ratios.
>
> I still don't get it. You always want memory to be on the local node or
> the fabric gets horribly congested and slows you right down. But you're
> not really talking about NUMA, are you? You're talking about CXL.
>
> And CXL is terrible for bandwidth. I just ran the numbers.
>
> On a current Intel top-end CPU, we're looking at 8x DDR5-4800 DIMMs,
> each with a bandwidth of 38.4GB/s for a total of 300GB/s.
>
> For each CXL lane, you take a lane of PCIe gen5 away. So that's
> notionally 32Gbit/s, or 4GB/s per lane. But CXL is crap, and you'll be
> lucky to get 3 cachelines per 256 byte packet, dropping you down to 3GB/s.
> You're not going to use all 80 lanes for CXL (presumably these CPUs are
> going to want to do I/O somehow), so maybe allocate 20 of them to CXL.
> That's 60GB/s, or a 20% improvement in bandwidth. On top of that,
> it's slow, with a minimum of 10ns latency penalty just from the CXL
> encode/decode penalty.
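
To make the arithmetic above easy to check, here is a minimal back-of-the-envelope
sketch. Every input is taken from Matthew's figures (38.4 GB/s per DDR5-4800 DIMM,
4 GB/s per PCIe gen5 lane, 3 usable 64-byte cachelines per 256-byte packet, 20
lanes given to CXL); nothing here is a measurement.

/*
 * Sanity-check of the bandwidth numbers quoted above.  All inputs are
 * Matthew's assumptions, not measured values.
 */
#include <stdio.h>

int main(void)
{
	double dimm_gbs = 38.4;			/* one DDR5-4800 DIMM */
	double dram_gbs = 8 * dimm_gbs;		/* 8 DIMMs: ~307 GB/s ("300" above) */
	double lane_gbs = 32.0 / 8;		/* PCIe gen5: 32 Gbit/s -> 4 GB/s */
	/* 3 x 64-byte cachelines of useful payload per 256-byte packet */
	double cxl_lane_gbs = lane_gbs * (3 * 64.0) / 256;	/* 3 GB/s */
	double cxl_gbs = 20 * cxl_lane_gbs;	/* 20 lanes -> 60 GB/s */

	printf("DRAM:            %6.1f GB/s\n", dram_gbs);
	printf("CXL (20 lanes):  %6.1f GB/s\n", cxl_gbs);
	printf("added bandwidth: %6.1f%%\n", 100 * cxl_gbs / dram_gbs);
	return 0;
}

Run as written, this prints ~307 GB/s of DRAM bandwidth, 60 GB/s of CXL
bandwidth, and a ~20% uplift, which matches the figures in the mail.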