On 17/07/2023 03.53, Mina Almasry wrote:
> On Fri, Jul 14, 2023 at 8:55 AM Jason Gunthorpe <jgg@xxxxxxxx> wrote:
>> On Fri, Jul 14, 2023 at 07:55:15AM -0700, Mina Almasry wrote:
>>> Once the skb frags with struct new_abstraction are in the TCP stack,
>>> they will need some special handling in code accessing the frags. But
>>> my RFC already addressed that somewhat because the frags were
>>> inaccessible in that case. In this case the frags will be both
>>> inaccessible and will not be struct pages at all (things like
>>> get_page() will not work), so more special handling will be required,
>>> maybe.
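
To illustrate the kind of special handling this implies, a minimal
sketch is below; frag_is_devmem() and devmem_frag_get() are hypothetical
names (not existing kernel API), only showing the extra branch code
would need once a frag is no longer backed by a struct page:

#include <linux/skbuff.h>
#include <linux/mm.h>

/* Sketch only: frag_is_devmem() and devmem_frag_get() are hypothetical
 * helpers for the new abstraction; skb_frag_page()/get_page() are the
 * existing struct-page path. */
static void frag_get_sketch(skb_frag_t *frag)
{
	if (frag_is_devmem(frag))		/* hypothetical predicate */
		devmem_frag_get(frag);		/* hypothetical ref-take */
	else
		get_page(skb_frag_page(frag));	/* current behavior */
}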
>> It seems sort of reasonable, though there will be interesting concerns
>> about coherence and synchronization with general purpose DMABUFs that
>> will need tackling.
>> Still it is such a lot of churn and weirdness on the netdev side, I
>> think you'd do well to present an actual full application as
>> justification.
>> Yes, you showed you can stick unordered TCP data frags into GPU memory
>> sort of quickly, but have you gone further with this to actually show
>> it is useful for a real world GPU centric application?
>> BTW your cover letter said 96% utilization, the usual server
>> configuration is one NIC per GPU, so you were able to hit 1500Gb/sec of
>> TCP BW with this?
> I do notice that the number of NICs is missing from our public
> documentation so far, so I will refrain from specifying how many NICs
> are on those A3 VMs until the information is public. But I think I can
> confirm that your general thinking is correct, the perf that we're
> getting is 96.6% line rate of each GPU/NIC pair,
What do you mean by 96.6% "line rate"?
Is this the Ethernet line-rate?
Or is the measured throughput the TCP data "goodput"?
Assuming
- MTU 1500 bytes (1514 on wire).
- Ethernet header 14 bytes
- IP header 20 bytes
- TCP header 20 bytes
Due to header overhead the goodput will be approx 96.4%.
- (1514-(14+20+20))/1514 = 0.9643
- (Not taking Ethernet interframe gap into account).
Thus, maybe you have hit Ethernet wire line-rate already?
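
For completeness, the same overhead arithmetic as a small stand-alone C
program (a sketch; it only restates the calculation above, and ignores
the interframe gap and preamble):

#include <stdio.h>

int main(void)
{
	const double wire_bytes = 1500 + 14;	/* 1500 byte MTU + 14 byte Ethernet header */
	const double hdr_bytes  = 14 + 20 + 20;	/* Ethernet + IPv4 + TCP headers */
	const double goodput    = (wire_bytes - hdr_bytes) / wire_bytes;

	printf("TCP goodput ratio: %.4f\n", goodput);	/* prints 0.9643 */
	return 0;
}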
> and scales linearly
> for each NIC/GPU pair we've tested with so far. Line rate of each
> NIC/GPU pair is 200 Gb/sec.
> So if we have 8 NIC/GPU pairs we'd be hitting 96.6% * 200 * 8 = 1545 GB/sec.
Let's keep our units straight.
Here you mean 1545 Gbit/sec, which is 193 GBytes/s
> If we have, say, 2 NIC/GPU pairs, we'd be hitting 96.6% * 200 * 2 = 384 GB/sec
Here you mean 384 Gbit/sec, which is 48 GBytes/sec.
> ...
> etc.
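
To keep the bit/byte conversion explicit, here is a tiny helper sketch
(numbers taken from the exchange above; nothing in it comes from the
patchset itself):

#include <stdio.h>

/* Convert a rate in Gbit/sec to GBytes/sec (1 byte = 8 bits). */
static double gbit_to_gbyte(double gbit)
{
	return gbit / 8.0;
}

int main(void)
{
	/* 8 NIC/GPU pairs, 200 Gbit/sec each, at 96.6% goodput */
	double total_gbit = 0.966 * 200.0 * 8;

	printf("%.1f Gbit/sec = %.1f GBytes/sec\n",
	       total_gbit, gbit_to_gbyte(total_gbit));	/* ~1545.6 and ~193.2 */
	return 0;
}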
These massive throughput numbers are important, because they *exceed*
the physical host RAM/DIMM memory speeds.
This is the *real argument* for why software cannot afford to do even a
single copy of the data from host-RAM into GPU-memory: the CPU memory
throughput to DRAM/DIMM is simply insufficient.
My testlab CPU (E5-1650) has 4 DDR4 DIMM slots:
- Data Width: 64 bits (= 8 bytes)
- Configured Memory Speed: 2400 MT/s
- Theoretical maximum memory bandwidth: 76.8 GBytes/s (2400*8*4)
Even the theoretical max 76.8 GBytes/s (614 Gbit/s) is not enough for
the 193 GBytes/s or 1545 Gbit/s (8 NIC/GPU pairs).
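
Putting the two numbers side by side in a small sketch (the DIMM
figures are from the values above; the NIC figure assumes 8 pairs at
200 Gbit/sec and 96.6% goodput):

#include <stdio.h>

int main(void)
{
	/* DDR4-2400, 64-bit (8 byte) data width, 4 DIMMs/channels */
	const double dram_gbytes = 2400e6 * 8 * 4 / 1e9;	/* 76.8 GBytes/s */

	/* 8 NIC/GPU pairs * 200 Gbit/s * 96.6% goodput, in GBytes/s */
	const double nic_gbytes  = 8 * 200.0 * 0.966 / 8.0;	/* ~193.2 GBytes/s */

	printf("Theoretical DRAM bandwidth: %.1f GBytes/s\n", dram_gbytes);
	printf("8x NIC/GPU goodput        : %.1f GBytes/s\n", nic_gbytes);
	return 0;
}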
When testing this with the lmbench tool bw_mem, the results (below the
signature) are in the area of 14.8 GBytes/sec (118 Gbit/s) as soon as
the working set exceeds the L3 cache size. In practice it looks like
main memory is limited to reading 118 Gbit/s *once*. (Mina's NICs run
at 200 Gbit/s.)
Given DDIO can deliver network packets into L3, I also tried to figure
out the L3 read bandwidth, which I measured to be 42.4 GBytes/sec
(339 Gbit/s), in hopes that it would be enough, but it was not.
--Jesper
(data below signature)
CPU under test:
$ cat /proc/cpuinfo | egrep -e 'model name|cache size' | head -2
model name : Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz
cache size : 15360 KB
Providing some cmdline outputs from lmbench "bw_mem" tool.
(Output format is "%0.2f %.2f\n", megabytes, megabytes_per_second)
$ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 256M rd
256.00 14924.50
$ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 256M wr
256.00 9895.25
$ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 256M rdwr
256.00 9737.54
$ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 256M bcopy
256.00 12462.88
$ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 256M bzero
256.00 14869.89
The next outputs show that reducing the size below the L3 cache size
gives an increase in speed, likely reflecting the L3 bandwidth.
$ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 64M rd
64.00 14840.58
$ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 32M rd
32.00 14823.97
$ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 16M rd
16.00 24743.86
$ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 8M rd
8.00 40852.26
$ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 4M rd
4.00 42545.65
$ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 2M rd
2.00 42447.82
$ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 1M rd
1.00 42447.82