On Wed, 20 Feb 2019, Michal Hocko wrote:

> > I dont like the existing approaches but I can present them?
>
> Please give us at least some rough outline so that we can evaluate a
> general interest and see how/whether to schedule such a topic.

Ok. I am fuzzy on this one too. Let's give this another shot:

In the HPC world we often have to bypass operating system mechanisms to
get full speed. Usually this has been done through accelerators in the
network card, by sharing memory between multiple systems (with NUMA
being a special case of this), or with devices that provide some
specialized memory access. There is a whole issue here with pinned
memory access (I think that is handled in another session at the MM
summit).

The intent was typically to bring the data into the system so that an
application can act on it. However, with interconnect speeds increasing
to the point where they may exceed the internal busses of contemporary
platforms, that may have to change: the processor and the system as a
whole are no longer able to handle the inbound data stream. This is
partially due to I/O bus speeds no longer increasing.

The solutions to this issue coming from some vendors fall mostly into
the following categories:

A) Provide preprocessing in the NIC.

   This can compress data, modify it and direct it to certain cores of
   the system. Preprocessing may also allow multiple hosts to share one
   NIC (which makes sense, since a single host may no longer be able to
   handle the data).

B) Provide fast memory in the NIC.

   Since the NIC is at its capacity limits when pushing data into main
   memory, the obvious solution is to not go to main memory at all but
   to provide faster on-NIC memory that can then be accessed from the
   host as needed. Now the applications create I/O bottlenecks when
   accessing their data, or they need to implement complicated transfer
   mechanisms to retrieve data from and store data onto the NIC memory.

C) Direct passthrough to other devices.

   The host I/O bus is used, or another enhanced bus is provided, to
   reach other system components without the constraints imposed by the
   OS or the hardware. This means for example that a NIC can write
   directly to an NVMe storage device (e.g. NVMe-oF), or that a NIC can
   exchange data directly with another NIC. In an extreme case a
   hardware-addressable global data fabric exists that is shared between
   multiple systems, and the devices can share memory areas with one
   another. In the ultra extreme case there is a bypass even over the
   memory channels, since non-volatile memory (essentially a storage
   device) is now supported that way.

All of this leads to the development of numerous specialized
accelerators and special mechanisms to access memory on such devices.
We already see a proliferation of various remote memory schemes (HMM,
PCI device memory, etc).

So how does memory work in the systems of the future? It seems that we
may need some new way of tracking memory that is remote on some device,
in addition to the classic NUMA nodes. Or can we change the existing
NUMA schemes to cover these use cases? We need some consistent and
hopefully vendor neutral way to work with memory, I think (a rough
sketch of reusing the existing NUMA interfaces for this is at the end
of this mail).

----- Old proposal

400G Infiniband will become available this year. This means that the
data ingest speeds can be higher than the bandwidth of the processor
interacting with its own memory. For example, a single hardware thread
is limited to 20 Gbytes/sec whereas the network interface provides
50 Gbytes/sec, so more than two hardware threads would have to do
nothing but move data just to keep up. These rates can only be obtained
currently with pinned memory.
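To illustrate the pinned memory point: this is roughly how an
application ends up with pinned memory today through libibverbs. The
ibv_reg_mr() call pins the pages of the buffer for the lifetime of the
registration so the NIC can DMA at line rate, which is exactly the
bypass of normal paging that the MM subsystem then has to live with.
A minimal sketch (buffer size and access flags are just illustrative,
error handling trimmed, link with -libverbs):

#include <infiniband/verbs.h>
#include <stdio.h>
#include <stdlib.h>

#define BUF_SIZE (64UL * 1024 * 1024)	/* illustrative 64MB receive buffer */

int main(void)
{
	struct ibv_device **devs = ibv_get_device_list(NULL);
	if (!devs || !devs[0]) {
		fprintf(stderr, "no RDMA device found\n");
		return 1;
	}

	struct ibv_context *ctx = ibv_open_device(devs[0]);
	struct ibv_pd *pd = ibv_alloc_pd(ctx);

	void *buf = aligned_alloc(4096, BUF_SIZE);

	/*
	 * The pinning step: the pages backing buf are faulted in and
	 * locked so they stay put for DMA. They are unavailable for
	 * reclaim or migration until ibv_dereg_mr().
	 */
	struct ibv_mr *mr = ibv_reg_mr(pd, buf, BUF_SIZE,
				       IBV_ACCESS_LOCAL_WRITE |
				       IBV_ACCESS_REMOTE_WRITE);
	if (!mr) {
		perror("ibv_reg_mr");
		return 1;
	}

	/* ... post receive work requests that target mr ... */

	ibv_dereg_mr(mr);
	ibv_dealloc_pd(pd);
	ibv_close_device(ctx);
	ibv_free_device_list(devs);
	free(buf);
	return 0;
}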
How can we evolve the memory management subsystem to operate at these
higher speeds while keeping the comforts of paging and the system calls
that we are used to? It is likely that these speeds will increase
further, and since the lead processor vendor seems to be caught in a
management-induced corporate suicide attempt we are not likely to see
any progress on the processors from there. The straightforward solution
would be to use the high speed fabric technology for the internal
busses as well (doh!). Alternate processors are likely to show up in
2019 and 2020 but those will take a long time to mature.

So what does the future hold and how do we scale up our HPC systems
given these problems?
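To make the NUMA question from earlier in this mail a bit more
concrete: if device memory or fabric-attached remote memory were
surfaced as just another (possibly CPU-less) NUMA node, applications
could keep using the existing libnuma/mbind placement interfaces.
A minimal sketch, with node 1 standing in for a hypothetical
device-memory node (link with -lnuma):

#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	if (numa_available() < 0) {
		fprintf(stderr, "no NUMA support\n");
		return 1;
	}

	int device_node = 1;		/* hypothetical device/remote memory node */
	size_t len = 256UL << 20;	/* 256MB staging area */

	/* Allocate on the chosen node; the same call would work for a
	   CPU-less node that only contributes memory. */
	void *buf = numa_alloc_onnode(len, device_node);
	if (!buf) {
		perror("numa_alloc_onnode");
		return 1;
	}

	memset(buf, 0, len);		/* touch the pages so they get placed */

	numa_free(buf, len);
	return 0;
}

Whether the node abstraction (and its distance information) is
expressive enough for on-NIC memory or a shared data fabric is exactly
the open question.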