On Wed, 20 Feb 2019, Michal Hocko wrote:

> > I dont like the existing approaches but I can present them?
>
> Please give us at least some rough outline so that we can evaluate a
> general interest and see how/whether to schedule such a topic.

Ok. I am fuzzy on this one too. Let's give this another shot:

In the HPC world we often have to bypass operating system mechanisms to
get full speed. Usually this has been done through accelerators in the
network card, by sharing memory between multiple systems (with NUMA
being a special case of this), or with devices that provide some
specialized memory access. There is a whole issue here with pinned
memory access (I think that is handled in another session at the MM
summit).

The intent was typically to bring the data into the system so that an
application can act on it. However, with interconnect speeds increasing
to the point where they may exceed the internal busses of contemporary
platforms, that may have to change: the processor and the system as a
whole are no longer able to handle the inbound data stream. This is
partially due to I/O bus speeds no longer increasing.

The solutions to this issue coming from some vendors fall mostly into
the following categories:

A) Provide preprocessing in the NIC.

   This can compress data, modify it and direct it to certain cores of
   the system. Preprocessing may also allow multiple hosts to share one
   NIC (which makes sense, since a single host may no longer be able to
   handle the data).

B) Provide fast memory in the NIC.

   Since the NIC is at its capacity limits when pushing data into main
   memory, the obvious solution is to not go to main memory at all but
   to provide faster on-NIC memory that can then be accessed from the
   host as needed. Now the applications create I/O bottlenecks when
   accessing their data, or they need to implement complicated transfer
   mechanisms to retrieve data from and store data onto the NIC memory.

C) Direct passthrough to other devices.

   The host I/O bus is used, or another enhanced bus is provided, to
   reach other system components without the constraints imposed by the
   OS or the hardware. This means for example that a NIC can write
   directly to an NVMe storage device (e.g. NVMe-oF), or that a NIC can
   exchange data directly with another NIC. In an extreme case a
   hardware-addressable global data fabric exists that is shared between
   multiple systems, and the devices can share memory areas with one
   another. In the ultra extreme case there is a bypass even over the
   memory channels, since non-volatile memory (essentially a storage
   device) is now supported that way.

All of this leads to the development of numerous specialized
accelerators and special mechanisms to access memory on such devices.
We already see a proliferation of various remote memory schemes (HMM,
PCI device memory, etc).

So how does memory work in the systems of the future? It seems that we
may need some new way of tracking memory that is remote on some device,
in addition to the classic NUMA nodes. Or can we change the existing
NUMA schemes to cover these use cases? We need some consistent and
hopefully vendor neutral way to work with memory, I think (a rough
sketch of reusing the existing NUMA interfaces for this is at the end
of this mail).

----- Old proposal

400G Infiniband will become available this year. This means that the
data ingest speeds can be higher than the bandwidth of the processor
interacting with its own memory. For example, a single hardware thread
is limited to 20 Gbytes/sec whereas the network interface provides
50 Gbytes/sec, so more than two hardware threads would have to do
nothing but move data just to keep up. These rates can only be obtained
currently with pinned memory.
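To illustrate the pinned memory point: this is roughly how an
application ends up with pinned memory today through libibverbs. The
ibv_reg_mr() call pins the pages of the buffer for the lifetime of the
registration so the NIC can DMA at line rate, which is exactly the
bypass of normal paging that the MM subsystem then has to live with.
A minimal sketch (buffer size and access flags are just illustrative,
error handling trimmed, link with -libverbs):

#include <infiniband/verbs.h>
#include <stdio.h>
#include <stdlib.h>

#define BUF_SIZE (64UL * 1024 * 1024)	/* illustrative 64MB receive buffer */

int main(void)
{
	struct ibv_device **devs = ibv_get_device_list(NULL);
	if (!devs || !devs[0]) {
		fprintf(stderr, "no RDMA device found\n");
		return 1;
	}

	struct ibv_context *ctx = ibv_open_device(devs[0]);
	struct ibv_pd *pd = ibv_alloc_pd(ctx);

	void *buf = aligned_alloc(4096, BUF_SIZE);

	/*
	 * The pinning step: the pages backing buf are faulted in and
	 * locked so they stay put for DMA. They are unavailable for
	 * reclaim or migration until ibv_dereg_mr().
	 */
	struct ibv_mr *mr = ibv_reg_mr(pd, buf, BUF_SIZE,
				       IBV_ACCESS_LOCAL_WRITE |
				       IBV_ACCESS_REMOTE_WRITE);
	if (!mr) {
		perror("ibv_reg_mr");
		return 1;
	}

	/* ... post receive work requests that target mr ... */

	ibv_dereg_mr(mr);
	ibv_dealloc_pd(pd);
	ibv_close_device(ctx);
	ibv_free_device_list(devs);
	free(buf);
	return 0;
}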
How can we evolve the memory management subsystem to operate at these
higher speeds while keeping the comforts of paging and the system calls
that we are used to? It is likely that these speeds will increase
further, and since the lead processor vendor seems to be caught in a
management-induced corporate suicide attempt we are not likely to see
any progress on the processors from there. The straightforward solution
would be to use the high speed fabric technology for the internal
busses as well (doh!). Alternate processors are likely to show up in
2019 and 2020 but those will take a long time to mature.

So what does the future hold and how do we scale up our HPC systems
given these problems?
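To make the NUMA question from earlier in this mail a bit more
concrete: if device memory or fabric-attached remote memory were
surfaced as just another (possibly CPU-less) NUMA node, applications
could keep using the existing libnuma/mbind placement interfaces.
A minimal sketch, with node 1 standing in for a hypothetical
device-memory node (link with -lnuma):

#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	if (numa_available() < 0) {
		fprintf(stderr, "no NUMA support\n");
		return 1;
	}

	int device_node = 1;		/* hypothetical device/remote memory node */
	size_t len = 256UL << 20;	/* 256MB staging area */

	/* Allocate on the chosen node; the same call would work for a
	   CPU-less node that only contributes memory. */
	void *buf = numa_alloc_onnode(len, device_node);
	if (!buf) {
		perror("numa_alloc_onnode");
		return 1;
	}

	memset(buf, 0, len);		/* touch the pages so they get placed */

	numa_free(buf, len);
	return 0;
}

Whether the node abstraction (and its distance information) is
expressive enough for on-NIC memory or a shared data fabric is exactly
the open question.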