On Thu, Feb 21, 2019 at 06:15:14PM +0000, Christopher Lameter wrote:
> On Wed, 20 Feb 2019, Michal Hocko wrote:
>
> > > I don't like the existing approaches but I can present them?
> >
> > Please give us at least some rough outline so that we can evaluate a
> > general interest and see how/whether to schedule such a topic.
>
> Ok. I am fuzzy on this one too. Let's give this another shot:
>
> In the HPC world we often have to bypass operating system mechanisms for
> full speed. Usually this has been done through accelerators in the
> network card, through sharing memory between multiple systems (with NUMA
> being a special case of this), or with devices that provide some
> specialized memory access. There is a whole issue here with pinned
> memory access (I think that is handled in another session at the MM
> summit).
>
> The intent was typically to bring the data into the system so that an
> application can act on it. However, with interconnect speeds increasing
> to the point that they may exceed the internal busses of contemporary
> platforms, that may have to change, since the processor and the system
> as a whole are no longer able to handle the inbound data stream. This is
> partially due to I/O bus speeds no longer increasing.
>
> The solutions to this issue coming from some vendors fall mostly into
> the following categories:
>
> A) Provide preprocessing in the NIC.
>
>    This can compress data, modify it and direct it to certain cores of
>    the system. Preprocessing may allow multiple hosts to use one NIC
>    (which makes sense, since a single host may no longer be able to
>    handle the data).
>
> B) Provide fast memory in the NIC.
>
>    Since the NIC is at capacity limits when it comes to pushing data
>    from the NIC into memory, the obvious solution is not to go to main
>    memory but to provide faster on-NIC memory that can then be accessed
>    from the host as needed. Now the applications create I/O bottlenecks
>    when accessing their data, or they need to implement complicated
>    transfer mechanisms to retrieve and store data on the NIC memory.
>
> C) Direct passthrough to other devices.
>
>    The host I/O bus is used, or another enhanced bus is provided, to
>    reach other system components without the constraints imposed by the
>    OS or hardware. This means for example that a NIC can write directly
>    to an NVMe storage device (e.g. NVMe-oF). A NIC can directly exchange
>    data with another NIC. In an extreme case a hardware-addressable
>    global data fabric exists that is shared between multiple systems,
>    and the devices can share memory areas with one another. In the ultra
>    extreme case there is a bypass even on the memory channels, since
>    non-volatile memory (essentially a storage device) is now supported
>    that way.
>
> All of this leads to the development of numerous specialized
> accelerators and special mechanisms to access memory on such devices. We
> already see a proliferation of various remote memory schemes (HMM, PCI
> device memory, etc.)
>
> So how does memory work in the systems of the future? It seems that we
> may need some new way of tracking memory that is remote on some device,
> in addition to the classic NUMA nodes. Or can we change the existing
> NUMA schemes to cover these use cases?
>
> We need some consistent and hopefully vendor neutral way to work with
> memory, I think.
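
On the question of reusing the existing NUMA scheme, it is worth spelling
out how low the barrier is. If device memory shows up as a regular NUMA
node, every NUMA-aware application can already reach it through plain
libnuma, with nothing to tell it apart from DRAM. A minimal sketch of what
that looks like (node 2 here is a made-up placeholder for a hypothetical
device-memory node):

/* Minimal sketch: what "device memory as a regular NUMA node" means for
 * existing userspace.  Node 2 is a made-up placeholder for a hypothetical
 * device-memory node; nothing here can tell it apart from ordinary DRAM.
 * Build with: gcc demo.c -lnuma
 */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	if (numa_available() < 0 || numa_max_node() < 2) {
		fprintf(stderr, "no NUMA support or no node 2\n");
		return 1;
	}

	/* Any NUMA-aware application can already do this today.  If
	 * node 2 happens to be non-coherent device memory, the
	 * application has no way to know that plain stores or atomic
	 * operations on it might misbehave. */
	size_t sz = 1 << 20;
	void *buf = numa_alloc_onnode(sz, 2);
	if (!buf)
		return 1;

	memset(buf, 0, sz);	/* faulted pages land on node 2 */
	numa_free(buf, sz);
	return 0;
}

Nothing in that program, or in the many existing applications written the
same way, knows whether the node behind the allocation is coherent or
supports atomics.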
Note that I proposed a topic about that [1]. NUMA is really hard to work
with for device memory, and adding memory that might not be cache coherent,
or might not support atomic operations, as a regular NUMA node is not a
good idea: existing applications might start using such memory unaware of
all its peculiarities. Anyway, it is definitely a topic I believe we need
to discuss, and I intend to present the problem from the GPU/accelerator
point of view (as today these are the hardware with sizeable fast local
memory).

Cheers,
Jérôme

[1] https://www.mail-archive.com/linux-kernel@xxxxxxxxxxxxxxx/msg1904033.html

>
> ----- Old proposal
>
> 400G Infiniband will become available this year. This means that the
> data ingest speeds can be higher than the bandwidth of the processor
> interacting with its own memory.
>
> For example, a single hardware thread is limited to 20 Gbytes/sec,
> whereas the network interface provides 50 Gbytes/sec. These rates can
> currently only be obtained with pinned memory.
>
> How can we evolve the memory management subsystem to operate at these
> higher speeds while keeping the comforts of paging and system calls
> that we are used to?
>
> It is likely that these speeds will increase further, and since the
> leading processor vendor seems to be caught in a management-induced
> corporate suicide attempt, we will not likely see any progress on the
> processors from there. The straightforward solution would be to use the
> high speed fabric tech for the internal busses (doh!). Alternate
> processors are likely to show up in 2019 and 2020, but those will take
> a long time to mature.
>
> So what does the future hold, and how do we scale up our HPC systems
> given these problems?
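
For context on the pinned-memory point quoted above: 400 Gbit/s on the
wire is 50 Gbytes/sec, more than twice what a single hardware thread can
move, and the way RDMA applications sustain that today is by registering
buffers up front, which pins the pages so the NIC can DMA into them
without ever faulting. A minimal libibverbs sketch of that fast path
(first device in the system assumed, error handling trimmed):

/* Minimal sketch of the pinned-memory fast path: registering a buffer
 * with the RDMA stack pins its pages so the NIC can DMA at line rate.
 * Build with: gcc pin.c -libverbs
 */
#include <infiniband/verbs.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	int num;
	struct ibv_device **devs = ibv_get_device_list(&num);
	if (!devs || num == 0) {
		fprintf(stderr, "no RDMA device\n");
		return 1;
	}

	struct ibv_context *ctx = ibv_open_device(devs[0]);
	struct ibv_pd *pd = ibv_alloc_pd(ctx);

	size_t len = 1UL << 20;
	void *buf = malloc(len);

	/* This is the pinning step: ibv_reg_mr() takes long-term
	 * references on the pages (get_user_pages() underneath), so
	 * they cannot be migrated, swapped, or compacted for the
	 * lifetime of the memory region. */
	struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
				       IBV_ACCESS_LOCAL_WRITE |
				       IBV_ACCESS_REMOTE_WRITE);
	if (!mr) {
		fprintf(stderr, "registration failed\n");
		return 1;
	}

	/* ... post receives / RDMA writes against mr->lkey, mr->rkey ... */

	ibv_dereg_mr(mr);
	ibv_dealloc_pd(pd);
	ibv_close_device(ctx);
	ibv_free_device_list(devs);
	free(buf);
	return 0;
}

It is exactly this long-term pinning, opaque to the MM, that keeps coming
back in these discussions: as long as line rate requires it, paging and
the rest of the comforts the old proposal mentions stay out of reach.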