On Fri, 18 Jan 2019 12:45:13 -0500
Jerome Glisse <jglisse@xxxxxxxxxx> wrote:

Hi Jerome,

I held off on replying to this given we've had quite a few productive
discussions about it in the past and I wanted to see what others came
back with.  They've had plenty of time, so I'll put my inputs on the
table ;)

> Hi, I would like to discuss the NUMA API and its shortcomings when it
> comes to memory hierarchy (from fast HBM, through regular memory, to
> slower persistent memory) and also device memory (which can have its
> own hierarchy).
>
> I have proposed a patch to add a new memory topology model to the
> kernel so that applications are able to get that information; it also
> included a set of new APIs to bind/migrate a process range [1].
> Note that this model also supports device memory.

As an aside, I was a bit disappointed that the HMAT description
currently exported to userspace is limited to the 'best' node only.
This is obviously much simpler than what you propose, but even in that
case we need examples to show how userspace can make use of the much
richer information that is there and not currently made available.
Right now the only way (I think) userspace can make use of that more
detailed information is to parse HMAT directly.  We can probably work
with that to 'prove' the requirement, but it's certainly ugly!

> So far device memory support is achieved through device-specific
> ioctls, and this forbids some scenarios, such as interleaving a range
> across device memory on multiple devices.  It also makes the whole
> userspace side more complex, as programs have to mix and match
> multiple device-specific APIs on top of the NUMA API.
>
> While the memory hierarchy can be more or less exposed through the
> existing NUMA API by creating nodes for non-regular memory [2], I do
> not see this as a satisfying solution.  Moreover, such a scheme does
> not work for device memory that might not even be accessible by CPUs.

I agree with this point even though I mostly care about 'normal' memory
(be it in random places in the system).  Hence my life is a little
easier, as correctness is easy even if performance is not.  (For
reference, I've put a minimal sketch of what that existing NUMA API
route looks like from userspace at the bottom of this mail.)

> Hence I would like to discuss a few points:
>  - What proof do people want that this is a problem we need to solve?

Agreed, this question is crucial to any discussion of more complex
handling.

I'm mostly interested in the 'easier' case of coherent 'normal' memory
over CCIX.  However, a lot of the questions around migration and
topology are the same, just perhaps simpler to implement.  In CCIX we
also have the major advantage that 'most' of our topology is
discoverable by sufficiently clever userspace (excluding the host,
unfortunately).  It does give us a 'playground' to look at some of
these issues, and we'll definitely be exploring them as more complex
systems become readily available.

As has been discussed before, we need to know who the user groups for
this information actually are, and answer the following questions:

1) Are they dealing with few enough hardware topologies that they can
   'know' what they have to tune against?  They might still need more
   advanced interfaces to do it, but those are likely to be device
   specific.  This is perhaps the HPC world at the moment.  It is a
   good group to work with if they are willing to prove the benefit,
   but do they justify a proper kernel description?  Probably not if
   it's just them.

2) If not the above, but rather standard workstations or highly
   customizable systems, will the software be able to make the right
   decisions?

To a degree, this last point could just be a case of a library that
abstracts away the complexity behind the questions people actually
want answered (under a given list of constraints, including load
information):

a) Where should I run this code?
b) Where should I store this data?
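To make that concrete, below is a purely hypothetical sketch of the
sort of library interface I have in mind.  None of these types or
functions exist anywhere today; a real library would clearly need a
much richer constraint description, and this is only meant to frame
the two questions above.

/*
 * Hypothetical placement library sketch -- nothing here exists today.
 * The caller describes its constraints and the library answers
 * "where should I run?" and "where should I put my data?".
 */
#include <stddef.h>

struct placement_constraints {
        size_t   working_set_bytes;   /* how much data the code touches  */
        unsigned max_latency_ns;      /* tolerable memory access latency */
        unsigned min_bandwidth_mbps;  /* required sustained bandwidth    */
        int      near_device_fd;      /* accelerator to stay near, or -1 */
};

/* a) Where should I run this code?  Returns a CPU number, or -1. */
int placement_pick_cpu(const struct placement_constraints *c);

/* b) Where should I store this data?  Returns a memory node, or -1. */
int placement_pick_node(const struct placement_constraints *c);

Whether such a library could actually answer those calls well on an
arbitrary system, given only what the kernel exposes today, is exactly
the open question.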
My instinct is to expose everything to userspace, but I appreciate that
brings a very steep learning curve and chances are it is near
impossible to do in a sensible fashion.

What I do care a lot about is exposing enough topology information that
other data can be used intelligently.  If I have a PMU on a particular
interconnect, I want to be able to tell which memory in my system is on
which side of that interconnect.  Right now I need the system manuals
to find that out.  Arguably those PMUs are sufficiently non-standard
that no generic software could use them anyway, but that is likely to
change in the next year or two as standardization catches up with
reality.

>  - How do we build consensus to move forward on this?

A hard question indeed.  My worry is that we are still too early in the
availability of these highly heterogeneous systems.  It is good to
start making progress now, but it may be a while before we have
clarity.  I know you have systems that are, perhaps, rather less
bleeding edge than mine, so your urgency to solve this may be higher!

Having said that, there is clear demand from the hardware specification
bodies for some idea of where operating systems are going, so that they
can make decisions on exactly what level of self-description their
hardware should provide, to feed up the chain.  I've been sat in
meetings where hardware specs have not done this because we have no
clarity on what the operating systems want.  Much as with the firmware
people, no one wants to specify that information must be provided which
nothing uses, or which might turn out to be the 'wrong' information.

Anyhow, a hard and interesting topic.  I'm sure this discussion and its
follow-ups will keep us busy for a few years yet.  Good to make a start
and hopefully clarify the 'requirements' for any proposal, as you've
suggested.

Jonathan

>  - What kind of syscall API would people like to see?
>
> People to discuss this topic:
> Dan Williams <dan.j.williams@xxxxxxxxx>
> Dave Hansen <dave.hansen@xxxxxxxxx>
> Felix Kuehling <Felix.Kuehling@xxxxxxx>
> John Hubbard <jhubbard@xxxxxxxxxx>
> Jonathan Cameron <jonathan.cameron@xxxxxxxxxx>
> Keith Busch <keith.busch@xxxxxxxxx>
> Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>
> Michal Hocko <mhocko@xxxxxxxxxx>
> Paul Blinzer <Paul.Blinzer@xxxxxxx>
>
> Probably others, sorry if I missed anyone from previous discussions.
>
> Cheers,
> Jérôme
>
> [1] https://lkml.org/lkml/2018/12/3/1072
> [2] https://lkml.org/lkml/2018/12/10/1112
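As promised above, here is a minimal sketch of the 'node for
non-regular memory' route from [2] as it looks from userspace today:
once the special memory shows up as an ordinary node, plain mbind()
(or MPOL_INTERLEAVE over several nodes) covers it.  The node number
used below is purely a placeholder for wherever the platform happens
to put its HBM or device-exposed memory; a real program would discover
it via libnuma or sysfs.  It also shows the limitation Jerome points
out: memory that never gets a node number, because no CPU can reach
it, simply cannot appear in the nodemask.

/*
 * Sketch only: bind an anonymous range to an assumed "special memory"
 * node exposed via the normal NUMA machinery (see [2]).
 * Build with:  gcc -o bind-sketch bind-sketch.c -lnuma
 */
#include <numaif.h>             /* mbind(), MPOL_* */
#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
        size_t len = 64UL << 20;                /* 64 MiB */
        unsigned long nodemask = 1UL << 1;      /* placeholder: assume node 1 is the HBM/device node */

        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) {
                perror("mmap");
                return EXIT_FAILURE;
        }

        /*
         * Set the policy for the range; MPOL_INTERLEAVE with more bits
         * set in the nodemask would spread it over several nodes.
         */
        if (mbind(buf, len, MPOL_BIND, &nodemask,
                  sizeof(nodemask) * 8, MPOL_MF_MOVE)) {
                perror("mbind");
                return EXIT_FAILURE;
        }

        memset(buf, 0, len);    /* fault the pages in on the bound node */
        printf("bound %zu bytes to node mask 0x%lx\n", len, nodemask);
        return EXIT_SUCCESS;
}

None of this helps with memory that only exists behind a
device-specific ioctl, which is rather the point of the discussion.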