On 05.10.20 19:16, Roman Gushchin wrote:
> On Mon, Oct 05, 2020 at 11:03:56AM -0400, Zi Yan wrote:
>> On 2 Oct 2020, at 4:30, David Hildenbrand wrote:
>>
>>> On 02.10.20 10:10, Michal Hocko wrote:
>>>> On Fri 02-10-20 09:50:02, David Hildenbrand wrote:
>>>>>>>> - huge page sizes controllable by the userspace?
>>>>>>>
>>>>>>> It might be good to allow advanced users to choose the page sizes, so they
>>>>>>> have better control of their applications.
>>>>>>
>>>>>> Could you elaborate more? Those advanced users can use hugetlb, right?
>>>>>> They get a very good control over page size and pool preallocation etc.
>>>>>> So they can get what they need - assuming there is enough memory.
>>>>>>
>>>>>
>>>>> I am still not convinced that 1G THP (TGP :) ) are really what we want
>>>>> to support. I can understand that there are some use cases that might
>>>>> benefit from it, especially:
>>>>
>>>> Well, I would say that internal support for larger huge pages (e.g. 1GB)
>>>> that can transparently split under memory pressure is a useful
>>>> functionality. I cannot really judge how complex that would be
>>>
>>> Right, but that's then something different than serving (scarce,
>>> unmovable) gigantic pages from CMA / reserved hugetlbfs pool. Nothing
>>> wrong about *real* THP support, meaning, e.g., grouping consecutive
>>> pages and converting them back and forth on demand. (E.g., 1GB ->
>>> multiple 2MB -> multiple single pages), for example, when having to
>>> migrate such a gigantic page. But that's very different from our
>>> existing gigantic page code as far as I can tell.
>>
>> Serving 1GB PUD THPs from CMA is a compromise, since we do not want to
>> bump MAX_ORDER to 20 to enable 1GB page allocation in buddy allocator,
>> which needs section size increase. In addition, unmoveable pages cannot
>> be allocated in CMA, so allocating 1GB pages has much higher chance from
>> it than from ZONE_NORMAL.
>
> s/higher chances/non-zero chances

Well, the longer the system runs (and consumes a significant amount of
available main memory), the less likely it is.

>
> Currently we have nothing that prevents the fragmentation of the memory
> with unmovable pages on the 1GB scale. It means that in a common case
> it's highly unlikely to find a continuous GB without any unmovable page.
> As now CMA seems to be the only working option.
>

And I completely dislike the use of CMA in this context (for example,
allocating via CMA and freeing via the buddy by patching CMA when
splitting up PUDs ...).

> However it seems there are other use cases for the allocation of continuous
> 1GB pages: e.g. secretfd ( https://lwn.net/Articles/831628/ ), where using
> 1GB pages can reduce the fragmentation of the direct mapping.

Yes, see RFC v1 where I already cced Mike.

>
> So I wonder if we need a new mechanism to avoid fragmentation on 1GB/PUD scale.
> E.g. something like a second level of pageblocks. That would allow to group
> all unmovable memory in few 1GB blocks and have more 1GB regions available for
> gigantic THPs and other use cases. I'm looking now into how it can be done.

Anything bigger than sections is somewhat problematic: you have to track
that data somewhere. It cannot be the section (in contrast to pageblocks)

> If anybody has any ideas here, I'll appreciate a lot.

I already brought up the idea of ZONE_PREFER_MOVABLE (see RFC v1). That
somewhat mimics what CMA does (when sized reasonably), works well with
memory hot(un)plug, and is immune to misconfiguration. Within such a
zone, we can try to optimize the placement of larger blocks.

-- 
Thanks,

David / dhildenb