On Mon, Oct 05, 2020 at 11:03:56AM -0400, Zi Yan wrote:
> On 2 Oct 2020, at 4:30, David Hildenbrand wrote:
>
> > On 02.10.20 10:10, Michal Hocko wrote:
> >> On Fri 02-10-20 09:50:02, David Hildenbrand wrote:
> >>>>>> - huge page sizes controllable by the userspace?
> >>>>>
> >>>>> It might be good to allow advanced users to choose the page sizes, so they
> >>>>> have better control of their applications.
> >>>>
> >>>> Could you elaborate more? Those advanced users can use hugetlb, right?
> >>>> They get a very good control over page size and pool preallocation etc.
> >>>> So they can get what they need - assuming there is enough memory.
> >>>>
> >>>
> >>> I am still not convinced that 1G THP (TGP :) ) are really what we want
> >>> to support. I can understand that there are some use cases that might
> >>> benefit from it, especially:
> >>
> >> Well, I would say that internal support for larger huge pages (e.g. 1GB)
> >> that can transparently split under memory pressure is a useful
> >> functionality. I cannot really judge how complex that would be
> >
> > Right, but that's then something different than serving (scarce,
> > unmovable) gigantic pages from CMA / reserved hugetlbfs pool. Nothing
> > wrong about *real* THP support, meaning, e.g., grouping consecutive
> > pages and converting them back and forth on demand. (E.g., 1GB ->
> > multiple 2MB -> multiple single pages), for example, when having to
> > migrate such a gigantic page. But that's very different from our
> > existing gigantic page code as far as I can tell.
>
> Serving 1GB PUD THPs from CMA is a compromise, since we do not want to
> bump MAX_ORDER to 20 to enable 1GB page allocation in buddy allocator,
> which needs section size increase. In addition, unmoveable pages cannot
> be allocated in CMA, so allocating 1GB pages has much higher chance from
> it than from ZONE_NORMAL.

s/higher chances/non-zero chances

Currently we have nothing that prevents the fragmentation of memory by
unmovable pages at the 1GB scale. This means that in the common case it's
highly unlikely to find a contiguous GB without any unmovable page in it.
As of now, CMA seems to be the only working option.

However, it seems there are other use cases for allocating contiguous
1GB pages: e.g. secretfd (https://lwn.net/Articles/831628/), where using
1GB pages can reduce the fragmentation of the direct mapping.

So I wonder if we need a new mechanism to avoid fragmentation at the
1GB/PUD scale, e.g. something like a second level of pageblocks. That
would allow us to group all unmovable memory in a few 1GB blocks and
keep more 1GB regions available for gigantic THPs and other use cases.
I'm now looking into how it can be done. If anybody has any ideas here,
I'd appreciate them a lot.

Thanks!
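
P.S. To make the "second level of pageblocks" idea slightly more
concrete, below is a tiny userspace-only model of the bookkeeping it
would need. It's just a sketch: all names (gigablock, gb_map,
gb_for_unmovable) are made up for illustration and nothing like this
exists in the kernel today. The point is only that each 1GB block gets
a type, unmovable allocations are confined to a few blocks explicitly
claimed for them, and the remaining blocks stay clean, so they can
still back gigantic allocations.

/*
 * Illustrative model only: track a coarse "migratetype" per 1GB block
 * and steer unmovable allocations into a small number of such blocks.
 */
#include <stdio.h>

#define GIGABLOCK_SHIFT    30   /* 1GB granularity */
#define NR_GIGABLOCKS      64   /* model a 64GB machine */

enum gb_type { GB_FREE, GB_MOVABLE, GB_UNMOVABLE };

static enum gb_type gb_map[NR_GIGABLOCKS];

/*
 * Pick a gigablock for an unmovable allocation: reuse an already
 * claimed unmovable block if one exists, otherwise claim a single
 * free block. Movable allocations would not be restricted this way.
 */
static long gb_for_unmovable(void)
{
        long gb, free_gb = -1;

        for (gb = 0; gb < NR_GIGABLOCKS; gb++) {
                if (gb_map[gb] == GB_UNMOVABLE)
                        return gb;
                if (gb_map[gb] == GB_FREE && free_gb < 0)
                        free_gb = gb;
        }
        if (free_gb >= 0)
                gb_map[free_gb] = GB_UNMOVABLE;
        return free_gb;
}

int main(void)
{
        long gb = gb_for_unmovable();
        long clean = 0;

        for (long i = 0; i < NR_GIGABLOCKS; i++)
                if (gb_map[i] != GB_UNMOVABLE)
                        clean++;

        printf("unmovable pages confined to gigablock %ld, "
               "%ld gigablocks left clean for 1GB allocations\n",
               gb, clean);
        return 0;
}

In real code this would of course have to hook into the allocator's
fallback and stealing paths, roughly like pageblock migratetypes
already do at pageblock granularity, rather than being a standalone
lookup as above.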