On 05.10.20 20:25, Roman Gushchin wrote:
> On Mon, Oct 05, 2020 at 07:27:47PM +0200, David Hildenbrand wrote:
>> On 05.10.20 19:16, Roman Gushchin wrote:
>>> On Mon, Oct 05, 2020 at 11:03:56AM -0400, Zi Yan wrote:
>>>> On 2 Oct 2020, at 4:30, David Hildenbrand wrote:
>>>>
>>>>> On 02.10.20 10:10, Michal Hocko wrote:
>>>>>> On Fri 02-10-20 09:50:02, David Hildenbrand wrote:
>>>>>>>>>> - huge page sizes controllable by the userspace?
>>>>>>>>>
>>>>>>>>> It might be good to allow advanced users to choose the page sizes, so they
>>>>>>>>> have better control of their applications.
>>>>>>>>
>>>>>>>> Could you elaborate more? Those advanced users can use hugetlb, right?
>>>>>>>> They get a very good control over page size and pool preallocation etc.
>>>>>>>> So they can get what they need - assuming there is enough memory.
>>>>>>>>
>>>>>>>
>>>>>>> I am still not convinced that 1G THP (TGP :) ) are really what we want
>>>>>>> to support. I can understand that there are some use cases that might
>>>>>>> benefit from it, especially:
>>>>>>
>>>>>> Well, I would say that internal support for larger huge pages (e.g. 1GB)
>>>>>> that can transparently split under memory pressure is a useful
>>>>>> functionality. I cannot really judge how complex that would be
>>>>>
>>>>> Right, but that's then something different than serving (scarce,
>>>>> unmovable) gigantic pages from CMA / reserved hugetlbfs pool. Nothing
>>>>> wrong about *real* THP support, meaning, e.g., grouping consecutive
>>>>> pages and converting them back and forth on demand. (E.g., 1GB ->
>>>>> multiple 2MB -> multiple single pages), for example, when having to
>>>>> migrate such a gigantic page. But that's very different from our
>>>>> existing gigantic page code as far as I can tell.
>>>>
>>>> Serving 1GB PUD THPs from CMA is a compromise, since we do not want to
>>>> bump MAX_ORDER to 20 to enable 1GB page allocation in buddy allocator,
>>>> which needs section size increase.
>>>> In addition, unmovable pages cannot
>>>> be allocated in CMA, so allocating 1GB pages has a much higher chance of
>>>> succeeding from it than from ZONE_NORMAL.
>>>
>>> s/higher chances/non-zero chances
>>
>> Well, the longer the system runs (and consumes a significant amount of
>> available main memory), the less likely it is.
>>
>>>
>>> Currently we have nothing that prevents the fragmentation of the memory
>>> with unmovable pages on the 1GB scale. It means that in a common case
>>> it's highly unlikely to find a contiguous GB without any unmovable page.
>>> As of now, CMA seems to be the only working option.
>>>
>>
>> And I completely dislike the use of CMA in this context (for example,
>> allocating via CMA and freeing via the buddy by patching CMA when
>> splitting up PUDs ...).
>>
>>> However it seems there are other use cases for the allocation of contiguous
>>> 1GB pages: e.g. secretfd ( https://lwn.net/Articles/831628/ ), where using
>>> 1GB pages can reduce the fragmentation of the direct mapping.
>>
>> Yes, see RFC v1 where I already cced Mike.
>>
>>>
>>> So I wonder if we need a new mechanism to avoid fragmentation on 1GB/PUD scale.
>>> E.g. something like a second level of pageblocks. That would allow grouping
>>> all unmovable memory in a few 1GB blocks and have more 1GB regions available for
>>> gigantic THPs and other use cases. I'm looking now into how it can be done.
>>
>> Anything bigger than sections is somewhat problematic: you have to track
>> that data somewhere. It cannot be the section (in contrast to pageblocks)
>
> Well, it's not a large amount of data: the number of 1GB regions is not that
> high even on very large machines.

Yes, but then you can have very sparse systems.
And some use cases would
actually want to avoid fragmentation on smaller levels (e.g., 128MB) -
optimizing memory efficiency by turning off banks and such ...

>
>>
>>> If anybody has any ideas here, I'd appreciate it a lot.
>>
>> I already brought up the idea of ZONE_PREFER_MOVABLE (see RFC v1). That
>> somewhat mimics what CMA does (when sized reasonably), works well with
>> memory hot(un)plug, and is immune to misconfiguration. Within such a
>> zone, we can try to optimize the placement of larger blocks.
>
> Thank you for pointing at it!
>
> The main problem with it is the same as with ZONE_MOVABLE: it does require
> a boot-time educated guess on a good size. I admit that the CMA does too.

"Educated guess" of ratios like 1:1, 1:2, and even 1:4 (known from
highmem times) are usually perfectly fine. And if you mess up - in
comparison to CMA - you won't shoot yourself in the foot, you get fewer
gigantic pages - which is usually better than before. I consider that a
clear win. Perfect? No. Can we be perfect? Unlikely.

In comparison to CMA / ZONE_MOVABLE, a bad guess won't cause instabilities.

-- 
Thanks,

David / dhildenb
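
A rough back-of-the-envelope sketch of the tracking-cost exchange above (not
part of the original mail). It counts how many regions a flat, address-indexed
table would have to cover. The one-byte-per-region cost, the 4 TiB and 64 TiB
figures, and the function names are illustrative assumptions, not kernel
structures or measurements.

```python
GIB = 1 << 30
MIB = 1 << 20
TIB = 1 << 40


def region_count(span_bytes, region_bytes=GIB):
    """Number of region-sized slots needed to cover a physical address span."""
    return -(-span_bytes // region_bytes)  # ceiling division


def tracking_bytes(span_bytes, region_bytes=GIB, bytes_per_region=1):
    """Size of a flat per-region table (assumed: one byte of state per region)."""
    return region_count(span_bytes, region_bytes) * bytes_per_region


# Dense machine (assumed): 4 TiB of RAM in one contiguous address span.
dense = tracking_bytes(4 * TIB)

# Sparse machine (assumed): the same RAM scattered over a 64 TiB address span.
# A flat table must cover the span, not just the populated memory.
sparse = tracking_bytes(64 * TIB)

# Finer granularity (assumed): tracking 128 MiB regions on the dense machine.
fine = tracking_bytes(4 * TIB, region_bytes=128 * MIB)

print(dense, sparse, fine)
```

Under these assumptions the table stays small in absolute terms either way; what
changes is that a sparse layout or a finer granularity multiplies the number of
entries, which matters more once each entry carries real state (lists, counters)
rather than a single byte.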