On 05.10.20 21:11, Roman Gushchin wrote:
> On Mon, Oct 05, 2020 at 08:33:44PM +0200, David Hildenbrand wrote:
>> On 05.10.20 20:25, Roman Gushchin wrote:
>>> On Mon, Oct 05, 2020 at 07:27:47PM +0200, David Hildenbrand wrote:
>>>> On 05.10.20 19:16, Roman Gushchin wrote:
>>>>> On Mon, Oct 05, 2020 at 11:03:56AM -0400, Zi Yan wrote:
>>>>>> On 2 Oct 2020, at 4:30, David Hildenbrand wrote:
>>>>>>
>>>>>>> On 02.10.20 10:10, Michal Hocko wrote:
>>>>>>>> On Fri 02-10-20 09:50:02, David Hildenbrand wrote:
>>>>>>>>>>>> - huge page sizes controllable by the userspace?
>>>>>>>>>>>
>>>>>>>>>>> It might be good to allow advanced users to choose the page sizes, so they
>>>>>>>>>>> have better control of their applications.
>>>>>>>>>>
>>>>>>>>>> Could you elaborate more? Those advanced users can use hugetlb, right?
>>>>>>>>>> They get a very good control over page size and pool preallocation etc.
>>>>>>>>>> So they can get what they need - assuming there is enough memory.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I am still not convinced that 1G THP (TGP :) ) are really what we want
>>>>>>>>> to support. I can understand that there are some use cases that might
>>>>>>>>> benefit from it, especially:
>>>>>>>>
>>>>>>>> Well, I would say that internal support for larger huge pages (e.g. 1GB)
>>>>>>>> that can transparently split under memory pressure is a useful
>>>>>>>> functionality. I cannot really judge how complex that would be
>>>>>>>
>>>>>>> Right, but that's then something different than serving (scarce,
>>>>>>> unmovable) gigantic pages from CMA / reserved hugetlbfs pool. Nothing
>>>>>>> wrong about *real* THP support, meaning, e.g., grouping consecutive
>>>>>>> pages and converting them back and forth on demand. (E.g., 1GB ->
>>>>>>> multiple 2MB -> multiple single pages), for example, when having to
>>>>>>> migrate such a gigantic page. But that's very different from our
>>>>>>> existing gigantic page code as far as I can tell.
>>>>>>
>>>>>> Serving 1GB PUD THPs from CMA is a compromise, since we do not want to
>>>>>> bump MAX_ORDER to 20 to enable 1GB page allocation in buddy allocator,
>>>>>> which needs section size increase. In addition, unmovable pages cannot
>>>>>> be allocated in CMA, so allocating 1GB pages has much higher chance from
>>>>>> it than from ZONE_NORMAL.
>>>>>
>>>>> s/higher chances/non-zero chances
>>>>
>>>> Well, the longer the system runs (and consumes a significant amount of
>>>> available main memory), the less likely it is.
>>>>
>>>>>
>>>>> Currently we have nothing that prevents the fragmentation of the memory
>>>>> with unmovable pages on the 1GB scale. It means that in a common case
>>>>> it's highly unlikely to find a continuous GB without any unmovable page.
>>>>> As of now CMA seems to be the only working option.
>>>>>
>>>>
>>>> And I completely dislike the use of CMA in this context (for example,
>>>> allocating via CMA and freeing via the buddy by patching CMA when
>>>> splitting up PUDs ...).
>>>>
>>>>> However it seems there are other use cases for the allocation of continuous
>>>>> 1GB pages: e.g. secretfd ( https://lwn.net/Articles/831628/ ), where using
>>>>> 1GB pages can reduce the fragmentation of the direct mapping.
>>>>
>>>> Yes, see RFC v1 where I already cced Mike.
>>>>
>>>>>
>>>>> So I wonder if we need a new mechanism to avoid fragmentation on 1GB/PUD scale.
>>>>> E.g. something like a second level of pageblocks. That would allow to group
>>>>> all unmovable memory in few 1GB blocks and have more 1GB regions available for
>>>>> gigantic THPs and other use cases. I'm looking now into how it can be done.
>>>>
>>>> Anything bigger than sections is somewhat problematic: you have to track
>>>> that data somewhere.
>>>> It cannot be the section (in contrast to pageblocks)
>>>
>>> Well, it's not a large amount of data: the number of 1GB regions is not that
>>> high even on very large machines.
>>
>> Yes, but then you can have very sparse systems. And some use cases would
>> actually want to avoid fragmentation on smaller levels (e.g., 128MB) -
>> optimizing memory efficiency by turning off banks and such ...
>
> It's definitely a good question.

Oh, and I forgot that there might be users that want bigger granularity
:) (primarily, memory hotunplug that wants to avoid ZONE_MOVABLE but
still have higher chances to eventually unplug some memory)

>
>>>
>>>>
>>>>> If anybody has any ideas here, I'll appreciate it a lot.
>>>>
>>>> I already brought up the idea of ZONE_PREFER_MOVABLE (see RFC v1). That
>>>> somewhat mimics what CMA does (when sized reasonably), works well with
>>>> memory hot(un)plug, and is immune to misconfiguration. Within such a
>>>> zone, we can try to optimize the placement of larger blocks.
>>>
>>> Thank you for pointing at it!
>>>
>>> The main problem with it is the same as with ZONE_MOVABLE: it does require
>>> a boot-time educated guess on a good size. I admit that the CMA does too.
>>
>> "Educated guess" of ratios like 1:1, 1:2, and even 1:4 (known from
>> highmem times) are usually perfectly fine. And if you mess up - in
>> comparison to CMA - you won't shoot yourself in the foot, you get fewer
>> gigantic pages - which is usually better than before. I consider that a
>> clear win. Perfect? No. Can we be perfect? Unlikely.
>
> I'm not necessarily opposing your idea, I just think it will be tricky
> to not introduce an additional overhead if the ratio is not perfectly
> chosen. And there is simply a cost of adding a zone.

Not sure this will be really visible - and if your kernel requires more
than 20%..50% unmovable data then something is usually really
fishy/special. The nice thing is that Linux will try to "auto-optimize"
within each zone already.
My gut feeling is that it's way easier to teach Linux (add zone, add
mmop_type, build zonelists, split memory similar to movablecore) -
however, that doesn't imply that it's better. We'll have to see.

>
> But fundamentally we're speaking about the same thing: grouping pages
> by their movability on a smaller scale. With a new zone we'll split
> pages into two parts with a fixed border, with new pageblock layer
> in 1GB blocks.

I also discussed moving the border on demand, which is way more tricky
and would definitely be stuff for the future.

There are some papers about similar fragmentation-avoidance techniques,
mostly in the context of energy efficiency IIRC. Especially:
- PALLOC: https://ieeexplore.ieee.org/document/6925999
- Adaptive-buddy:
  https://ieeexplore.ieee.org/document/7397629?reload=true&arnumber=7397629

IIRC, the problem with such approaches is that they are quite invasive
and degrade some workloads due to overhead.

>
> I think the agreement is that we need such functionality.

Yeah, on my long todo list. I'll be prototyping ZONE_PREFER_MOVABLE soon,
to see how it looks/feels/performs.

-- 
Thanks,

David / dhildenb