Re: [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64

David Hildenbrand <david@xxxxxxxxxx> · Tue, 6 Oct 2020 10:25:44 +0200

On 05.10.20 21:11, Roman Gushchin wrote:
> On Mon, Oct 05, 2020 at 08:33:44PM +0200, David Hildenbrand wrote:
>> On 05.10.20 20:25, Roman Gushchin wrote:
>>> On Mon, Oct 05, 2020 at 07:27:47PM +0200, David Hildenbrand wrote:
>>>> On 05.10.20 19:16, Roman Gushchin wrote:
>>>>> On Mon, Oct 05, 2020 at 11:03:56AM -0400, Zi Yan wrote:
>>>>>> On 2 Oct 2020, at 4:30, David Hildenbrand wrote:
>>>>>>
>>>>>>> On 02.10.20 10:10, Michal Hocko wrote:
>>>>>>>> On Fri 02-10-20 09:50:02, David Hildenbrand wrote:
>>>>>>>>>>>> - huge page sizes controllable by the userspace?
>>>>>>>>>>>
>>>>>>>>>>> It might be good to allow advanced users to choose the page sizes, so they
>>>>>>>>>>> have better control of their applications.
>>>>>>>>>>
>>>>>>>>>> Could you elaborate more? Those advanced users can use hugetlb, right?
>>>>>>>>>> They get a very good control over page size and pool preallocation etc.
>>>>>>>>>> So they can get what they need - assuming there is enough memory.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I am still not convinced that 1G THP (TGP :) ) are really what we want
>>>>>>>>> to support. I can understand that there are some use cases that might
>>>>>>>>> benefit from it, especially:
>>>>>>>>
>>>>>>>> Well, I would say that internal support for larger huge pages (e.g. 1GB)
>>>>>>>> that can transparently split under memory pressure is a useful
>>>>>>>> functionality. I cannot really judge how complex that would be
>>>>>>>
>>>>>>> Right, but that's then something different than serving (scarce,
>>>>>>> unmovable) gigantic pages from CMA / reserved hugetlbfs pool. Nothing
>>>>>>> wrong about *real* THP support, meaning, e.g., grouping consecutive
>>>>>>> pages and converting them back and forth on demand. (E.g., 1GB ->
>>>>>>> multiple 2MB -> multiple single pages), for example, when having to
>>>>>>> migrate such a gigantic page. But that's very different from our
>>>>>>> existing gigantic page code as far as I can tell.
>>>>>>
>>>>>> Serving 1GB PUD THPs from CMA is a compromise, since we do not want to
>>>>>> bump MAX_ORDER to 20 to enable 1GB page allocation in buddy allocator,
>>>>>> which needs section size increase. In addition, unmoveable pages cannot
>>>>>> be allocated in CMA, so allocating 1GB pages has much higher chance from
>>>>>> it than from ZONE_NORMAL.
>>>>>
>>>>> s/higher chances/non-zero chances
>>>>
>>>> Well, the longer the system runs (and consumes a significant amount of
>>>> available main memory), the less likely it is.
>>>>
>>>>>
>>>>> Currently we have nothing that prevents the fragmentation of the memory
>>>>> with unmovable pages on the 1GB scale. It means that in a common case
>>>>> it's highly unlikely to find a contiguous GB without any unmovable page.
>>>>> As of now, CMA seems to be the only working option.
>>>>>
>>>>
>>>> And I completely dislike the use of CMA in this context (for example,
>>>> allocating via CMA and freeing via the buddy by patching CMA when
>>>> splitting up PUDs ...).
>>>>
>>>>> However it seems there are other use cases for the allocation of contiguous
>>>>> 1GB pages: e.g. secretfd ( https://lwn.net/Articles/831628/ ), where using
>>>>> 1GB pages can reduce the fragmentation of the direct mapping.
>>>>
>>>> Yes, see RFC v1 where I already cced Mike.
>>>>
>>>>>
>>>>> So I wonder if we need a new mechanism to avoid fragmentation on 1GB/PUD scale.
>>>>> E.g. something like a second level of pageblocks. That would allow grouping
>>>>> all unmovable memory in a few 1GB blocks and leave more 1GB regions available for
>>>>> gigantic THPs and other use cases. I'm looking now into how it can be done.
>>>>
>>>> Anything bigger than sections is somewhat problematic: you have to track
>>>> that data somewhere. It cannot be the section (in contrast to pageblocks)
>>>
>>> Well, it's not a large amount of data: the number of 1GB regions is not that
>>> high even on very large machines.
>>
>> Yes, but then you can have very sparse systems. And some use cases would
>> actually want to avoid fragmentation on smaller levels (e.g., 128MB) -
>> optimizing memory efficiency by turning off banks and such ...
> 
> It's definitely a good question.

Oh, and I forgot that there might be users that want bigger granularity
:) (primarily, memory hotunplug that wants to avoid ZONE_MOVABLE but
still have higher chances to eventually unplug some memory)

> 
>>>
>>>>
>>>>> If anybody has any ideas here, I'd appreciate it a lot.
>>>>
>>>> I already brought up the idea of ZONE_PREFER_MOVABLE (see RFC v1). That
>>>> somewhat mimics what CMA does (when sized reasonably), works well with
>>>> memory hot(un)plug, and is immune to misconfiguration. Within such a
>>>> zone, we can try to optimize the placement of larger blocks.
>>>
>>> Thank you for pointing at it!
>>>
>>> The main problem with it is the same as with ZONE_MOVABLE: it does require
>>> a boot-time educated guess on a good size. I admit that the CMA does too.
>>
>> "Educated guess" of ratios like 1:1, 1:2, and even 1:4 (known from
>> highmem times) are usually perfectly fine. And if you mess up - in
>> comparison to CMA - you won't shoot yourself in the foot, you get fewer
>> gigantic pages - which is usually better than before. I consider that a
>> clear win. Perfect? No. Can we be perfect? Unlikely.
> 
> I'm not necessarily opposing your idea, I just think it will be tricky
> to not introduce additional overhead if the ratio is not perfectly
> chosen. And there is simply a cost of adding a zone.

Not sure this will be really visible - and if your kernel requires more
than 20%..50% unmovable data then something is usually really
fishy/special. The nice thing is that Linux will try to "auto-optimize"
within each zone already.

My gut feeling is that it's way easier to teach Linux (add zone, add
mmop_type, build zonelists, split memory similar to movablecore) -
however, that doesn't imply that it's better. We'll have to see.
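
To put numbers on the ratios mentioned above, here is a back-of-the-envelope
sketch. It is not taken from any patch: the helper name, the 64 GiB machine and
the rounding of the border down to a whole GiB are all assumptions made for
illustration. A kernel:movable ratio of 1:N leaves 1/(N+1) of memory for the
zone that may hold unmovable data, i.e. 50% for 1:1 and 20% for 1:4.

```python
GIB = 1 << 30

def split_by_ratio(total_bytes, movable_parts, kernel_parts=1):
    """Split memory kernel_parts:movable_parts (hypothetical helper).

    Returns (kernel_bytes, movable_bytes). The kernel share is rounded
    down to a whole GiB so the border sits on a 1GB boundary; that
    rounding is an assumption of this sketch, not an existing policy.
    """
    share = total_bytes * kernel_parts // (kernel_parts + movable_parts)
    kernel_bytes = share // GIB * GIB
    return kernel_bytes, total_bytes - kernel_bytes

# Illustrative 64 GiB machine with the ratios 1:1, 1:2 and 1:4.
for n in (1, 2, 4):
    kernel, movable = split_by_ratio(64 * GIB, n)
    print(f"1:{n} -> {kernel // GIB} GiB kernel, "
          f"{movable // GIB} GiB prefer-movable")
```

Guessing too small a kernel share in this model only shrinks the region
available for gigantic pages, which matches the "fewer gigantic pages" outcome
described earlier in the thread.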

> 
> But fundamentally we're speaking about the same thing: grouping pages
> by their movability on a smaller scale. With a new zone we'll split
> pages into two parts with a fixed border, with a new pageblock layer
> in 1GB blocks.

I also discussed moving the border on demand, which is way more tricky
and would definitely be stuff for the future.
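
Why grouping on the 1GB scale matters can be shown with a toy model. This is
only a sketch under stated assumptions: 4 KiB base pages, a hypothetical
function name, and a made-up 16 GB machine with 16 unmovable pages. A 1GB
region is counted as usable for a gigantic page only if it contains no
unmovable page at all.

```python
PAGE_SIZE = 4096
PAGES_PER_GB = (1 << 30) // PAGE_SIZE  # 262144 base pages per 1GB region

def free_gb_regions(total_gb, unmovable_pfns):
    """Count 1GB regions that contain no unmovable page (toy model).

    unmovable_pfns holds page frame numbers, assumed to lie within
    the first total_gb gigabytes of memory.
    """
    tainted = {pfn // PAGES_PER_GB for pfn in unmovable_pfns}
    return total_gb - len(tainted)

# 16 unmovable pages, one in each 1GB region: every region is tainted.
scattered = [region * PAGES_PER_GB + 123 for region in range(16)]
# The same number of unmovable pages, all grouped into region 0.
grouped = list(range(16))

print(free_gb_regions(16, scattered))  # 0
print(free_gb_regions(16, grouped))    # 15
```

The same 64 KiB of unmovable data either blocks every 1GB region or only one,
depending purely on placement. Both a separate zone and a second pageblock
layer aim at the second case.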

There are some papers about similar fragmentation-avoidance techniques,
mostly in the context of energy efficiency IIRC. Especially:
- PALLOC: https://ieeexplore.ieee.org/document/6925999
- Adaptive-buddy:
https://ieeexplore.ieee.org/document/7397629?reload=true&arnumber=7397629

IIRC, the problem with such approaches is that they are quite invasive
and degrade some workloads due to overhead.

> 
> I think the agreement is that we need such functionality.

Yeah, on my long todo list. I'll be prototyping ZONE_PREFER_MOVABLE
soon, to see how it looks/feels/performs.

-- 
Thanks,

David / dhildenb