[...]

> I guess the question is if pressuring the guest to compact the memory
> to create more THP pages would add value versus letting the pressure
> from the inflation cause more potential fragmentation.

Would be interesting to see some actual numbers. Right now, it's just
speculation. I know that there are ideas to do proactive compaction,
maybe that has a similar effect.

[...]

>>>> There was some work on huge page ballooning in a paper I read. But
>>>> once the guest is out of huge pages to report, it would want to fall
>>>> back to smaller granularity (down to 4k, to create real memory
>>>> pressure), where you would end up in the very same situation you are
>>>> in right now. So it's - IMHO - only of limited use.
>>>
>>> I wouldn't think it would be that limited of a use case. By having the
>>> balloon inflate with higher-order pages you should be able to put more
>>> pressure on the guest to compact the memory and reduce fragmentation
>>> instead of increasing it. If you have the balloon flushing out the
>>> lower-order pages it is sitting on when there is pressure, it seems
>>> like it would be more likely to reduce fragmentation further.
>>
>> As we have balloon compaction in place and balloon pages are movable, I
>> guess fragmentation is not really an issue.
>
> I'm not sure that is truly the case. My concern is that by allocating
> the 4K pages we are breaking up the higher-order pages and we aren't
> necessarily guaranteed to obtain all pieces of the higher-order page
> when we break it up. As a result we could end up causing the THP pages
> to be broken up and scattered between the balloon and other consumers.

We are allocating movable memory. We should be working on/creating
movable pageblocks. Yes, other concurrent allocations can race with the
allocation. But after all, they are likely movable as well (because they
are allocating from a movable pageblock) and we do have compaction in
place.
There are corner cases, but I don't think they are very relevant.

[...]

>> Especially page compaction/migration in the guest might be tricky.
>> AFAIK it only works on order-0 pages. E.g., whenever you allocate a
>> higher-order page in the guest and report it to your hypervisor, you
>> want to split it into separate order-0 pages before adding them to the
>> balloon list. Otherwise, you won't be able to tag them as movable and
>> handle them via the existing balloon compaction framework - and that
>> would be a major step backwards, because you would be heavily
>> fragmenting your guest (and even turning MAX_ORDER - 1 pages into
>> unmovable pages means that memory offlining/alloc_contig_range() users
>> won't be able to move such pages around anymore).
>
> Yes, from what I can tell compaction will not touch anything that is
> pageblock size or larger. I am not sure if that is an issue or not.
>
> Migration is a bit of a different story. It looks like there is logic
> in place for migrating huge and transparent huge pages, but not
> higher-order pages. I'll have to take a look through the code some more
> to see just how difficult it would be to support migrating a 2M page. I
> can probably make it work if I just configure it as a transparent huge
> page with the appropriate flags and bits in the page being set.

Note: With virtio-balloon you actually don't necessarily want to migrate
the higher-order page. E.g., migrating a higher-order page might fail
because there is no migration target available. Instead, you would want
to "migrate" it to multiple smaller pieces. This is especially
interesting for alloc_contig_range() users - something the current 4k
pages can handle just nicely.

>
>> But then, balloon compaction will result in single 4k pages getting
>> moved and deflated+inflated. Once you have order-0 pages in your list,
>> deflating higher-order pages becomes trickier.
>
> I probably wouldn't want to maintain them as individual lists.
> In my
> mind it would make more sense to have two separate lists with separate
> handlers for each. Then, in the event of something such as a deflate,
> we could choose what we free based on the number of pages we need to
> free. That would allow us to deflate the balloon quicker in the case of
> a low-memory condition, which should improve our responsiveness. In
> addition, with the driver sitting on a reserve of higher-order pages it
> could help to alleviate fragmentation in such a case as well, since it
> could release larger contiguous blocks of memory.
>
>> E.g., have a look at the vmware balloon (drivers/misc/vmw_balloon.c).
>> It will allocate either 4k or 2MB pages, but won't be able to handle
>> them for balloon compaction. They don't even bother with other
>> granularities.
>>
>> Long story short: Inflating higher-order pages could be good for
>> inflation performance in some setups, but I think you'll have to fall
>> back to lower-order allocations + balloon compaction on 4k.
>
> I'm not entirely sure that is the case. It seems like with a few tweaks
> we could look at doing something like splitting the balloon so that we
> have a 4K and a 2M balloon. At that point it would just be a matter of
> registering a pair of address space handlers so that the 2M balloons
> are handled correctly if there is a request to migrate their memory. As
> far as compaction goes, that is another story, since it looks like 2M
> pages will not be compacted.

I am not convinced that what you describe is a real issue that needs
such a solution. Maybe we can come up with numbers that prove this
(e.g., #THP, fragmentation, benchmark performance in your guest, etc.).

I'll try digging out that huge page ballooning for KVM paper, maybe that
is of some value.

-- 
Thanks,

David / dhildenb