On 11/10/2018 08:44 AM, Andrea Arcangeli wrote:
> On Sat, Nov 10, 2018 at 01:22:49PM +0000, Mel Gorman wrote:
>> On Fri, Nov 09, 2018 at 02:51:50PM -0500, Andrea Arcangeli wrote:
>>> And if you're in the camp that is concerned about the use of more RAM
>>> or/and about the higher latency of COW faults, I'm afraid the
>>> intermediate solution will still be slower than the already available
>>> MADV_NOHUGEPAGE or enabled=madvise.
>>>
>> Does that not prevent huge page usage? Maybe you can spell it out a bit
> Yes, it prevents huge page usage, but preventing huge page usage is
> also what is achieved with the reservation.
>
>> better. What is the set of system calls an application should make to
>> not use huge pages either for the address space or on a per-VMA basis
>> and defer to kcompactd? I know that can be tuned globally but that's not
>> quite the same thing given that multiple applications or containers can
>> be running with different requirements.
> Yes, in terms of inheritance that could be used to tune a container
> we only have PR_SET_THP_DISABLE, and that will render MADV_HUGEPAGE
> useless too, but then for microservices that should not be a
> concern. How to make those sysfs tunables namespace-aware is a
> separate issue, I think.
>
> The difference is that with the reservation they can be promoted over
> time; with MADV_NOHUGEPAGE they cannot become hugepages later, and
> khugepaged will not even scan that vma anymore.
>
> The benefit of the reservation will show up in those regions that will
> not become hugepages, so if you can predict beforehand that those
> ranges don't benefit from THP, it's better if userland calls
> madvise(MADV_NOHUGEPAGE) on the range; then there's no need to undo
> the reservation later during memory pressure.
>
> The reservation and promotion is a bit like auto-detecting when
> MADV_NOHUGEPAGE should be set, so it boils down to how much of a
> corner case that is.
>
> I'm not so concerned about the RAM wasted, because I don't think it's
> very significant; after all, the application can just do a smaller
> malloc if it wants to reduce memory usage.
>
> A massive amount of RAM waste is fairly rare, and in the extreme case
> RAM could still be wasted even with 4k if the app uses only 1 bit from
> every 4k page it allocates with malloc.
>
> I'm more concerned about cases where THP wastes CPU, like redis, which
> is hurt by the 2M COWs. redis will map all pages and they will all be
> promoted to THP also with the reservation logic applied, but
> when the parent writes to the memory (after fork) it must trigger 4k
> COWs (not 2M COWs) and in turn split the THP before the COW, or it
> won't work as fast as with THP disabled. In addition we should try to
> reuse the same IPI for the transhuge pmd split to cover the COW too.
>
> If we add the reservation and that work makes zero difference for the
> redis corner case, and redis must still use MADV_NOHUGEPAGE, it's not
> great in my view. It looks like we're trying to optimize issues that
> are less critical.
>
> The redis+THP case should be possible to optimize later with the uffd
> WP model (once completed; Peter Xu is working on it), and uffd WP will
> also remove the fork() and convert it to a clone(). The granularity
> of the fault is decided by userland that way, so when uffd
> wrprotects a 4k fragment of a THP, the THP will be split during the
> uffd mprotect ioctl.
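
Just to make sure I have the userspace side of the above straight: as I
understand it, the per-range and per-process controls would be used
roughly like this. This is only an illustrative sketch (the 64M arena
is made up for the example and error checking is omitted):

#include <stdlib.h>
#include <sys/mman.h>
#include <sys/prctl.h>

int main(void)
{
	size_t len = 64UL << 20;		/* hypothetical 64M arena */
	void *arena = aligned_alloc(2UL << 20, len);

	/*
	 * Per-VMA: opt a known-sparse range out of THP up front.  After
	 * this, khugepaged will not scan it and there would be no
	 * reservation to undo later under memory pressure.
	 */
	madvise(arena, len, MADV_NOHUGEPAGE);

	/*
	 * Per-process: inherited across fork()/execve(), so a container
	 * launcher can set it once, but it also defeats MADV_HUGEPAGE.
	 */
	prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0);

	return 0;
}

That is, the madvise() covers just the one range, while the prctl() is
the only inheritable knob, at the cost Andrea notes of also rendering
MADV_HUGEPAGE useless.
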
>
>>> Now about the implementation: the whole point of the reservation
>>> complexity is to skip the khugepaged copy, so it can collapse in
>>> place. Is skipping the copy worth it? Isn't the big cost the IPI
>>> anyway, to avoid leaving two simultaneous TLB mappings of different
>>> granularity?
>>>
>> Not necessarily. With THP anon in the simple case, it might be just a
>> single thread and kcompactd so that's one IPI (kcompactd flushes local and
>> one IPI to the CPU the thread was running on, assuming it's not migrating
>> excessively). It would scale up with the number of threads but I suspect
>> the main cost is the actual copying, page table manipulation and the
>> locking required.
> Agreed, the IPI wouldn't be a concern for a single threaded app. I was
> looking more at the worst case scenario. For a single threaded app the
> locking should not be too bad either.
>
>> As an aside, a universal benefit would be looking at reducing the time
>> to allocate the necessary huge page as we know that can be excessive. It
>> would be orthogonal to this series.
> With what I suggested the allocation would happen as usual in
> khugepaged at a slow pace, without holding locks. So I don't see
> obvious disadvantages in terms of THP allocation latency.
>
>> Could you and Kirill outline what sort of workloads you would consider
>> acceptable for evaluating this series? One would assume it covers at
>> least the following, potentially with a number of workloads.
> I would prefer to add intelligence to detect when COWs after fork
> should be done at 2M or 4k granularity (in the latter case by
> splitting the pmd before the actual COW while leaving the transhuge
> pmd intact in the other mm), because that would save CPU (and it'd
> automatically optimize redis). The snapshot process especially would
> run faster as it would read with THP performance.

And presumably, to maintain the performance benefit in subsequent
snapshots, the original split PMD would need to be re-promoted prior to
forking, or promoted in the child during fork? (I've appended a rough
sketch of how I read the 4k-COW split idea below the quoted text, to
check my understanding.)

>
> I'm more worried about ensuring THP doesn't cause more CPU usage, like
> it does in the COW case above, than about just trying to save RAM when
> the virtual ranges are only partially utilized by the app.
>
>> 1. Evaluate the collapse and copying costs (probing the entire time
>> spent in collapse_huge_page might do it)
>> 2. Evaluate mmap_sem hold time during hugepage collapse
>> 3. Estimate excessive RAM use due to unnecessary THP usage
>> 4. Estimate the slowdown due to delayed THP usage
>>
>> 1 and 2 would indicate how much time is lost due to not using
>> reservations. That potentially goes in the direction of simply making
>> this faster -- fragmentation reduction (posted but unreviewed), faster
>> compaction searches, better page isolation during compaction to
>> avoid free pages being reused before an order-9 is free.
>>
>> 3 should be straightforward but 4 would be the hardest to evaluate
>> because it would have to be determined whether 4 is offset by
>> improvements to 1-3. If 1-3 is improved enough, it might remove the
>> motivation for the series entirely.
>>
>> In other words, if we agree on a workload in advance, it might move
>> this in the right direction and not accidentally throw Anthony down a
>> hole working on a series that never gets ack'd.
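
Here is that sketch. It is purely hypothetical and not against any real
tree: wp_huge_pmd_post_fork() and thp_shared_after_fork() are made-up
names standing in for wherever this would actually hook into the
huge-pmd write-fault path and for whatever check identifies the
post-fork shared case.

/*
 * Hypothetical sketch only, to illustrate the 4k-COW idea discussed
 * above; this is not real kernel code.  When a write fault hits a
 * transhuge pmd whose page is still shared with the other mm after
 * fork(), avoid the 2M copy: split only this mm's pmd and let the
 * fault be retried at pte granularity, so the regular 4k COW path
 * copies just the faulting subpage.  The other mm keeps its transhuge
 * pmd intact.
 */
static vm_fault_t wp_huge_pmd_post_fork(struct vm_fault *vmf, pmd_t orig_pmd,
					struct page *page)
{
	if (!thp_shared_after_fork(page))	/* made-up helper */
		return do_huge_pmd_wp_page(vmf, orig_pmd);

	/*
	 * Split this mm's pmd into 512 ptes mapping the same compound
	 * page; nothing is copied here.
	 */
	split_huge_pmd(vmf->vma, vmf->pmd, vmf->address);

	/*
	 * Fall back so the fault is retried through the pte path and
	 * COWs only the faulting 4k page.
	 */
	return VM_FAULT_FALLBACK;
}

If I'm reading Andrea correctly, the open questions would then be
batching the TLB flush so the pmd split and the first COW share the
same IPI, and (per my question above) how the split pmd gets promoted
again before or during the next snapshot.
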
>>
>> I'm not necessarily the best person to answer because my natural
>> inclination after the fragmentation series would be to keep using
>> thpfioscale (from the fragmentation avoidance series) and work on
>> improving the THP allocation success rates and reducing latencies.
>> I've tunnel vision on that for the moment.
> Deciding the workloads is a good question indeed, but I would also be
> curious how many of those pages would not end up being promoted
> with this logic.
>
> What's the number of pte_none that you require in each pmd to avoid
> promotion? If it's just 1 then apps will run slower; if there's
> partial utilization, THP already helps. I have a hard time thinking of
> an ideal ratio, which is why max_ptes_none is 511 after all.
>
> Can we start by counting the total number of pte_none() in all pmds
> that can fit a THP according to vma->vm_start/end? The pagetable
> dumper in debugfs may already provide the info we need by scanning all
> mms and by printing the number of "none" ptes that would generate
> "wasted" memory (and marginally wasted CPU during copy/clear).
>
> Then you can tell exactly how many pmds won't be promoted to transhuge
> pmds with the patch applied in real life workloads, even before
> running any benchmark. It'd be good to be sure we're talking about a
> significant number in real life workloads, or there's not much to
> optimize to begin with.
>
> If the amount of RAM saved is significant in real life workloads and
> in turn there's a chance of a worthwhile tradeoff from the
> reservation logic, then we can do the benchmarks, because the behavior
> will be different for the page fault and it'll end up running slower
> with the reservation logic.

Thank you, Andrea and Mel, for the feedback. I really appreciate it.

I'm going to proceed as suggested and evaluate the huge page collapse
and copy costs and perform more analysis on the potential RAM savings.
(A rough sketch of the pagemap-based accounting I have in mind for the
RAM-savings analysis is appended below the quoted sign-off.)

Anthony

>
> Thanks,
> Andrea
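
Here is the sketch mentioned above. It is userspace only and uses
/proc/<pid>/pagemap rather than the debugfs page-table dumper Andrea
mentions; the interface (a pid plus one hex address range, e.g. taken
from /proc/<pid>/maps) and the missing error handling are just to keep
the illustration short:

/*
 * For one virtual address range of a target process, count how many 4k
 * entries in each 2M-aligned chunk are unpopulated (neither present
 * nor swapped) according to /proc/<pid>/pagemap.  A partially populated
 * chunk is one where a collapse (or a reservation) would instantiate
 * "none" ptes' worth of extra RAM.
 */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <inttypes.h>

#define PAGE_SIZE	4096UL
#define HPAGE_SIZE	(2UL * 1024 * 1024)
#define PTES_PER_PMD	(HPAGE_SIZE / PAGE_SIZE)	/* 512 */

int main(int argc, char **argv)
{
	if (argc != 4) {
		fprintf(stderr, "usage: %s <pid> <start-hex> <end-hex>\n", argv[0]);
		return 1;
	}

	char path[64];
	snprintf(path, sizeof(path), "/proc/%s/pagemap", argv[1]);
	FILE *f = fopen(path, "r");
	uint64_t start = strtoull(argv[2], NULL, 16);
	uint64_t end = strtoull(argv[3], NULL, 16);

	/* Only 2M chunks fully contained in [start, end). */
	for (uint64_t hva = (start + HPAGE_SIZE - 1) & ~(HPAGE_SIZE - 1);
	     hva + HPAGE_SIZE <= end; hva += HPAGE_SIZE) {
		uint64_t entries[PTES_PER_PMD];
		unsigned long none = 0;

		/* One 64-bit pagemap entry per 4k page. */
		fseeko(f, (off_t)(hva / PAGE_SIZE * 8), SEEK_SET);
		if (fread(entries, 8, PTES_PER_PMD, f) != PTES_PER_PMD)
			break;

		for (unsigned int i = 0; i < PTES_PER_PMD; i++) {
			int present = (entries[i] >> 63) & 1;	/* bit 63 */
			int swapped = (entries[i] >> 62) & 1;	/* bit 62 */

			if (!present && !swapped)
				none++;
		}

		if (none && none < PTES_PER_PMD)
			printf("%#" PRIx64 ": %lu of %lu ptes none\n",
			       hva, none, PTES_PER_PMD);
	}

	fclose(f);
	return 0;
}

Running this over each anonymous range of a real workload should give
the per-pmd pte_none counts Andrea asked about, i.e. how many pmds sit
between "fully populated" and "empty" and how much RAM a blind collapse
of those would instantiate, before running any benchmark.
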