On Sat, Nov 10, 2018 at 01:22:49PM +0000, Mel Gorman wrote:
> On Fri, Nov 09, 2018 at 02:51:50PM -0500, Andrea Arcangeli wrote:
> > And if you're in the camp that is concerned about the use of more RAM
> > or/and about the higher latency of COW faults, I'm afraid the
> > intermediate solution will be still slower than the already available
> > MADV_NOHUGEPAGE or enabled=madvise.
> >
> Does that not prevent huge page usage? Maybe you can spell it out a
> bit better.

Yes, it prevents huge page usage, but preventing huge page usage is
also what is achieved with the reservation.

> What is the set of system calls an application should make to not use
> huge pages either for the address space or on a per-VMA basis and
> defer to kcompactd? I know that can be tuned globally but that's not
> quite the same thing given that multiple applications or containers
> can be running with different requirements.

In terms of inheritance that could be used to tune a container, we only
have PR_SET_THP_DISABLE, and that will render MADV_HUGEPAGE useless
too, but for microservices that should not be a concern. How to make
those sysfs tunables namespace-aware is a separate issue I think.

The difference is that with the reservation the ranges can still be
promoted over time, while with MADV_NOHUGEPAGE they can never become
hugepages later; khugepaged won't even scan that vma anymore.

The benefit of the reservation will show up in those regions that will
not become hugepages, so if you can predict beforehand that those
ranges don't benefit from THP, it's better if userland calls
madvise(MADV_NOHUGEPAGE) on the range; then there's no need to undo the
reservation later during memory pressure. The reservation and promotion
logic is a bit like auto-detecting when MADV_NOHUGEPAGE should have
been set, so it boils down to how much of a corner case that is.

I'm not so concerned about the RAM wasted, because I don't think it's
very significant: after all the application can just do a smaller
malloc if it wants to reduce memory usage. A massive amount of RAM
waste from huge pages is fairly rare, and in the extreme RAM could
still be wasted even with 4k pages if the app uses only 1 bit from
every 4k page it allocates with malloc.

I'm more concerned about cases where THP wastes CPU, like redis, which
is hurt by the 2M COWs. redis touches all pages, so they will all be
promoted to THP even with the reservation logic applied, but when the
parent writes to the memory (after fork) it must trigger 4k COWs (not
2M COWs), which in turn requires splitting the THP before the COW, or
it won't run as fast as with THP disabled. In addition we should try to
reuse the same IPI of the transhuge pmd split to cover the COW too.

If we add the reservation and that work makes zero difference for the
redis corner case, so that redis must still use MADV_NOHUGEPAGE, that's
not great in my view. It would look like we're trying to optimize
issues that are less critical.

The redis+THP case should be possible to optimize later with the uffd
WP model (once completed; Peter Xu is working on it), and uffd WP will
also remove the fork() and convert it to a clone(). The granularity of
the fault is decided by userland that way, so when uffd write-protects
a 4k fragment of a THP, the THP will be split during the uffd
write-protect ioctl.
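To make that model concrete, here is a minimal userspace sketch. It
assumes the write-protect interface of the in-progress series
(UFFD_FEATURE_PAGEFAULT_FLAG_WP, UFFDIO_REGISTER_MODE_WP and a
UFFDIO_WRITEPROTECT ioctl); the final names and semantics may well
differ once it lands, so treat it as a sketch only:

/*
 * Hypothetical sketch of the uffd WP model described above: register a
 * 2M THP-backed range for write-protect tracking, then write-protect a
 * single 4k fragment of it. Assumes the not-yet-merged uffd-wp API.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

static int wrprotect_4k_of_thp(void *thp_aligned_addr)
{
        int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
        struct uffdio_api api = {
                .api = UFFD_API,
                .features = UFFD_FEATURE_PAGEFAULT_FLAG_WP,
        };
        struct uffdio_register reg = {
                .range = {
                        .start = (unsigned long)thp_aligned_addr,
                        .len = 2UL << 20,       /* one 2M THP */
                },
                .mode = UFFDIO_REGISTER_MODE_WP,
        };
        /* wrprotecting just 4k is what forces the pmd split */
        struct uffdio_writeprotect wp = {
                .range = {
                        .start = (unsigned long)thp_aligned_addr,
                        .len = 4096,
                },
                .mode = UFFDIO_WRITEPROTECT_MODE_WP,
        };

        if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api) ||
            ioctl(uffd, UFFDIO_REGISTER, &reg) ||
            ioctl(uffd, UFFDIO_WRITEPROTECT, &wp)) {
                perror("uffd-wp");
                return -1;
        }
        return uffd;
}

The point is that userland picks the 4k write-protect granularity, and
it's the write-protect ioctl (not a COW fault after fork) that ends up
splitting the THP.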
> > Now about the implementation: the whole point of the reservation
> > complexity is to skip the khugepaged copy, so it can collapse in
> > place. Is skipping the copy worth it? Isn't the big cost the IPI
> > anyway to avoid leaving two simultaneous TLB mappings of different
> > granularity?
> >
> Not necessarily. With THP anon in the simple case, it might be just a
> single thread and kcompactd so that's one IPI (kcompactd flushes local
> and one IPI to the CPU the thread was running on assuming it's not
> migrating excessively). It would scale up with the number of threads
> but I suspect the main cost is the actual copying, page table
> manipulation and the locking required.

Agreed, the IPI wouldn't be a concern for a single-threaded app; I was
looking more at the worst case scenario. For a single-threaded app the
locking should not be too bad either.

> As an aside, a universal benefit would be looking at reducing the time
> to allocate the necessary huge page as we know that can be excessive.
> It would be orthogonal to this series.

With what I suggested the allocation would happen as usual in
khugepaged at a slow pace, without holding locks, so I don't see
obvious disadvantages in terms of THP allocation latency.

> Could you and Kirill outline what sort of workloads you would consider
> acceptable for evaluating this series? One would assume it covers at
> least the following, potentially with a number of workloads.

I would prefer to add intelligence to detect when COWs after fork
should be done at 2M or 4k granularity (in the latter case by splitting
the pmd before the actual COW while leaving the transhuge pmd intact in
the other mm), because that would save CPU (and it'd automatically
optimize redis). The snapshot process especially would run faster, as
it would read with THP performance.

I'm more worried about making sure THP doesn't cause extra CPU usage,
as it does in the COW case above, than about saving RAM when the
virtual ranges are only partially utilized by the app.

> 1. Evaluate the collapse and copying costs (probing the entire time
>    spent in collapse_huge_page might do it)
> 2. Evaluate mmap_sem hold time during hugepage collapse
> 3. Estimate excessive RAM use due to unnecessary THP usage
> 4. Estimate the slowdown due to delayed THP usage
>
> 1 and 2 would indicate how much time is lost due to not using
> reservations. That potentially goes in the direction of simply making
> this faster -- fragmentation reduction (posted but unreviewed), faster
> compaction searches, better page isolation during compaction to
> avoid free pages being reused before an order-9 is free.
>
> 3 should be straight-forward but 4 would be the hardest to evaluate
> because it would have to be determined if 4 is offset by improvements
> to 1-3. If 1-3 is improved enough, it might remove the motivation for
> the series entirely.
>
> In other words, if we agree on a workload in advance, it might bring
> this the right direction and not accidentally throw Anthony down a
> hole working on a series that never gets ack'd.
>
> I'm not necessarily the best person to answer because my natural
> inclination after the fragmentation series would be to keep using
> thpfioscale (from the fragmentation avoidance series) and work on
> improving the THP allocation success rates and reduce latencies. I've
> tunnel vision on that for the moment.

Deciding the workloads is a good question indeed, but I would also be
curious how many of those pages would not end up being promoted with
this logic. How many pte_none ptes do you require in each pmd to avoid
promotion? If it's just 1, apps will run slower; if there's partial
utilization, THP already helps. I have a hard time thinking of an
ideal ratio, which is why max_ptes_none is 511 after all.

Can we start by counting the total number of pte_none() entries in all
pmds that can fit a THP according to vma->vm_start/end? The pagetable
dumper in debugfs may already provide the info we need, by scanning all
mm and printing the number of "none" ptes that would generate "wasted"
memory (and marginally wasted CPU during copy/clear).
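Short of a dedicated dumper, a rough userspace approximation is
possible with /proc/self/pagemap: for each 2M-aligned chunk of a
mapping, count the 4k entries that are neither present nor swapped as
a proxy for pte_none(). The helper below is only a sketch of that idea
(the name is mine, and it can over-count because it can't see migration
entries and similar non-present but non-none ptes):

/*
 * Approximate the "none" pte count per potential THP by reading
 * /proc/self/pagemap: one 64-bit entry per 4k page, bit 63 = present,
 * bit 62 = swapped (see the kernel pagemap documentation).
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define HPAGE_SIZE      (2UL << 20)
#define PAGE_4K         4096UL

static void count_none_ptes(unsigned long start, unsigned long end)
{
        int fd = open("/proc/self/pagemap", O_RDONLY);
        unsigned long addr;

        if (fd < 0)
                return;
        /* walk only the 2M-aligned chunks fully inside [start, end) */
        for (addr = (start + HPAGE_SIZE - 1) & ~(HPAGE_SIZE - 1);
             addr + HPAGE_SIZE <= end; addr += HPAGE_SIZE) {
                unsigned long off, none = 0;

                for (off = 0; off < HPAGE_SIZE; off += PAGE_4K) {
                        uint64_t ent = 0;

                        pread(fd, &ent, sizeof(ent),
                              (off_t)((addr + off) / PAGE_4K) * sizeof(ent));
                        if (!(ent & (3ULL << 62)))
                                none++;
                }
                printf("pmd at %#lx: %lu none ptes out of 512\n",
                       addr, none);
        }
        close(fd);
}

Feeding it the vma ranges from /proc/self/maps would give a first-order
estimate of how much memory the reservation could actually save on a
real workload.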
Then you could tell exactly how many pmds won't be promoted to
transhuge pmds with the patch applied in real life workloads, even
before running any benchmark. It'd be good to be sure we're talking
about a significant number in real life workloads, or there's not much
to optimize to begin with.

If the amount of RAM saved is significant in real life workloads, and
in turn there's a chance of a worthwhile tradeoff from the reservation
logic, then we can do the benchmarks, because the page fault behavior
will be different and it'll end up running slower with the reservation
logic.

Thanks,
Andrea