Re: [PATCH v3 0/3] A Solution to Re-enable hugetlb vmemmap optimize

Nanyong Sun <sunnanyong@xxxxxxxxxx> · Thu, 4 Jul 2024 19:47:01 +0800

On 2024/6/28 5:03, Yu Zhao wrote:
On Thu, Jun 27, 2024 at 8:34 AM Nanyong Sun <sunnanyong@xxxxxxxxxx> wrote:

On 2024/6/24 13:39, Yu Zhao wrote:
On Mon, Mar 25, 2024 at 11:24:34PM +0800, Nanyong Sun wrote:
On 2024/3/14 7:32, David Rientjes wrote:

On Thu, 8 Feb 2024, Will Deacon wrote:

How about taking a new lock with IRQs disabled during BBM, like:

+void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte)
+{
+    spin_lock_irq(NEW_LOCK);
+    pte_clear(&init_mm, addr, ptep);
+    flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
+    set_pte_at(&init_mm, addr, ptep, pte);
+    spin_unlock_irq(NEW_LOCK);
+}
I really think the only maintainable way to achieve this is to avoid the
possibility of a fault altogether.

Will
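To make the concern about faults concrete, here is a minimal user-space sketch of the break-before-make window. It is a toy model, not kernel code: toy_pte_t, TOY_PTE_VALID, toy_bbm_update() and the hook are all hypothetical names, bit 0 is assumed to mean "valid", and the TLB flush is only a comment. It shows that anything that looks at the entry between the clear and the set sees no mapping. The model does not include any lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only: a PTE is a 64-bit word and bit 0 means "valid". */
typedef uint64_t toy_pte_t;
#define TOY_PTE_VALID ((toy_pte_t)1)

/* Hypothetical hook that stands in for another CPU touching the mapping. */
typedef void (*toy_window_hook)(const toy_pte_t *ptep, void *arg);

/* Break-before-make on the toy PTE: clear, (pretend to) flush, then set. */
void toy_bbm_update(toy_pte_t *ptep, toy_pte_t new_pte,
                    toy_window_hook hook, void *arg)
{
    assert(ptep != NULL);

    *ptep = 0;                  /* break: the entry is now invalid */
    /* A real implementation would invalidate the TLB here. */
    if (hook)
        hook(ptep, arg);        /* anything running now sees no mapping */
    *ptep = new_pte;            /* make: install the new entry */
}

/* Example hook: records whether the entry was invalid when it looked. */
void toy_record_invalid(const toy_pte_t *ptep, void *arg)
{
    bool *saw_invalid = arg;

    if (!(*ptep & TOY_PTE_VALID))
        *saw_invalid = true;
}
```

Calling toy_bbm_update() with toy_record_invalid as the hook sets the flag, which is the toy equivalent of a reader hitting the window and faulting.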

Nanyong, are you still actively working on making HVO possible on arm64?

This would yield a substantial memory savings on hosts that are largely
configured with hugetlbfs.  In our case, the size of this hugetlbfs pool
is actually never changed after boot, but it sounds from the thread that
there was an idea to make HVO conditional on FEAT_BBM.  Is this being
pursued?

If so, any testing help needed?
I'm afraid that FEAT_BBM may not solve the problem here
I think so too -- I came across this while working on TAO [1].

[1] https://lore.kernel.org/20240229183436.4110845-4-yuzhao@xxxxxxxxxx/

because from the Arm ARM, I see that FEAT_BBM is only used for changing the
block size. Therefore, in this HVO feature, it can work in the split PMD
stage, that is, BBM can be avoided in vmemmap_split_pmd, but in the
subsequent vmemmap_remap_pte, the output address of the PTE still needs to
be changed. I'm afraid FEAT_BBM cannot help at this stage. Perhaps my
understanding of ARM FEAT_BBM is wrong, and I hope someone can correct me.
Actually, the solution I first considered was to use stop_machine(), but we
have products that rely on /proc/sys/vm/nr_overcommit_hugepages to use huge
pages dynamically, so I have to consider performance. If your product does
not change the number of huge pages after boot, using stop_machine() may be
feasible.
So far, I still haven't come up with a good solution.
I do have a patch that's similar to stop_machine() -- it uses NMI IPIs
to pause/resume remote CPUs while the local one is doing BBM.

Note that the problem of updating vmemmap for struct page[], as I see
it, is beyond hugeTLB HVO. I think it impacts virtio-mem and memory
hot removal in general [2]. On arm64, we would need to support BBM on
vmemmap so that we can fix the problem with offlining memory (or to be
precise, unmapping offlined struct page[]), by mapping offlined struct
page[] to a read-only page of dummy struct page[], similar to
ZERO_PAGE(). (Or we would have to make extremely invasive changes to
the reader side, i.e., all speculative PFN walkers.)

In case you are interested in testing my approach, you can swap your
patch 2 with the following:
I don't have an NMI IPI capable ARM machine on hand, and I think this feature
depends on a newer version of the ARM CPU.
(Pseudo) NMI does require GICv3 (released in 2015). But that's
independent from CPU versions. Just to double check: you don't have
GICv3 (rather than not have CONFIG_ARM64_PSEUDO_NMI=y or
irqchip.gicv3_pseudo_nmi=1), is that correct?

Even without GICv3, IPIs can be masked but still work, with a less
bounded latency.
Oh, I misunderstood. Pseudo NMI is available. We have
CONFIG_ARM64_PSEUDO_NMI=y but do not set irqchip.gicv3_pseudo_nmi=1 by
default, so I can test this solution after enabling it on the kernel
command line.

What I worried about was that other cores would be interrupted frequently
(8 times for every 2M and 4096 times for every 1G) and would then have to
wait for the page table update to complete before resuming.
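The figures 8 and 4096 are consistent with counting the vmemmap pages that back one huge page. The sketch below works through that arithmetic under two assumptions not stated in the thread: 4 KiB base pages and a 64-byte struct page. The function name and its parameters are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Number of base pages of vmemmap (the struct page array) that describe
 * one huge page. All three sizes are in bytes and are parameters of the
 * sketch, not facts about any particular kernel build.
 */
size_t vmemmap_pages_per_hugepage(size_t huge_size, size_t base_page_size,
                                  size_t struct_page_size)
{
    assert(base_page_size != 0);

    /* One struct page per base page covered by the huge page. */
    size_t nr_struct_pages = huge_size / base_page_size;
    /* Total bytes of struct page metadata for the huge page. */
    size_t vmemmap_bytes = nr_struct_pages * struct_page_size;

    /* How many base pages that metadata occupies. */
    return vmemmap_bytes / base_page_size;
}
```

With those assumptions, a 2M huge page has 512 struct pages, or 32 KiB of metadata, which is 8 base pages. A 1G huge page has 262144 struct pages, or 16 MiB of metadata, which is 4096 base pages.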
Catalin has suggested batching, and to echo what he said [1]: it's
possible to make all vmemmap changes from a single HVO/de-HVO
operation into *one batch*.

[1] https://lore.kernel.org/linux-mm/ZcN7P0CGUOOgki71@xxxxxxx/
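As a rough illustration of why batching matters, the user-space sketch below counts how many times remote CPUs would be paused with and without it. It is a toy model under stated assumptions: struct toy_update, struct toy_machine, toy_pause_others() and toy_resume_others() are hypothetical stand-ins, and the "pause" is only a counter. It is not the patch discussed in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One pending page-table change in the toy model. */
struct toy_update {
    uint64_t *slot;     /* entry to rewrite */
    uint64_t value;     /* value to install */
};

/* Counts how many times the other CPUs were paused (hypothetical stat). */
struct toy_machine {
    unsigned int pauses;
};

/* Stand-ins for whatever mechanism pauses and resumes remote CPUs. */
void toy_pause_others(struct toy_machine *m)
{
    assert(m != NULL);
    m->pauses++;
}

void toy_resume_others(struct toy_machine *m)
{
    (void)m;            /* nothing to do in the model */
}

/* Unbatched: one pause/resume cycle per entry. */
void toy_apply_each(struct toy_machine *m, const struct toy_update *u,
                    size_t n)
{
    for (size_t i = 0; i < n; i++) {
        toy_pause_others(m);
        *u[i].slot = u[i].value;
        toy_resume_others(m);
    }
}

/* Batched: a single pause/resume cycle covers every entry. */
void toy_apply_batch(struct toy_machine *m, const struct toy_update *u,
                     size_t n)
{
    if (n == 0)
        return;

    toy_pause_others(m);
    for (size_t i = 0; i < n; i++)
        *u[i].slot = u[i].value;
    toy_resume_others(m);
}
```

For eight updates, toy_apply_each() records eight pauses while toy_apply_batch() records one. In the batched case the remote CPUs stay paused for the whole loop, so the trade is fewer interruptions for one longer one.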

If there are workloads
running on other cores, performance may be affected. This implementation
speeds up stopping and resuming other cores, but they still have to wait
for the update to finish.
How often does your use case trigger HVO/de-HVO operations?

For our VM use case, it's generally correlated to VM lifetimes, i.e.,
how often VM bin-packing happens. For our THP use case, it can be more
often, but I still don't think we would trigger HVO/de-HVO every
minute. So with NMI IPIs, IMO, the performance impact would be
acceptable to our use cases.

We have many use cases, so I'm not thinking about a specific one but rather
a generic one. I will test the performance impact of different HVO trigger
frequencies, for example by triggering HVO while running redis.