Re: [PATCH] arm64: mm: fix linear mapping mem access performace degradation

"guanghui.fgh" <guanghuifeng@xxxxxxxxxxxxxxxxx> · Tue, 28 Jun 2022 15:52:48 +0800

On 2022/6/28 11:06, guanghui.fgh wrote:
Thanks.

On 2022/6/28 9:34, Leizhen (ThunderTown) wrote:

On 2022/6/27 20:25, guanghui.fgh wrote:
Thanks.

On 2022/6/27 20:06, Leizhen (ThunderTown) wrote:

On 2022/6/27 18:46, guanghui.fgh wrote:

On 2022/6/27 17:49, Mike Rapoport wrote:
Please don't post HTML.

On Mon, Jun 27, 2022 at 05:24:10PM +0800, guanghui.fgh wrote:
Thanks.

On 2022/6/27 14:34, Mike Rapoport wrote:

       On Sun, Jun 26, 2022 at 07:10:15PM +0800, Guanghui Feng wrote:

           arm64 can build 2M/1G block/section mappings. When using the
           DMA/DMA32 zone (crashkernel enabled, rodata full disabled,
           kfence disabled), mem_map uses non-block/section mappings
           (because crashkernel requires shrinking the region at page
           granularity). But this degrades performance for large
           contiguous memory accesses in the kernel (memcpy/memmove, etc.).

           There are many changes and discussions:
           commit 031495635b46
           commit 1a8e1cef7603
           commit 8424ecdde7df
           commit 0a30c53573b0
           commit 2687275a5843

       Please include a one-line summary of each commit. (See section
       "Describe your changes" in Documentation/process/submitting-patches.rst)

OK, I will add a one-line summary for each commit in the commit message.

           This patch changes mem_map to use block/section mappings with
           crashkernel. First, build block/section mappings (normally 2M
           or 1G) for all available memory in mem_map and reserve the
           crashkernel memory. Then walk the page table and split the
           block/section mappings into non-block/section mappings
           (normally 4K) [[[only]]] for the crashkernel memory.

       This already happens when ZONE_DMA/ZONE_DMA32 are disabled. Please
       explain why it is OK to change the way the memory is mapped with
       ZONE_DMA/ZONE_DMA32 enabled.

In short:

1. Build block/section mappings (normally 1G/2M) for all available memory, without inspecting crashkernel.
2. Reserve the crashkernel memory the same way as before.
3. Change only the crashkernel memory mapping to normal page mappings (normally 4K).

With this method, as much memory as possible uses block/section mappings.

This does not answer the question of why changing the way the memory is
mapped when there is ZONE_DMA/DMA32 and crashkernel won't cause a
regression.

1. Quoted comment from arch/arm64/mm/init.c:

"Memory reservation for crash kernel either done early or deferred
depending on DMA memory zones configs (ZONE_DMA) --

In absence of ZONE_DMA configs arm64_dma_phys_limit initialized
here instead of max_zone_phys().  This lets early reservation of
crash kernel memory which has a dependency on arm64_dma_phys_limit.
Reserving memory early for crash kernel allows linear creation of block
mappings (greater than page-granularity) for all the memory bank ranges.
In this scheme a comparatively quicker boot is observed.

If ZONE_DMA configs are defined, crash kernel memory reservation
is delayed until DMA zone memory range size initialization performed in
zone_sizes_init().  The defer is necessary to steer clear of DMA zone
memory range to avoid overlap allocation.  So crash kernel memory boundaries
are not known when mapping all bank memory ranges, which otherwise means
not possible to exclude crash kernel range from creating block mappings
so page-granularity mappings are created for the entire memory range."

Namely, the init order is: memblock init ---> linear memory mapping (4K mapping, because the crashkernel region needs page-granularity changes) ---> zone DMA limit ---> reserve crashkernel.
So when ZONE_DMA is enabled and crashkernel is used, the linear memory is mapped with 4K pages.

2. As mentioned above, when the linear memory simply uses 4K mappings, the dTLB miss rate is high (which degrades performance).
This patch uses block/section mappings as far as possible, which improves performance.

3. This patch reserves crashkernel the same way as before (same ZONE_DMA and crashkernel reservation order), and only changes the linear memory mapping to block/section mappings.
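
To put rough numbers on the dTLB pressure mentioned in point 2, here is a minimal user-space sketch (not kernel code) that counts how many translations are needed to cover a given amount of linear memory at each mapping granularity. The helper name tlb_entries_needed() and the SZ_* macros are local to this example, and a 4K base page size is assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Mapping granularities for a 4K base page size (assumption of this sketch). */
#define SZ_4K (UINT64_C(1) << 12)
#define SZ_2M (UINT64_C(1) << 21)
#define SZ_1G (UINT64_C(1) << 30)

/*
 * Number of translations needed to cover `bytes` of linear memory when
 * every mapping has size `granule`. A partial trailing granule still
 * needs one translation, so the division rounds up.
 */
static uint64_t tlb_entries_needed(uint64_t bytes, uint64_t granule)
{
    assert(granule != 0);
    return (bytes + granule - 1) / granule;
}
```

For 1 GiB of linear memory this gives 262144 translations with 4K pages, 512 with 2M blocks and 1 with a 1G block. How many of those a given core's TLB can hold is hardware specific and is not modelled here.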

I think Mike Rapoport is probably asking whether you have taken things
such as BBM (break-before-make) into account. For example, in the
following code we should prepare the next-level page table first, then
change the 2M block mapping to 4K page mappings, and flush the TLB at the end.
+static void init_crashkernel_pmd(pud_t *pudp, unsigned long addr,
+                 unsigned long end, phys_addr_t phys,
+                 pgprot_t prot,
+                 phys_addr_t (*pgtable_alloc)(int), int flags)
+{
+    phys_addr_t map_offset;
+    unsigned long next;
+    pmd_t *pmdp;
+    pmdval_t pmdval;
+
+    pmdp = pmd_offset(pudp, addr);
+    do {
+        next = pmd_addr_end(addr, end);
+        if (!pmd_none(*pmdp) && pmd_sect(*pmdp)) {
+            phys_addr_t pte_phys = pgtable_alloc(PAGE_SHIFT);
+            pmd_clear(pmdp);
+            pmdval = PMD_TYPE_TABLE | PMD_TABLE_UXN;
+            if (flags & NO_EXEC_MAPPINGS)
+                pmdval |= PMD_TABLE_PXN;
+            __pmd_populate(pmdp, pte_phys, pmdval);
+            flush_tlb_kernel_range(addr, addr + PAGE_SIZE);

The page table is empty at this point. However, memory other than crashkernel may be accessed in the meantime.
1. When reserving crashkernel and remapping the linear memory mapping, only the boot CPU is running. No other CPU/thread runs at the same time.

So, put this in the code comment?
OK.

If scalability is considered and unpredictable changes occur in the future
(for example, other modules also need this mapping function), it would be
better to deal with the BBM now and make this public.
OK, could you give me some advice?

2. When clearing the block/section mapping, I flush the TLB with flush_tlb_kernel_range(). The 4K mappings are rebuilt afterwards (I think there is no need to flush the TLB again).
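
The ordering suggested above (prepare the next-level table first, then replace the block entry, with a TLB flush in between) can be illustrated with a small user-space model. This is only a sketch under stated assumptions: struct model_pmd, model_translate() and model_split_block() are hypothetical names invented for this example, the "TLB flush" is just a counter, and nothing here is the kernel's actual page-table API.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define PAGE_SHIFT   12
#define PMD_SHIFT    21
#define PTRS_PER_PTE (1u << (PMD_SHIFT - PAGE_SHIFT))   /* 512 entries */

enum entry_type { ENTRY_NONE, ENTRY_BLOCK, ENTRY_TABLE };

/* Hypothetical stand-in for one PMD slot; not a kernel type. */
struct model_pmd {
    enum entry_type type;
    uint64_t block_phys;   /* valid when type == ENTRY_BLOCK */
    uint64_t *pte;         /* valid when type == ENTRY_TABLE */
    unsigned tlb_flushes;  /* counts simulated TLB invalidations */
};

/* Translate an offset inside the 2M region; UINT64_MAX means unmapped. */
static uint64_t model_translate(const struct model_pmd *pmd, uint64_t off)
{
    assert(off < (UINT64_C(1) << PMD_SHIFT));
    switch (pmd->type) {
    case ENTRY_BLOCK:
        return pmd->block_phys + off;
    case ENTRY_TABLE:
        return pmd->pte[off >> PAGE_SHIFT] +
               (off & ((UINT64_C(1) << PAGE_SHIFT) - 1));
    default:
        return UINT64_MAX;
    }
}

/* Split one 2M block into 512 4K entries. Returns 0 on success, -1 on failure. */
static int model_split_block(struct model_pmd *pmd)
{
    uint64_t *pte;
    unsigned i;

    if (pmd->type != ENTRY_BLOCK)
        return -1;

    /* 1. Prepare: fill the next-level table while the block is still live. */
    pte = malloc(PTRS_PER_PTE * sizeof(*pte));
    if (!pte)
        return -1;
    for (i = 0; i < PTRS_PER_PTE; i++)
        pte[i] = pmd->block_phys + ((uint64_t)i << PAGE_SHIFT);

    /* 2. Break: invalidate the old block entry. */
    pmd->type = ENTRY_NONE;

    /* 3. Flush: drop any cached translation of the old entry (simulated). */
    pmd->tlb_flushes++;

    /* 4. Make: install the already-populated table. */
    pmd->pte = pte;
    pmd->type = ENTRY_TABLE;
    return 0;
}
```

In this model the range is unmapped only between steps 2 and 4, because the table is complete before the block entry is cleared. Whether that window is acceptable in the real code depends on what else can run at that point, which is the question discussed in this thread.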

+
+            map_offset = addr - (addr & PMD_MASK);
+            if (map_offset)
+                alloc_init_cont_pte(pmdp, addr & PMD_MASK, addr,
+                        phys - map_offset, prot,
+                        pgtable_alloc, flags);
+
+            if (next < (addr & PMD_MASK) + PMD_SIZE)
+                alloc_init_cont_pte(pmdp, next, (addr & PUD_MASK) +
+                        PUD_SIZE, next - addr + phys,
+                        prot, pgtable_alloc, flags);

Here and in alloc_crashkernel_pud(), the raw flags should be used. They may not
contain (NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS).
Yes. The memory outside crashkernel should use block/section mappings as far as possible, including the left margin and right margin.
But I tested this on HiSilicon Kunpeng 920-6426 and got a performance degradation (without the NO_BLOCK_MAPPINGS/NO_CONT_MAPPINGS flags for the left/right margins).
It's strange; could you give some advice? Maybe it is good for ARM platforms other than HiSilicon Kunpeng 920-6426.
The non-crashkernel memory should be split [[[ without ]]] the NO_BLOCK_MAPPINGS/NO_CONT_MAPPINGS flags.

I tested it on another ARM platform [[[ a non-HiSilicon ARM platform ]]] and also got a great performance improvement.

Could you help me check the difference between HiSilicon Kunpeng 920-6426 and other ARM platforms in block/section mapping TLB support?

+        }
+        alloc_crashkernel_cont_pte(pmdp, addr, next, phys, prot,
+                       pgtable_alloc, flags);
+        phys += next - addr;
+    } while (pmdp++, addr = next, addr != end);
+}
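
The left/right margin arithmetic discussed for this hunk can be sketched in user space as follows. This is an illustration only: struct range, struct pmd_split and split_pmd_block() are hypothetical names, and the PMD_* macros are local redefinitions for a 4K page size. The sketch also assumes the right margin ends at the PMD boundary; the quoted hunk passes (addr & PUD_MASK) + PUD_SIZE as the end of the second range, and the thread does not say whether that is intended.

```c
#include <assert.h>
#include <stdint.h>

/* Local redefinitions for a 4K page size; not taken from kernel headers. */
#define PMD_SHIFT 21
#define PMD_SIZE  (UINT64_C(1) << PMD_SHIFT)
#define PMD_MASK  (~(PMD_SIZE - 1))

struct range { uint64_t start, end; };   /* half-open [start, end) */

struct pmd_split {
    struct range left;    /* non-crashkernel memory before addr */
    struct range crash;   /* crashkernel part, mapped with 4K pages */
    struct range right;   /* non-crashkernel memory after next */
};

/*
 * Given the crashkernel sub-range [addr, next) that lies inside a single
 * 2M block, compute the three pieces the block has to be split into.
 * An empty margin has start == end.
 */
static struct pmd_split split_pmd_block(uint64_t addr, uint64_t next)
{
    uint64_t base = addr & PMD_MASK;
    struct pmd_split s;

    assert(addr < next && next <= base + PMD_SIZE);
    s.left  = (struct range){ base, addr };
    s.crash = (struct range){ addr, next };
    s.right = (struct range){ next, base + PMD_SIZE };
    return s;
}
```

The length of the left margin, addr - base, is the same quantity as map_offset in the hunk. When the crashkernel range covers the whole block, both margins come out empty.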
