Re: [RESEND PATCH] mm: align larger anonymous mappings on THP boundaries

On 2024/5/8 16:36, Ryan Roberts wrote:
On 08/05/2024 08:48, Kefeng Wang wrote:


On 2024/5/8 1:17, Yang Shi wrote:
On Tue, May 7, 2024 at 8:53 AM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:

On 07/05/2024 14:53, Kefeng Wang wrote:


On 2024/5/7 19:13, David Hildenbrand wrote:

https://github.com/intel/lmbench/blob/master/src/lat_mem_rd.c#L95

suggest. If you want to try something semi-randomly, it might be useful to rule
out the arm64 contpte feature. I don't see how that would be interacting here
if mTHP is disabled (is it?). But it's new for 6.9 and arm64 only. Disable with
ARM64_CONTPTE (needs EXPERT) at compile time.
I didn't enable mTHP, so it should not be related to ARM64_CONTPTE,
but I will have a try.

With ARM64_CONTPTE disabled, the memory read latency is similar to the
ARM64_CONTPTE-enabled case (the 6.9-rc7 default), and still larger than with
the anon-alignment patch reverted.

OK thanks for trying.

Looking at the source for lmbench, it's malloc'ing (512M + 8K) up front and using
that for all sizes. That will presumably be considered "large" by malloc and
will be allocated using mmap. So with the patch, it will be 2M aligned. Without
it, it probably won't be. I'm still struggling to understand why not aligning it
in virtual space would make it more performant though...
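
A minimal sketch of that check (not taken from lmbench; the 512M + 8K size just
mirrors the test, and the 2M mask is an assumption about the alignment the
patch targets):

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

int main(void)
{
        /* Same size lat_mem_rd allocates up front: 512M + 8K. */
        size_t sz = (512UL << 20) + 8192;
        void *p = malloc(sz);

        if (!p)
                return 1;
        /*
         * A request this large should be served by mmap. With the patch the
         * underlying mapping should start on a 2M boundary, so the printed
         * offset is just malloc's small bookkeeping offset; without the
         * patch it is essentially arbitrary.
         */
        printf("buf %p, offset into 2M: %#lx\n",
               p, (unsigned long)((uintptr_t)p & ((2UL << 20) - 1)));
        free(p);
        return 0;
}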

Yeah, I'm confused too.
Me too. I captured smaps[_rollup] for the 0.09375M size; the biggest difference
for the anon regions is shown below, and the full output is attached.

OK, a bit more insight; during initialization, the test makes 2 big malloc
calls; the first is 1M and the second is 512M+8K. I think those 2 are the 2 vmas
below (malloc is adding an extra page to the allocation, presumably for
management structures).

With efa7df3e3bb5 applied, the 1M allocation is allocated at a non-THP-aligned
address. All of its pages are populated (see permutation(), which allocates and
writes it) but none of them are THP (obviously - it's only 1M and THP is only
enabled for 2M). But the 512M region is allocated at a THP-aligned address, and
the first page is populated with a THP (presumably faulted when malloc writes to
its control structure page, before the application even sees the allocated buffer).

In contrast, when efa7df3e3bb5 is reverted, neither of the vmas is THP-aligned,
and therefore the 512M region abuts the 1M region and the vmas are merged in
the kernel. So we end up with the single 525328 kB region. There are no THPs
allocated here (due to alignment constraints), so we end up with the 1M region
fully populated with 4K pages as before, and only the malloc control page plus
the parts of the buffer that the application actually touches being populated in
the 512M region.
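
A minimal sketch that reproduces this layout difference outside lmbench (same
allocation order and sizes as the test setup; the greps over /proc/<pid>/maps
and smaps are only illustrative):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
        char cmd[128];
        /* Same order as the test setup: 1M first, then 512M + 8K. */
        void *a = malloc(1UL << 20);
        void *b = malloc((512UL << 20) + 8192);

        printf("1M buf %p, 512M buf %p\n", a, b);
        /*
         * With the patch the two mappings stay separate (the big one
         * 2M-aligned, with AnonHugePages already non-zero from malloc's
         * control-page write); with it reverted they can merge into a
         * single VMA with no THP.
         */
        snprintf(cmd, sizeof(cmd),
                 "grep -c rw-p /proc/%d/maps; grep AnonHugePages /proc/%d/smaps",
                 getpid(), getpid());
        system(cmd);
        free(b);
        free(a);
        return 0;
}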

As far as I can tell, the application never touches the 1M region during the
test so it should be cache-cold. It only touches the first part of the 512M
buffer it needs for the size of the test (96K here?). The latency of allocating
the THP will have been consumed during test setup so I doubt we are seeing that
in the test results and I don't see why having a single TLB entry vs 96K/4K=24
entries would make it slower.

It is strange, and even stranger: I tried another machine (the old machine has
128 cores and the new one 96 cores, but with the same L1/L2 cache size per
core), and the new machine does not show this issue. I will contact our
hardware team; maybe there are some different configurations (prefetch or
other similar hardware settings). Thanks for all the suggestions and analysis!



It would be interesting to know the address that gets returned from malloc for
the 512M region if that's possible to get (in both cases)? I guess it is offset
into the first page. Perhaps it is offset such that with the THP alignment case
the 96K of interest ends up straddling 3 cache lines (cache line is 64K I
assume?), but for the unaligned case, it ends up nicely packed in 2?
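
For what it's worth, a minimal sketch of that check (only the 512M + 8K size
comes from the test; the 4K/64-byte masks are just the usual page and
cache-line sizes):

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

int main(void)
{
        void *p = malloc((512UL << 20) + 8192);
        uintptr_t a = (uintptr_t)p;

        /* Where within its first page (and cache line) does the buffer start? */
        printf("addr %p, offset in 4K page %#lx, offset in 64B line %#lx\n",
               p, (unsigned long)(a & 0xfffUL), (unsigned long)(a & 0x3fUL));
        free(p);
        return 0;
}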

CC zuoze, please help to check this.

Thanks again.

Thanks,
Ryan


1) with efa7df3e3bb5 smaps

ffff68e00000-ffff88e03000 rw-p 00000000 00:00 0
Size:             524300 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:                2048 kB
Pss:                2048 kB
Pss_Dirty:          2048 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:      2048 kB
Referenced:         2048 kB
Anonymous:          2048 kB // we have 1 anon thp
KSM:                   0 kB
LazyFree:              0 kB
AnonHugePages:      2048 kB

Yes one 2M THP shown here.

ShmemPmdMapped:        0 kB
FilePmdMapped:         0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
Locked:                0 kB
THPeligible:           1
VmFlags: rd wr mr mw me ac
ffff88eff000-ffff89000000 rw-p 00000000 00:00 0
Size:               1028 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:                1028 kB
Pss:                1028 kB
Pss_Dirty:          1028 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:      1028 kB
Referenced:         1028 kB
Anonymous:          1028 kB // another large anon

This is not THP, since you only have 2M THP enabled. This will be 1M of 4K page
allocations + 1 4K page malloc control structure, allocated and accessed by
permutation() during test setup.

KSM:                   0 kB
LazyFree:              0 kB
AnonHugePages:         0 kB
ShmemPmdMapped:        0 kB
FilePmdMapped:         0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
Locked:                0 kB
THPeligible:           0
VmFlags: rd wr mr mw me ac

and the smaps_rollup

00400000-fffff56bd000 ---p 00000000 00:00 0 [rollup]
Rss:                4724 kB
Pss:                3408 kB
Pss_Dirty:          3338 kB
Pss_Anon:           3338 kB
Pss_File:             70 kB
Pss_Shmem:             0 kB
Shared_Clean:       1176 kB
Shared_Dirty:        420 kB
Private_Clean:         0 kB
Private_Dirty:      3128 kB
Referenced:         4344 kB
Anonymous:          3548 kB
KSM:                   0 kB
LazyFree:              0 kB
AnonHugePages:      2048 kB
ShmemPmdMapped:        0 kB
FilePmdMapped:         0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
Locked:                0 kB

2) without efa7df3e3bb5 smaps

ffff9845b000-ffffb855f000 rw-p 00000000 00:00 0
Size:             525328 kB

This is a merged-vma version of the above 2 regions.

KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:                1128 kB
Pss:                1128 kB
Pss_Dirty:          1128 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:      1128 kB
Referenced:         1128 kB
Anonymous:          1128 kB // only large anon
KSM:                   0 kB
LazyFree:              0 kB
AnonHugePages:         0 kB
ShmemPmdMapped:        0 kB
FilePmdMapped:         0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
Locked:                0 kB
THPeligible:           1
VmFlags: rd wr mr mw me ac

and the smaps_rollup,

00400000-ffffca5dc000 ---p 00000000 00:00 0 [rollup]
Rss:                2600 kB
Pss:                1472 kB
Pss_Dirty:          1388 kB
Pss_Anon:           1388 kB
Pss_File:             84 kB
Pss_Shmem:             0 kB
Shared_Clean:       1000 kB
Shared_Dirty:        424 kB
Private_Clean:         0 kB
Private_Dirty:      1176 kB
Referenced:         2220 kB
Anonymous:          1600 kB
KSM:                   0 kB
LazyFree:              0 kB
AnonHugePages:         0 kB
ShmemPmdMapped:        0 kB
FilePmdMapped:         0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
Locked:                0 kB




