On Wed, May 8, 2024 at 6:37 AM Kefeng Wang <wangkefeng.wang@xxxxxxxxxx> wrote:
>
>
>
> On 2024/5/8 16:36, Ryan Roberts wrote:
> > On 08/05/2024 08:48, Kefeng Wang wrote:
> >>
> >>
> >> On 2024/5/8 1:17, Yang Shi wrote:
> >>> On Tue, May 7, 2024 at 8:53 AM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
> >>>>
> >>>> On 07/05/2024 14:53, Kefeng Wang wrote:
> >>>>>
> >>>>>
> >>>>> On 2024/5/7 19:13, David Hildenbrand wrote:
> >>>>>>
> >>>>>>> https://github.com/intel/lmbench/blob/master/src/lat_mem_rd.c#L95
> >>>>>>>
> >>>>>>>> suggest. If you want to try something semi-randomly, it might be useful
> >>>>>>>> to rule out the arm64 contpte feature. I don't see how that would be
> >>>>>>>> interacting here if mTHP is disabled (is it?). But it's new for 6.9 and
> >>>>>>>> arm64 only. Disable with ARM64_CONTPTE (needs EXPERT) at compile time.
> >>>>>>> I haven't enabled mTHP, so it should not be related to ARM64_CONTPTE,
> >>>>>>> but I will give it a try.
> >>>>>
> >>>>> With ARM64_CONTPTE disabled, memory read latency is similar to that with
> >>>>> ARM64_CONTPTE enabled (default 6.9-rc7), and still larger than with the
> >>>>> anon alignment commit reverted.
> >>>>
> >>>> OK thanks for trying.
> >>>>
> >>>> Looking at the source for lmbench, it's malloc'ing (512M + 8K) up front and
> >>>> using that for all sizes. That will presumably be considered "large" by
> >>>> malloc and will be allocated using mmap. So with the patch, it will be 2M
> >>>> aligned. Without it, it probably won't. I'm still struggling to understand
> >>>> why not aligning it in virtual space would make it more performant though...
> >>>
> >>> Yeah, I'm confused too.
> >> Me too. I captured smaps[_rollup] for the 0.09375M size; the biggest
> >> differences for anon are shown below, and the full output is attached.
> >
> > OK, a bit more insight; during initialization, the test makes 2 big malloc
> > calls; the first is 1M and the second is 512M+8K.
> > I think those 2 are the 2 vmas
> > below (malloc is adding an extra page to the allocation, presumably for
> > management structures).
> >
> > With efa7df3e3bb5 applied, the 1M allocation is allocated at a non-THP-aligned
> > address. All of its pages are populated (see permutation() which allocates and
> > writes it) but none of them are THP (obviously - it's only 1M and THP is only
> > enabled for 2M). But the 512M region is allocated at a THP-aligned address. And
> > the first page is populated with a THP (presumably faulted when malloc writes to
> > its control structure page before the application even sees the allocated
> > buffer).
> >
> > In contrast, when efa7df3e3bb5 is reverted, neither of the vmas is THP-aligned,
> > and therefore the 512M region abuts the 1M region and the vmas are merged in
> > the kernel. So we end up with the single 525328 kB region. There are no THPs
> > allocated here (due to alignment constraints) so we end up with the 1M region
> > fully populated with 4K pages as before, and only the malloc control page plus
> > the parts of the buffer that the application actually touches being populated in
> > the 512M region.
> >
> > As far as I can tell, the application never touches the 1M region during the
> > test so it should be cache-cold. It only touches the first part of the 512M
> > buffer it needs for the size of the test (96K here?). The latency of allocating
> > the THP will have been consumed during test setup so I doubt we are seeing that
> > in the test results and I don't see why having a single TLB entry vs 96K/4K=24
> > entries would make it slower.
>
> It is strange, and even stranger: I tried another machine (the old machine
> has 128 cores and the new machine has 96 cores, with the same L1/L2 cache
> size per core), and the new machine does not have this issue. I will contact
> our hardware team; maybe there are some configuration differences (prefetch
> or other similar hardware settings). Thanks for all the suggestions and
> analysis!

Yes, the benchmark result depends strongly on the cache and memory
subsystem. See the analysis below.

>
>
> >
> > It would be interesting to know the address that gets returned from malloc for
> > the 512M region if that's possible to get (in both cases)? I guess it is offset
> > into the first page. Perhaps it is offset such that with the THP alignment case
> > the 96K of interest ends up straddling 3 cache lines (cache line is 64K I
> > assume?), but for the unaligned case, it ends up nicely packed in 2?
>
> CC zuoze, please help to check this.
>
> Thanks again.
> >
> > Thanks,
> > Ryan
> >
> >>
> >> 1) with efa7df3e3bb5 smaps
> >>
> >> ffff68e00000-ffff88e03000 rw-p 00000000 00:00 0
> >> Size: 524300 kB
> >> KernelPageSize: 4 kB
> >> MMUPageSize: 4 kB
> >> Rss: 2048 kB
> >> Pss: 2048 kB
> >> Pss_Dirty: 2048 kB
> >> Shared_Clean: 0 kB
> >> Shared_Dirty: 0 kB
> >> Private_Clean: 0 kB
> >> Private_Dirty: 2048 kB
> >> Referenced: 2048 kB
> >> Anonymous: 2048 kB // we have 1 anon thp
> >> KSM: 0 kB
> >> LazyFree: 0 kB
> >> AnonHugePages: 2048 kB
> >
> > Yes one 2M THP shown here.

You have THP allocated. Without commit efa7df3e3bb5 the address may not be
PMD aligned (it still could be, but that is not likely), so base pages were
allocated. To get an apples-to-apples comparison, you need to disable THP by
setting /sys/kernel/mm/transparent_hugepage/enabled to madvise or never;
then you will get base pages too (IIRC lmbench doesn't call MADV_HUGEPAGE).

The address alignment or page size may have a negative impact on your CPU's
cache and memory subsystem, for example the hw prefetcher. But I saw a
slight improvement with THP on my machine. So the behavior strongly depends
on the hardware.
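
Before rerunning, it may help to confirm which THP mode is actually active.
The sysfs file lists every mode and puts the active one in brackets, e.g.
"always [madvise] never". Below is a minimal sketch of that check; the helper
names are made up for illustration, and the base-page conclusion assumes the
benchmark never calls madvise(MADV_HUGEPAGE), which I have not verified.

```python
import re
from pathlib import Path

# Path taken from the discussion above.
THP_ENABLED = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def active_thp_mode(text):
    """Return the bracketed (active) mode from e.g. 'always [madvise] never'."""
    match = re.search(r"\[(\w+)\]", text)
    if match is None:
        raise ValueError("no bracketed mode found in %r" % text)
    return match.group(1)


def gets_base_pages(mode):
    """True if plain anonymous memory should be backed by base pages only.

    Assumption: the workload does not opt in via madvise(MADV_HUGEPAGE),
    so both 'madvise' and 'never' leave it on base pages.
    """
    return mode in ("madvise", "never")


if __name__ == "__main__":
    if THP_ENABLED.exists():
        mode = active_thp_mode(THP_ENABLED.read_text())
        print("THP mode:", mode, "| base pages only:", gets_base_pages(mode))
    else:
        print("THP sysfs file not present on this system")
```

Switching the mode itself needs root, e.g. writing "never" into the same
file, so the sketch only reads it.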
> >
> >> ShmemPmdMapped: 0 kB
> >> FilePmdMapped: 0 kB
> >> Shared_Hugetlb: 0 kB
> >> Private_Hugetlb: 0 kB
> >> Swap: 0 kB
> >> SwapPss: 0 kB
> >> Locked: 0 kB
> >> THPeligible: 1
> >> VmFlags: rd wr mr mw me ac
> >> ffff88eff000-ffff89000000 rw-p 00000000 00:00 0
> >> Size: 1028 kB
> >> KernelPageSize: 4 kB
> >> MMUPageSize: 4 kB
> >> Rss: 1028 kB
> >> Pss: 1028 kB
> >> Pss_Dirty: 1028 kB
> >> Shared_Clean: 0 kB
> >> Shared_Dirty: 0 kB
> >> Private_Clean: 0 kB
> >> Private_Dirty: 1028 kB
> >> Referenced: 1028 kB
> >> Anonymous: 1028 kB // another large anon
> >
> > This is not THP, since you only have 2M THP enabled. This will be 1M of 4K page
> > allocations + 1 4K page malloc control structure, allocated and accessed by
> > permutation() during test setup.
> >
> >> KSM: 0 kB
> >> LazyFree: 0 kB
> >> AnonHugePages: 0 kB
> >> ShmemPmdMapped: 0 kB
> >> FilePmdMapped: 0 kB
> >> Shared_Hugetlb: 0 kB
> >> Private_Hugetlb: 0 kB
> >> Swap: 0 kB
> >> SwapPss: 0 kB
> >> Locked: 0 kB
> >> THPeligible: 0
> >> VmFlags: rd wr mr mw me ac
> >>
> >> and the smap_rollup
> >>
> >> 00400000-fffff56bd000 ---p 00000000 00:00 0 [rollup]
> >> Rss: 4724 kB
> >> Pss: 3408 kB
> >> Pss_Dirty: 3338 kB
> >> Pss_Anon: 3338 kB
> >> Pss_File: 70 kB
> >> Pss_Shmem: 0 kB
> >> Shared_Clean: 1176 kB
> >> Shared_Dirty: 420 kB
> >> Private_Clean: 0 kB
> >> Private_Dirty: 3128 kB
> >> Referenced: 4344 kB
> >> Anonymous: 3548 kB
> >> KSM: 0 kB
> >> LazyFree: 0 kB
> >> AnonHugePages: 2048 kB
> >> ShmemPmdMapped: 0 kB
> >> FilePmdMapped: 0 kB
> >> Shared_Hugetlb: 0 kB
> >> Private_Hugetlb: 0 kB
> >> Swap: 0 kB
> >> SwapPss: 0 kB
> >> Locked: 0 kB
> >>
> >> 2) without efa7df3e3bb5 smaps
> >>
> >> ffff9845b000-ffffb855f000 rw-p 00000000 00:00 0
> >> Size: 525328 kB
> >
> > This is a merged-vma version of the above 2 regions.
> >
> >> KernelPageSize: 4 kB
> >> MMUPageSize: 4 kB
> >> Rss: 1128 kB
> >> Pss: 1128 kB
> >> Pss_Dirty: 1128 kB
> >> Shared_Clean: 0 kB
> >> Shared_Dirty: 0 kB
> >> Private_Clean: 0 kB
> >> Private_Dirty: 1128 kB
> >> Referenced: 1128 kB
> >> Anonymous: 1128 kB // only large anon
> >> KSM: 0 kB
> >> LazyFree: 0 kB
> >> AnonHugePages: 0 kB
> >> ShmemPmdMapped: 0 kB
> >> FilePmdMapped: 0 kB
> >> Shared_Hugetlb: 0 kB
> >> Private_Hugetlb: 0 kB
> >> Swap: 0 kB
> >> SwapPss: 0 kB
> >> Locked: 0 kB
> >> THPeligible: 1
> >> VmFlags: rd wr mr mw me ac
> >>
> >> and the smap_rollup,
> >>
> >> 00400000-ffffca5dc000 ---p 00000000 00:00 0 [rollup]
> >> Rss: 2600 kB
> >> Pss: 1472 kB
> >> Pss_Dirty: 1388 kB
> >> Pss_Anon: 1388 kB
> >> Pss_File: 84 kB
> >> Pss_Shmem: 0 kB
> >> Shared_Clean: 1000 kB
> >> Shared_Dirty: 424 kB
> >> Private_Clean: 0 kB
> >> Private_Dirty: 1176 kB
> >> Referenced: 2220 kB
> >> Anonymous: 1600 kB
> >> KSM: 0 kB
> >> LazyFree: 0 kB
> >> AnonHugePages: 0 kB
> >> ShmemPmdMapped: 0 kB
> >> FilePmdMapped: 0 kB
> >> Shared_Hugetlb: 0 kB
> >> Private_Hugetlb: 0 kB
> >> Swap: 0 kB
> >> SwapPss: 0 kB
> >> Locked: 0 kB
> >>
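
For anyone who wants to double-check the alignment and merging story from
the vma headers quoted above, here is a rough sketch of the arithmetic. It
only uses the three address ranges from the smaps output; the function names
are made up for illustration, and the 2M PMD size assumes the 4K base page
size reported by KernelPageSize.

```python
# Assumption: 4K base pages (KernelPageSize: 4 kB above), so one PMD maps 2M.
PMD_SIZE = 2 * 1024 * 1024


def parse_vma(header):
    """Return (start, end) as integers from an smaps header line."""
    start, end = (int(part, 16) for part in header.split()[0].split("-"))
    return start, end


def size_kb(vma):
    """Size of the vma in kB, comparable to the smaps 'Size:' field."""
    return (vma[1] - vma[0]) // 1024


def pmd_aligned(vma):
    """True if the vma start address is 2M (PMD) aligned."""
    return vma[0] % PMD_SIZE == 0


# Headers copied from the smaps output quoted above.
with_big = parse_vma("ffff68e00000-ffff88e03000 rw-p 00000000 00:00 0")
with_small = parse_vma("ffff88eff000-ffff89000000 rw-p 00000000 00:00 0")
without_merged = parse_vma("ffff9845b000-ffffb855f000 rw-p 00000000 00:00 0")

# With the commit, the two vmas are separated by a hole, so they cannot merge.
gap_kb = (with_small[0] - with_big[1]) // 1024

if __name__ == "__main__":
    for name, vma in (("with, 512M", with_big),
                      ("with, 1M", with_small),
                      ("without, merged", without_merged)):
        print(name, size_kb(vma), "kB, PMD aligned:", pmd_aligned(vma))
    print("gap between the two vmas with the commit:", gap_kb, "kB")
```

The computed sizes match the Size: fields, only the 512M vma with the commit
applied starts on a 2M boundary, and the merged vma is exactly the sum of the
two separate ones (524300 kB + 1028 kB = 525328 kB).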