On 08/05/2024 14:37, Kefeng Wang wrote:
>
>
> On 2024/5/8 16:36, Ryan Roberts wrote:
>> On 08/05/2024 08:48, Kefeng Wang wrote:
>>>
>>>
>>> On 2024/5/8 1:17, Yang Shi wrote:
>>>> On Tue, May 7, 2024 at 8:53 AM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
>>>>>
>>>>> On 07/05/2024 14:53, Kefeng Wang wrote:
>>>>>>
>>>>>>
>>>>>> On 2024/5/7 19:13, David Hildenbrand wrote:
>>>>>>>
>>>>>>>> https://github.com/intel/lmbench/blob/master/src/lat_mem_rd.c#L95
>>>>>>>>
>>>>>>>>> suggest. If you want to try something semi-randomly; it might be useful
>>>>>>>>> to rule out the arm64 contpte feature. I don't see how that would be
>>>>>>>>> interacting here if mTHP is disabled (is it?). But it's new for 6.9 and
>>>>>>>>> arm64 only. Disable with ARM64_CONTPTE (needs EXPERT) at compile time.
>>>>>>>> I haven't enabled mTHP, so it shouldn't be related to ARM64_CONTPTE,
>>>>>>>> but I will give it a try.
>>>>>>
>>>>>> With ARM64_CONTPTE disabled, memory read latency is similar to
>>>>>> ARM64_CONTPTE enabled (default 6.9-rc7), and still larger than with the
>>>>>> anon alignment patch reverted.
>>>>>
>>>>> OK, thanks for trying.
>>>>>
>>>>> Looking at the source for lmbench, it's malloc'ing (512M + 8K) up front and
>>>>> using that for all sizes. That will presumably be considered "large" by
>>>>> malloc and will be allocated using mmap. So with the patch, it will be 2M
>>>>> aligned; without it, it probably won't. I'm still struggling to understand
>>>>> why not aligning it in virtual space would make it more performant though...
>>>>
>>>> Yeah, I'm confused too.
>>> Me too. I got smaps[_rollup] for the 0.09375M size; the biggest difference
>>> for anon is shown below, and everything is attached.
>>
>> OK, a bit more insight; during initialization, the test makes 2 big malloc
>> calls; the first is 1M and the second is 512M+8K. I think those 2 are the 2
>> vmas below (malloc is adding an extra page to the allocation, presumably for
>> management structures).
>>
>> With efa7df3e3bb5 applied, the 1M allocation is placed at a non-THP-aligned
>> address. All of its pages are populated (see permutation(), which allocates
>> and writes it) but none of them are THP (obviously - it's only 1M and THP is
>> only enabled for 2M). But the 512M region is allocated at a THP-aligned
>> address, and the first page is populated with a THP (presumably faulted when
>> malloc writes to its control structure page before the application even sees
>> the allocated buffer).
>>
>> In contrast, when efa7df3e3bb5 is reverted, neither of the vmas is
>> THP-aligned, and therefore the 512M region abuts the 1M region and the vmas
>> are merged in the kernel. So we end up with the single 525328 kB region.
>> There are no THPs allocated here (due to alignment constraints), so we end up
>> with the 1M region fully populated with 4K pages as before, and only the
>> malloc control page plus the parts of the buffer that the application
>> actually touches populated in the 512M region.
>>
>> As far as I can tell, the application never touches the 1M region during the
>> test, so it should be cache-cold. It only touches the first part of the 512M
>> buffer that it needs for the size of the test (96K here?). The latency of
>> allocating the THP will have been consumed during test setup, so I doubt we
>> are seeing that in the test results, and I don't see why having a single TLB
>> entry vs 96K/4K=24 entries would make it slower.
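As a side note, the alignment question above is easy to probe directly. The following is a minimal standalone sketch (not lmbench code; the 1M and 512M+8K sizes are simply taken from the description above) that mimics the two big allocations and prints each returned pointer together with its offset into a 2M block and into its 4K page, which is essentially the information asked for further down the thread:

/*
 * Standalone sketch (not lmbench code): mimic the two big allocations made
 * during lat_mem_rd initialization and print where malloc places them, so
 * the 2M (THP) alignment of the underlying mapping and the offset into the
 * first page can be compared across kernels.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define MB (1024UL * 1024UL)

static void report(const char *name, void *p)
{
	unsigned long a = (unsigned long)(uintptr_t)p;

	printf("%-8s addr=%p  offset in 2M block=0x%lx  offset in 4K page=0x%lx\n",
	       name, p, a & (2 * MB - 1), a & (4096UL - 1));
}

int main(void)
{
	/* Sizes taken from the discussion above: 1M, then 512M + 8K. */
	void *small = malloc(1 * MB);
	void *big = malloc(512 * MB + 8192);

	if (!small || !big) {
		perror("malloc");
		return 1;
	}

	report("1M", small);
	report("512M+8K", big);

	free(big);
	free(small);
	return 0;
}

Running the same binary on kernels with and without efa7df3e3bb5 should show whether the underlying 512M+8K mapping lands 2M aligned and how far into the first page malloc places the user pointer.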
>
> It is strange, and even stranger: I tried another machine (the old machine
> has 128 cores and the new machine 96 cores, but with the same L1/L2 cache
> size per core), and the new machine does not show this issue. I will contact
> our hardware team; maybe there is some difference in configuration (prefetch
> or some other similar hardware setting). Thanks for all the suggestions and
> analysis!

No problem, you're welcome!

>
>
>>
>> It would be interesting to know the address that gets returned from malloc
>> for the 512M region, if that's possible to get (in both cases)? I guess it
>> is offset into the first page. Perhaps it is offset such that in the
>> THP-aligned case the 96K of interest ends up straddling 3 cache lines (cache
>> line is 64K I assume?), but for the unaligned case it ends up nicely packed
>> in 2?
>
> CC zuoze, please help to check this.
>
> Thanks again.
>>
>> Thanks,
>> Ryan
>>
>>>
>>> 1) with efa7df3e3bb5, smaps:
>>>
>>> ffff68e00000-ffff88e03000 rw-p 00000000 00:00 0
>>> Size: 524300 kB
>>> KernelPageSize: 4 kB
>>> MMUPageSize: 4 kB
>>> Rss: 2048 kB
>>> Pss: 2048 kB
>>> Pss_Dirty: 2048 kB
>>> Shared_Clean: 0 kB
>>> Shared_Dirty: 0 kB
>>> Private_Clean: 0 kB
>>> Private_Dirty: 2048 kB
>>> Referenced: 2048 kB
>>> Anonymous: 2048 kB      // we have 1 anon THP
>>> KSM: 0 kB
>>> LazyFree: 0 kB
>>> AnonHugePages: 2048 kB
>>
>> Yes, one 2M THP is shown here.
>>
>>> ShmemPmdMapped: 0 kB
>>> FilePmdMapped: 0 kB
>>> Shared_Hugetlb: 0 kB
>>> Private_Hugetlb: 0 kB
>>> Swap: 0 kB
>>> SwapPss: 0 kB
>>> Locked: 0 kB
>>> THPeligible: 1
>>> VmFlags: rd wr mr mw me ac
>>>
>>> ffff88eff000-ffff89000000 rw-p 00000000 00:00 0
>>> Size: 1028 kB
>>> KernelPageSize: 4 kB
>>> MMUPageSize: 4 kB
>>> Rss: 1028 kB
>>> Pss: 1028 kB
>>> Pss_Dirty: 1028 kB
>>> Shared_Clean: 0 kB
>>> Shared_Dirty: 0 kB
>>> Private_Clean: 0 kB
>>> Private_Dirty: 1028 kB
>>> Referenced: 1028 kB
>>> Anonymous: 1028 kB      // another large anon
>>
>> This is not THP, since you only have 2M THP enabled. This will be 1M of 4K
>> page allocations + 1 4K page malloc control structure, allocated and
>> accessed by permutation() during test setup.
>>
>>> KSM: 0 kB
>>> LazyFree: 0 kB
>>> AnonHugePages: 0 kB
>>> ShmemPmdMapped: 0 kB
>>> FilePmdMapped: 0 kB
>>> Shared_Hugetlb: 0 kB
>>> Private_Hugetlb: 0 kB
>>> Swap: 0 kB
>>> SwapPss: 0 kB
>>> Locked: 0 kB
>>> THPeligible: 0
>>> VmFlags: rd wr mr mw me ac
>>>
>>> and the smaps_rollup:
>>>
>>> 00400000-fffff56bd000 ---p 00000000 00:00 0 [rollup]
>>> Rss: 4724 kB
>>> Pss: 3408 kB
>>> Pss_Dirty: 3338 kB
>>> Pss_Anon: 3338 kB
>>> Pss_File: 70 kB
>>> Pss_Shmem: 0 kB
>>> Shared_Clean: 1176 kB
>>> Shared_Dirty: 420 kB
>>> Private_Clean: 0 kB
>>> Private_Dirty: 3128 kB
>>> Referenced: 4344 kB
>>> Anonymous: 3548 kB
>>> KSM: 0 kB
>>> LazyFree: 0 kB
>>> AnonHugePages: 2048 kB
>>> ShmemPmdMapped: 0 kB
>>> FilePmdMapped: 0 kB
>>> Shared_Hugetlb: 0 kB
>>> Private_Hugetlb: 0 kB
>>> Swap: 0 kB
>>> SwapPss: 0 kB
>>> Locked: 0 kB
>>>
>>> 2) without efa7df3e3bb5, smaps:
>>>
>>> ffff9845b000-ffffb855f000 rw-p 00000000 00:00 0
>>> Size: 525328 kB
>>
>> This is a merged-vma version of the above 2 regions.
>>
>>> KernelPageSize: 4 kB
>>> MMUPageSize: 4 kB
>>> Rss: 1128 kB
>>> Pss: 1128 kB
>>> Pss_Dirty: 1128 kB
>>> Shared_Clean: 0 kB
>>> Shared_Dirty: 0 kB
>>> Private_Clean: 0 kB
>>> Private_Dirty: 1128 kB
>>> Referenced: 1128 kB
>>> Anonymous: 1128 kB      // only large anon
>>> KSM: 0 kB
>>> LazyFree: 0 kB
>>> AnonHugePages: 0 kB
>>> ShmemPmdMapped: 0 kB
>>> FilePmdMapped: 0 kB
>>> Shared_Hugetlb: 0 kB
>>> Private_Hugetlb: 0 kB
>>> Swap: 0 kB
>>> SwapPss: 0 kB
>>> Locked: 0 kB
>>> THPeligible: 1
>>> VmFlags: rd wr mr mw me ac
>>>
>>> and the smaps_rollup:
>>>
>>> 00400000-ffffca5dc000 ---p 00000000 00:00 0 [rollup]
>>> Rss: 2600 kB
>>> Pss: 1472 kB
>>> Pss_Dirty: 1388 kB
>>> Pss_Anon: 1388 kB
>>> Pss_File: 84 kB
>>> Pss_Shmem: 0 kB
>>> Shared_Clean: 1000 kB
>>> Shared_Dirty: 424 kB
>>> Private_Clean: 0 kB
>>> Private_Dirty: 1176 kB
>>> Referenced: 2220 kB
>>> Anonymous: 1600 kB
>>> KSM: 0 kB
>>> LazyFree: 0 kB
>>> AnonHugePages: 0 kB
>>> ShmemPmdMapped: 0 kB
>>> FilePmdMapped: 0 kB
>>> Shared_Hugetlb: 0 kB
>>> Private_Hugetlb: 0 kB
>>> Swap: 0 kB
>>> SwapPss: 0 kB
>>> Locked: 0 kB
>>>
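For completeness, the AnonHugePages numbers quoted above can also be checked from inside the test process itself. Below is a rough sketch (assumptions: Linux with a readable /proc/self/smaps; the 512M+8K buffer size and the 96K touch size are taken from the discussion above; print_anon_huge() is just an illustrative helper name, not an existing API) that allocates the big buffer, touches the first 96K, and reports the AnonHugePages field of the VMA containing it:

/*
 * Rough sketch: allocate a 512M+8K buffer, touch the first 96K (sizes taken
 * from the discussion above), then scan /proc/self/smaps for the VMA
 * containing the buffer and print its AnonHugePages field.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static void print_anon_huge(void *ptr)
{
	unsigned long target = (unsigned long)(uintptr_t)ptr;
	unsigned long start = 0, end = 0;
	char line[512];
	int in_vma = 0;
	FILE *f = fopen("/proc/self/smaps", "r");

	if (!f) {
		perror("fopen /proc/self/smaps");
		return;
	}

	while (fgets(line, sizeof(line), f)) {
		unsigned long s, e;

		/* VMA headers look like "ffff68e00000-ffff88e03000 rw-p ..." */
		if (sscanf(line, "%lx-%lx", &s, &e) == 2) {
			in_vma = (target >= s && target < e);
			start = s;
			end = e;
		} else if (in_vma && !strncmp(line, "AnonHugePages:", 14)) {
			printf("VMA %lx-%lx containing %p: %s",
			       start, end, ptr, line);
			break;
		}
	}
	fclose(f);
}

int main(void)
{
	size_t sz = 512UL * 1024 * 1024 + 8192;
	size_t i;
	char *buf = malloc(sz);

	if (!buf) {
		perror("malloc");
		return 1;
	}

	/* Touch only the first 96K, like the small lat_mem_rd sizes discussed. */
	for (i = 0; i < 96 * 1024; i += 4096)
		buf[i] = 1;

	print_anon_huge(buf);
	free(buf);
	return 0;
}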