On 07/05/2024 14:53, Kefeng Wang wrote:
>
>
> On 2024/5/7 19:13, David Hildenbrand wrote:
>>
>>> https://github.com/intel/lmbench/blob/master/src/lat_mem_rd.c#L95
>>>
>>>> suggest. If you want to try something semi-randomly; it might be useful to rule
>>>> out the arm64 contpte feature. I don't see how that would be interacting
>>>> here if
>>>> mTHP is disabled (is it?). But its new for 6.9 and arm64 only. Disable with
>>>> ARM64_CONTPTE (needs EXPERT) at compile time.
>>> I don't enabled mTHP, so it should be not related about ARM64_CONTPTE,
>>> but will have a try.
>
> After ARM64_CONTPTE disabled, memory read latency is similar with ARM64_CONTPTE
> enabled(default 6.9-rc7), still larger than align anon reverted.

OK, thanks for trying.

Looking at the source for lmbench, it's malloc'ing (512M + 8K) up front and using
that for all sizes. That will presumably be considered "large" by malloc and will
be allocated using mmap. So with the patch, it will be 2M aligned. Without it, it
probably won't. I'm still struggling to understand why not aligning it in virtual
space would make it more performant though...

Is it possible to provide the smaps output for at least that 512M+8K block for
both cases? It might give a bit of a clue.

Do you have traditional (PMD-sized) THP enabled? If it's enabled and the buffer
is unaligned, then the front of the buffer wouldn't be mapped with THP, but if
it is aligned, it will be. That could affect it.

>
>>
>> cont-pte can get active if we're just lucky when allocating pages in the right
>> order, correct Ryan?
>>
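
In case it helps with collecting the smaps data asked for above, here is a rough sketch of how the relevant entry could be picked out and checked for 2M alignment and THP coverage. The helper names (parse_smaps, large_mappings) are made up for illustration, and PMD_SIZE assumes 4K base pages, where a PMD maps 2M. It only looks at the mapping start address and the AnonHugePages field, and the malloc'ed block may have been merged with a neighbouring VMA, so treat the output as a hint rather than a definitive answer.

```python
import re

# Assumption: 4K base pages, so one PMD maps 2M. Adjust for other page sizes.
PMD_SIZE = 2 * 1024 * 1024

# Mapping header, e.g. "aaaae0000000-aaab00002000 rw-p 00000000 00:00 0"
HEADER = re.compile(r"^([0-9a-f]+)-([0-9a-f]+)\s+(\S{4})\s")
# Per-mapping counters reported in kB, e.g. "AnonHugePages:    2048 kB"
FIELD = re.compile(r"^(\w+):\s+(\d+) kB$")


def parse_smaps(text):
    """Return one dict per mapping found in smaps-formatted text."""
    mappings = []
    for line in text.splitlines():
        header = HEADER.match(line)
        if header:
            start = int(header.group(1), 16)
            end = int(header.group(2), 16)
            mappings.append(
                {"start": start, "end": end, "perms": header.group(3), "kb": {}}
            )
            continue
        field = FIELD.match(line.strip())
        if field and mappings:
            # Attach the counter to the most recently seen mapping.
            mappings[-1]["kb"][field.group(1)] = int(field.group(2))
    return mappings


def large_mappings(mappings, min_bytes):
    """Yield a short summary for each mapping at least min_bytes long."""
    for m in mappings:
        size = m["end"] - m["start"]
        if size < min_bytes:
            continue
        yield {
            "start": m["start"],
            "size": size,
            # True when the mapping starts on a PMD (2M) boundary.
            "pmd_aligned": m["start"] % PMD_SIZE == 0,
            # Amount of the mapping currently backed by PMD-sized THP.
            "anon_huge_kb": m["kb"].get("AnonHugePages", 0),
        }
```

Feeding it the contents of /proc/<pid>/smaps for the running lat_mem_rd process, with min_bytes set to 512M, should list the candidate mapping in both the patched and reverted kernels so the alignment and AnonHugePages values can be compared.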