On Tue, May 7, 2024 at 8:53 AM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
>
> On 07/05/2024 14:53, Kefeng Wang wrote:
> >
> >
> > On 2024/5/7 19:13, David Hildenbrand wrote:
> >>
> >>> https://github.com/intel/lmbench/blob/master/src/lat_mem_rd.c#L95
> >>>
> >>>> suggest. If you want to try something semi-randomly; it might be useful to rule
> >>>> out the arm64 contpte feature. I don't see how that would be interacting
> >>>> here if
> >>>> mTHP is disabled (is it?). But its new for 6.9 and arm64 only. Disable with
> >>>> ARM64_CONTPTE (needs EXPERT) at compile time.
> >>> I don't enabled mTHP, so it should be not related about ARM64_CONTPTE,
> >>> but will have a try.
> >
> > After ARM64_CONTPTE disabled, memory read latency is similar with ARM64_CONTPTE
> > enabled(default 6.9-rc7), still larger than align anon reverted.
>
> OK thanks for trying.
>
> Looking at the source for lmbench, its malloc'ing (512M + 8K) up front and using
> that for all sizes. That will presumably be considered "large" by malloc and
> will be allocated using mmap. So with the patch, it will be 2M aligned. Without
> it, it probably won't. I'm still struggling to understand why not aligning it in
> virtual space would make it more performant though...

Yeah, I'm confused too. I just ran the same command on 6.6.13 (w/o the thp
alignment patch and mTHP stuff) and 6.9-rc4 (w/ the thp alignment patch and
all mTHP stuff) on my arm64 machine, but I didn't see such a pattern.

The results fluctuate a little. For example, 6.6.13 has better results with
4M/6M/8M, but 6.9-rc4 has better results for 12M/16M/32M/48M/64M, and the
difference can be quite noticeable. But anyway I didn't see such a
regression pattern.

The benchmark is supposed to measure cache and memory latency, so its result
depends strongly on the cache and memory subsystem, for example, the hw
prefetcher, etc.

>
> Is it possible to provide the smaps output for at least that 512M+8K block for
> both cases? It might give a bit of a clue.
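For comparing the two smaps dumps, something like the sketch below may help.
It is not from lmbench or the kernel tree. It scans smaps text for mappings
of at least 512M, then reports whether the start address is 2M aligned and
how much of the mapping is counted under AnonHugePages. The function name,
the 512M threshold and the 2M PMD size (4K base pages) are my assumptions.
The mapping is typically a little larger than the 512M+8K request, and it
may be merged with a neighbouring VMA.

```python
import re

PMD_SIZE = 2 * 1024 * 1024      # assumed PMD-sized THP (4K base pages)
MIN_SIZE = 512 * 1024 * 1024    # assumed threshold: the lmbench buffer is 512M + 8K

# First line of an smaps entry: "start-end perms offset dev inode [path]"
HEADER = re.compile(r"^([0-9a-f]+)-([0-9a-f]+)\s+\S{4}\s")
# Per-mapping counters: "Name:   <number> kB"
FIELD = re.compile(r"^(\w+):\s+(\d+) kB")


def large_mappings(smaps_text, min_size=MIN_SIZE):
    """Return one dict per mapping of at least min_size bytes."""
    results = []
    current = None
    for line in smaps_text.splitlines():
        header = HEADER.match(line)
        if header:
            start = int(header.group(1), 16)
            end = int(header.group(2), 16)
            current = None  # ignore counters of mappings that are too small
            if end - start >= min_size:
                current = {
                    "start": start,
                    "size": end - start,
                    "pmd_aligned": start % PMD_SIZE == 0,
                    "anon_huge_kb": 0,
                }
                results.append(current)
            continue
        field = FIELD.match(line)
        if field and current is not None and field.group(1) == "AnonHugePages":
            current["anon_huge_kb"] = int(field.group(2))
    return results
```

Feed it the contents of /proc/<pid>/smaps captured while lat_mem_rd is
running, once per kernel, and compare the pmd_aligned and anon_huge_kb
values.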
>
> Do you have traditional (PMD-sized) THP enabled? If its enabled and unaligned
> then the front of the buffer wouldn't be mapped with THP, but if it is aligned,
> it will. That could affect it.
>
> >
> >>
> >> cont-pte can get active if we're just lucky when allocating pages in the right
> >> order, correct Ryan?
> >>
>
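On the PMD-sized THP question quoted above: the current setting can be read
from /sys/kernel/mm/transparent_hugepage/enabled, where the active choice is
the one in square brackets. A minimal sketch for picking it out, in case the
settings from both kernels need to be recorded next to the results (the
helper name is hypothetical):

```python
import re


def selected_mode(text):
    """Return the bracketed choice from a sysfs line such as
    'always [madvise] never', or None if nothing is bracketed."""
    match = re.search(r"\[(\w+)\]", text)
    return match.group(1) if match else None
```

The same parsing should work for the per-size files on kernels with mTHP,
e.g. hugepages-2048kB/enabled in the same directory, which add an "inherit"
choice.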