On 2024/5/8 23:25, Yang Shi wrote:
On Wed, May 8, 2024 at 6:37 AM Kefeng Wang <wangkefeng.wang@xxxxxxxxxx> wrote:
On 2024/5/8 16:36, Ryan Roberts wrote:
On 08/05/2024 08:48, Kefeng Wang wrote:
On 2024/5/8 1:17, Yang Shi wrote:
On Tue, May 7, 2024 at 8:53 AM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
On 07/05/2024 14:53, Kefeng Wang wrote:
On 2024/5/7 19:13, David Hildenbrand wrote:
https://github.com/intel/lmbench/blob/master/src/lat_mem_rd.c#L95
suggest. If you want to try something semi-random, it might be useful to rule
out the arm64 contpte feature. I don't see how that would be interacting here if
mTHP is disabled (is it?). But it's new for 6.9 and arm64 only. Disable with
ARM64_CONTPTE (needs EXPERT) at compile time.
I haven't enabled mTHP, so it should not be related to ARM64_CONTPTE,
but I will give it a try.
With ARM64_CONTPTE disabled, memory read latency is similar to that with
ARM64_CONTPTE enabled (the 6.9-rc7 default), and still higher than with the
anon alignment commit reverted.
OK thanks for trying.
Looking at the source for lmbench, it's malloc'ing (512M + 8K) up front and using
that for all sizes. That will presumably be considered "large" by malloc and
will be allocated using mmap. So with the patch, it will be 2M aligned. Without
it, it probably won't. I'm still struggling to understand why not aligning it in
virtual space would make it more performant though...
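The alignment question above can be checked with a little arithmetic. This is a minimal sketch, assuming a 2M PMD/THP size (4K base pages); the helper names are hypothetical, and the start/end addresses are the ones from the smaps output quoted later in this thread.

```python
PMD_SIZE = 2 * 1024 * 1024  # assumption: 2M THP with 4K base pages


def is_thp_aligned(addr, size=PMD_SIZE):
    """True if addr sits on a 2M boundary."""
    return addr % size == 0


def thp_eligible_bytes(start, end, size=PMD_SIZE):
    """Bytes of [start, end) covered by fully contained, aligned 2M ranges."""
    first = (start + size - 1) // size * size  # round start up to 2M
    last = end // size * size                  # round end down to 2M
    return max(0, last - first)


# vma from the "with efa7df3e3bb5" smaps output in this thread
start, end = 0xffff68e00000, 0xffff88e03000
print(is_thp_aligned(start), thp_eligible_bytes(start, end) // (1024 * 1024))
```

A mapping whose start is off by even one 4K page loses the first 2M range, since THP can only back ranges that are both aligned and fully inside the vma.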
Yeah, I'm confused too.
Me too. I captured smaps[_rollup] for the 0.09375M size; the biggest difference
for anon is shown below, and the full output is attached.
OK, a bit more insight; during initialization, the test makes 2 big malloc
calls; the first is 1M and the second is 512M+8K. I think those 2 are the 2 vmas
below (malloc is adding an extra page to the allocation, presumably for
management structures).
With efa7df3e3bb5 applied, the 1M allocation is allocated at a non-THP-aligned
address. All of its pages are populated (see permutation() which allocates and
writes it) but none of them are THP (obviously - it's only 1M and THP is only
enabled for 2M). But the 512M region is allocated at a THP-aligned address. And
the first page is populated with a THP (presumably faulted when malloc writes to
its control structure page before the application even sees the allocated buffer).
In contrast, when efa7df3e3bb5 is reverted, neither of the vmas is THP-aligned,
and therefore the 512M region abuts the 1M region and the vmas are merged in
the kernel. So we end up with the single 525328 kB region. There are no THPs
allocated here (due to alignment constraints) so we end up with the 1M region
fully populated with 4K pages as before, and only the malloc control page plus
the parts of the buffer that the application actually touches being populated in
the 512M region.
As far as I can tell, the application never touches the 1M region during the
test so it should be cache-cold. It only touches the first part of the 512M
buffer it needs for the size of the test (96K here?). The latency of allocating
the THP will have been consumed during test setup so I doubt we are seeing that
in the test results and I don't see why having a single TLB entry vs 96K/4K=24
entries would make it slower.
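The sizes in this analysis can be cross-checked with a short calculation. A sketch only: it assumes 4K base pages and that malloc adds exactly one extra page to each large allocation, as guessed above; the variable names are illustrative.

```python
KB = 1024
PAGE = 4 * KB  # assumption: 4K base pages


def vma_kb(start, end):
    """Size of the vma [start, end) in kB, as smaps reports it."""
    return (end - start) // KB


# 512M + 8K buffer plus one assumed malloc management page
big_kb = vma_kb(0xffff68e00000, 0xffff88e03000)
# 1M buffer plus one assumed malloc management page
small_kb = (1024 * KB + PAGE) // KB
# with the commit reverted, the two vmas abut and merge
merged_kb = big_kb + small_kb
# base-page TLB entries needed to cover the 96K the test touches
entries_4k = (96 * KB) // PAGE

print(big_kb, small_kb, merged_kb, entries_4k)
```

The sum matches the single 525328 kB region seen with the commit reverted, which is consistent with the merge explanation.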
It is strange, and stranger still, I tried another machine (the old machine
has 128 cores and the new one 96, but with the same per-core L1/L2 cache
size), and the new machine does not show this issue. We will contact our
hardware team; maybe some configuration differs (prefetch or a similar
hardware setting). Thanks for all the suggestions and analysis!
Yes, the benchmark result strongly relies on cache and memory
subsystem. See the below analysis.
It would be interesting to know the address that gets returned from malloc for
the 512M region if that's possible to get (in both cases)? I guess it is offset
into the first page. Perhaps it is offset such that with the THP alignment case
the 96K of interest ends up straddling 3 cache lines (cache line is 64K I
assume?), but for the unaligned case, it ends up nicely packed in 2?
CC zuoze, please help to check this.
Thanks again.
Thanks,
Ryan
1) with efa7df3e3bb5 smaps
ffff68e00000-ffff88e03000 rw-p 00000000 00:00 0
Size: 524300 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Rss: 2048 kB
Pss: 2048 kB
Pss_Dirty: 2048 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 0 kB
Private_Dirty: 2048 kB
Referenced: 2048 kB
Anonymous: 2048 kB // we have 1 anon thp
KSM: 0 kB
LazyFree: 0 kB
AnonHugePages: 2048 kB
Yes one 2M THP shown here.
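For comparing the two cases across many vmas, the per-vma fields can be pulled out of smaps text mechanically. A minimal sketch, assuming the "Name: value kB" layout shown above and a 2M THP size; the function name and SAMPLE string are illustrative, with values copied from the output in this thread.

```python
def parse_smaps_fields(text):
    """Return {field: kB} for every 'Name: <n> kB' line in one smaps entry."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        # skip the vma header line and anything not of the form '<n> kB'
        if sep and len(parts) >= 2 and parts[1] == "kB" and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


SAMPLE = """ffff68e00000-ffff88e03000 rw-p 00000000 00:00 0
Size: 524300 kB
Rss: 2048 kB
Anonymous: 2048 kB
AnonHugePages: 2048 kB
"""

fields = parse_smaps_fields(SAMPLE)
thp_count = fields["AnonHugePages"] // 2048  # assumption: 2M THPs
print(fields["Size"], fields["Rss"], thp_count)
```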
You have THP allocated. W/o commit efa7df3e3bb5 the address may not be
PMD aligned (it still could be, but that is not very likely), so base
pages were allocated. To get an apples-to-apples comparison, you need to
disable THP by setting /sys/kernel/mm/transparent_hugepage/enabled to
madvise or never, then you will get base pages too (IIRC lmbench
doesn't call MADV_HUGEPAGE).
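Before each run it may help to record which THP mode is actually active. The sysfs file lists all modes and marks the current one with square brackets (for example "always [madvise] never"). A small sketch; the helper names are hypothetical.

```python
import re

THP_ENABLED = "/sys/kernel/mm/transparent_hugepage/enabled"


def active_thp_mode(contents):
    """Return the bracketed (active) mode from the sysfs file contents."""
    match = re.search(r"\[(\w+)\]", contents)
    return match.group(1) if match else None


def read_thp_mode(path=THP_ENABLED):
    """Read the active THP mode from sysfs (Linux only)."""
    with open(path) as f:
        return active_thp_mode(f.read())
```

With the mode set to madvise or never, an application that does not call MADV_HUGEPAGE should see only base pages, so AnonHugePages in smaps should stay at 0 kB.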
Yes, we tested with THP disabled (via sysfs) before; there was no difference
w/ or w/o efa7df3e3bb5.
The address alignment or page size may have a negative impact on your
CPU's cache and memory subsystem, for example, the hw prefetcher. But I
saw a slight improvement with THP on my machine. So the behavior
strongly depends on the hardware.
I hoped efa7df3e3bb5 would improve performance, so I backported it
into our kernel, but then found the above issue, and got the same result
when retesting with 6.9-rc7. Since different hardware shows different
results, we will test more hardware and try to contact the hardware team.
Thanks for your help.