On Fri, Jan 28, 2022 at 09:54:09PM +1300, Barry Song wrote:
> On Tue, Jan 25, 2022 at 7:48 PM Yu Zhao <yuzhao@xxxxxxxxxx> wrote:
> >
> > On Sun, Jan 23, 2022 at 06:43:06PM +1300, Barry Song wrote:
> > > On Wed, Jan 5, 2022 at 7:17 PM Yu Zhao <yuzhao@xxxxxxxxxx> wrote:
> > > > <snipped>
> > > >
> > > > Large-scale deployments
> > > > -----------------------
> > > > We've rolled out MGLRU to tens of millions of Chrome OS users and
> > > > about a million Android users. Google's fleetwide profiling [13] shows
> > > > an overall 40% decrease in kswapd CPU usage, in addition to
> > > >
> > > Hi Yu,
> > >
> > > Was the overall 40% decrease of kswap CPU usgae seen on x86 or arm64?
> > > And I am curious how much we are taking advantage of NONLEAF_PMD_YOUNG.
> > > Does it help a lot in decreasing the cpu usage?
> >
> > Hi Barry,
> >
> > The fleet-wide profiling data I shared was from x86. For arm64, I only
> > have data from synthetic benchmarks at the moment, and it also shows
> > similar improvements.
> >
> > For Chrome OS (individual users), walk_pte_range(), the function that
> > would benefit from ARCH_HAS_NONLEAF_PMD_YOUNG, only uses a small
> > portion (<4%) of kswapd CPU time. So ARCH_HAS_NONLEAF_PMD_YOUNG isn't
> > that helpful.
>
> Hi Yu,
> Thanks!
>
> In the current kernel, depending on reverse mapping, while memory is
> under pressure,
> the cpu usage of kswapd can be very very high especially while a lot of pages
> have large mapcount, thus a huge reverse mapping cost.

Agreed. I've posted v7 which includes kswapd profiles collected from an
arm64 v8.2 laptop under memory pressure.

> Regarding <4%, I guess the figure came from machines with NONLEAF_PMD_YOUNG?

No, it's from Snapdragon 7c. Please see the kswapd profiles in v7.

> In this case, we can skip many PTE scans while PMD has no accessed bit
> set. But for
> a machine without NONLEAF, will the figure of cpu usage be much larger?

So NONLEAF_PMD_YOUNG at most can save 4% CPU usage from kswapd.
But this definitely can vary, depending on the workloads.

> > > If so, this might be
> > > a good proof that arm64 also needs this hardware feature?
> > > In short, I am curious how much the improvement in this patchset depends
> > > on the hardware ability of NONLEAF_PMD_YOUNG.
> >
> > For data centers, I do think ARCH_HAS_NONLEAF_PMD_YOUNG has some value.
> > In addition to cold/hot memory scanning, there are other use cases like
> > dirty tracking, which can benefit from the accessed bit on non-leaf
> > entries. I know some proprietary software uses this capability on x86
> > for different purposes than this patchset does. And AFAIK, x86 is the
> > only arch that supports this capability, e.g., risc-v and ppc can only
> > set the accessed bit in PTEs.
>
> Yep. NONLEAF is a nice feature.
>
> btw, page table should have a separate DIRTY bit, right?

Yes.

> wouldn't dirty page
> tracking depend on the DIRTY bit rather than the accessed bit?

It depends on the goal.

> so x86 also has
> NONLEAF dirty bit?

No.

> Or they are scanning accessed bit of PMD before
> scanning DIRTY bits of PTEs?

A mandatory sync to disk must use the dirty bit to ensure data
integrity. But for a voluntary sync to disk, it can use the accessed
bit to narrow the search of dirty pages. A mandatory sync is used to
free specific dirty pages. A voluntary sync is used to keep the number
of dirty pages low in general and it doesn't target any specific dirty
pages.

> > In fact, I've discussed this with one of the arm maintainers Will. So
> > please check with him too if you are interested in moving forward with
> > the idea. I might be able to provide with additional data if you need
> > it to make a decision.
>
> I am interested in running it and have some data without NONLEAF
> especially while free memory is very limited and the system has memory
> thrashing.

The v7 has a switch to disable this feature on x86.
If you can run your workloads on x86, then it might be able to help you measure the difference.
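
For anyone following the thread, the PTE-scan skipping discussed above
can be sketched with a toy model. This is not the kernel's
walk_pte_range() and not code from the patchset. Every name in it
(toy_pmd, toy_touch, toy_walk, NR_PMDS) is hypothetical, and the table
sizes are assumptions chosen only for illustration. It counts how many
leaf entries a walk examines when it may trust an accessed bit on the
non-leaf entry, and how many when it may not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PTES_PER_PMD 512 /* assumption: 4 KiB pages, 8-byte entries */
#define NR_PMDS 8        /* hypothetical size of the toy address space */

/* Toy stand-in for one PMD entry and the PTE table it points to. */
struct toy_pmd {
	bool young;                   /* accessed bit on the non-leaf entry */
	bool pte_young[PTES_PER_PMD]; /* accessed bits on the leaf entries */
};

/* Model a memory access on hardware that also sets the non-leaf bit. */
static void toy_touch(struct toy_pmd *pmds, size_t pmd, size_t pte)
{
	assert(pmd < NR_PMDS && pte < PTES_PER_PMD);
	pmds[pmd].young = true;
	pmds[pmd].pte_young[pte] = true;
}

/*
 * Walk the toy table and return the number of PTEs examined.
 * With use_nonleaf_young set, a PMD whose accessed bit is clear is
 * skipped without looking at any of its PTEs.
 */
static size_t toy_walk(const struct toy_pmd *pmds, bool use_nonleaf_young,
		       size_t *young_found)
{
	size_t scanned = 0;

	*young_found = 0;
	for (size_t i = 0; i < NR_PMDS; i++) {
		if (use_nonleaf_young && !pmds[i].young)
			continue; /* nothing below this entry was accessed */
		for (size_t j = 0; j < PTES_PER_PMD; j++) {
			scanned++;
			if (pmds[i].pte_young[j])
				(*young_found)++;
		}
	}
	return scanned;
}
```

With three accesses spread over two of the eight toy PMDs, the filtered
walk examines 2 * 512 = 1024 entries and the unfiltered walk examines
all 8 * 512 = 4096, and both find the same three young PTEs. The model
only counts entries visited. It says nothing about kswapd CPU time, so
it does not reproduce the <4% figure above, which came from profiles.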