Yu Zhao <yuzhao@xxxxxxxxxx> writes:

> On Wed, Mar 17, 2021 at 11:37:38AM +0800, Huang, Ying wrote:
>> Yu Zhao <yuzhao@xxxxxxxxxx> writes:
>>
>> > On Tue, Mar 16, 2021 at 02:44:31PM +0800, Huang, Ying wrote:
>> > The scanning overhead is only one of the two major problems of the
>> > current page reclaim.  The other problem is the granularity of the
>> > active/inactive (sizes).  We stopped using them in making job
>> > scheduling decisions a long time ago.  I know another large internet
>> > company adopted a similar approach as ours, and I'm wondering how
>> > everybody else is coping with the discrepancy from those counters.
>>
>> Intuitively, the overhead of scanning full page tables appears higher
>> than that of scanning the rmap for a small portion of system memory.
>> But from your words, you think the reality is the reverse?  If others
>> are concerned about the overhead too, I think you will eventually need
>> to show, with more data and theory, that the overhead of page table
>> scanning isn't much higher, or is even lower.
>
> There is a misunderstanding here. I never said anything about full
> page table scanning. And this is not how it's done in this series
> either. I guess the misunderstanding has something to do with the cold
> memory tracking you are thinking about?

If my understanding is correct, the following code path in your patch
10/14,

  age_active_anon
    age_lru_gens
      try_walk_mm_list
        walk_mm_list
          walk_mm

means that, in kswapd(), the page tables of many processes may be
scanned fully.  If the number of active processes is high, the overhead
may be high too.

> This series uses page tables to discover page accesses when a system
> has run out of inactive pages. Under such a situation, the system is
> very likely to have a lot of page accesses, and using the rmap is
> likely to cost a lot more because of its poor memory locality compared
> with page tables.

That is the theory.  Can you verify it with more data, including the
CPU cycles or time spent scanning page tables?

> But, page tables can be sparse too, in terms of hot memory tracking.
> Dave has asked me to test the worst case scenario, which I'll do.
> And I'd be happy to share more data. Any specific workload you are
> interested in?

We can start with some simple workloads that are easier to reason
about.  For example (rough sketches of what I mean are appended at the
end of this mail),

1. Run a workload with hot and cold pages.  When the free memory drops
below the low watermark, kswapd will be woken up to scan and reclaim
some cold pages.  How long does that take?  It's expected that almost
all pages need to be scanned, so page table scanning should have less
overhead here; we can measure how much less.

2. Run a workload with hot and cold pages whose whole working set
cannot fit in DRAM, so that cold pages are reclaimed and swapped back
in regularly (for example, at tens of MB/s).  It's expected that fewer
pages may be scanned with the rmap, but that page table scanning is
faster per page.

3. Run a workload with hot and cold pages on an overcommitted system,
that is, some cold pages are placed in swap.  The cold pages are cold
enough that there is almost no thrashing.  Then the hot working set of
the workload changes: some hot pages become cold while some cold pages
become hot, so page reclaim and swap-in are triggered.

For each case, we can use different parameters, and we can measure
things such as the number of pages scanned, the time taken to scan
them, and the number of pages reclaimed and swapped in.
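
Here is a minimal sketch of the kind of hot/cold workload I have in
mind for the cases above.  The sizes and names are purely illustrative
(not from your series) and would need to be chosen relative to the DRAM
size and the low watermark:

  /* hotcold.c - illustrative hot/cold memory workload (a sketch only). */
  #define _GNU_SOURCE
  #include <stdio.h>
  #include <sys/mman.h>
  #include <unistd.h>

  #define HOT_BYTES   (1UL << 30)   /* 1 GiB, touched repeatedly */
  #define COLD_BYTES  (8UL << 30)   /* 8 GiB, touched once, then left cold */

  int main(void)
  {
          size_t page = sysconf(_SC_PAGESIZE);
          char *hot = mmap(NULL, HOT_BYTES, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
          char *cold = mmap(NULL, COLD_BYTES, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

          if (hot == MAP_FAILED || cold == MAP_FAILED) {
                  perror("mmap");
                  return 1;
          }

          /* Fault in the cold region once so it occupies memory ... */
          for (size_t i = 0; i < COLD_BYTES; i += page)
                  cold[i] = 1;

          /* ... then keep the hot region referenced forever. */
          for (;;)
                  for (size_t i = 0; i < HOT_BYTES; i += page)
                          hot[i]++;
  }

For case 2/3, COLD_BYTES (or the number of instances of the program)
would be sized to exceed DRAM so that reclaim and swap-in happen
continuously; for case 3, the roles of the hot and cold regions could
be swapped after a while to emulate a working-set change.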
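
For the measurements, the scan/steal/swap counts can be read before and
after a run from /proc/vmstat (the counter names below exist on recent
kernels); the time actually spent in the page table walk or in the rmap
would still need perf or ftrace on the functions involved.  Again, just
a sketch:

  /* vmstat_delta.c - print deltas of a few reclaim counters over an interval. */
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  static unsigned long long read_counter(const char *name)
  {
          char key[64];
          unsigned long long val = 0, v;
          FILE *f = fopen("/proc/vmstat", "r");

          if (!f)
                  return 0;
          while (fscanf(f, "%63s %llu", key, &v) == 2)
                  if (!strcmp(key, name))
                          val = v;
          fclose(f);
          return val;
  }

  int main(void)
  {
          static const char *names[] = { "pgscan_kswapd", "pgsteal_kswapd",
                                         "pswpin", "pswpout" };
          unsigned long long before[4], after[4];

          for (int i = 0; i < 4; i++)
                  before[i] = read_counter(names[i]);

          sleep(60);      /* let the workload run for the measured interval */

          for (int i = 0; i < 4; i++)
                  after[i] = read_counter(names[i]);

          for (int i = 0; i < 4; i++)
                  printf("%s: %llu\n", names[i], after[i] - before[i]);
          return 0;
  }

Comparing these deltas (together with the perf profiles) between the
rmap-based and the page-table-based kernels for the three cases above
should tell us how the overheads really compare.

Best Regards,
Huang, Ying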