On Fri, 8 Apr 2022 15:12:19 +0800 Huang Ying <ying.huang@xxxxxxxxx> wrote: > To optimize page placement in a memory tiering system with NUMA > balancing, the hot pages in the slow memory node need to be > identified. Essentially, the original NUMA balancing implementation > selects and promote the mostly recently accessed (MRU) pages. But the > recently accessed pages may be cold. Wait. What. How on earth could the most recently accessed page be considered "cold"? Changelog needs work, please. > So in this patchset, we > implement a new hot page identification algorithm based on the latency > between NUMA balancing page table scanning and hint page fault. That sounds reasonable. > And the hot page promotion can incur some overhead in the system. To > control the overhead a simple promotion rate limit mechanism is > implemented. That sounds like a hack to fix a problem you just added? Maybe a reasonable one.. > The hot threshold used to identify the hot pages is workload dependent > usually. So we also implemented a hot threshold automatic adjustment > algorithm. The basic idea is to increase/decrease the hot threshold > to make the number of pages that pass the hot threshold (promote > candidate) near the rate limit. hot threshold is tweakable via debugfs. So it's an unstable interface, undocumented, may be withdrawn at any time, etc. Please justify this. Is it intended that this interface be removed? If yes, please describe the plan for this in the changelog. If no then this interface should be in sysfs, it should be fully documented in the kernel tree and it should be fully changelogged (preferably with usage examples) so that reviewers can best understand what is a permanent extension to the kernel API which we must maintain for all time. Does this code build correctly if !LAST_CPUPID_NOT_IN_PAGE_FLAGS? > TODO: Add ABI document for new sysctl knob. um, yes. check_cpupid() is a poor function name. Check for what? Liver damage? A better identifier would be cpupid_valid(), perhaps. And perhaps it should be implemented in mm.h. And even documented. Oddly, the proposed changes do a decent job of documenting intra-function changes but a poor job of documenting newly added functions. Please review every new function and decide whether it is so simple and obvious that it can be added to Linux without any inline description at all. pgdat_free_space_enough() and numa_hint_fault_latency(). numa_migration_check_rate_limit() particularly. There's that "check" word again. Check for what? The numa_balancing_rate_limit_mbps sysctl. I assume "b" means "bytes". That's better than "pages", but still. Machines vary a lot in how many bytes they have and in the speed at which they process those bytes. Is there some metric which we can use here which at least partially insulates us from this variability? So that sysadmins don't need to set all their machines differently, based upon their size and speed? numa_threshold_ts is apparently in units of jiffies. This was not documented at the definition site, and it should be. But jiffies can vary between machines and configs. Why add this inter-machine and inter-config uncertainty/variability when we have include/linux/ktime.h?