Hi, Andrew, Thanks a lots for your comments! On Fri, 2022-04-08 at 21:07 -0700, Andrew Morton wrote: > On Fri, 8 Apr 2022 15:12:19 +0800 Huang Ying <ying.huang@xxxxxxxxx> > wrote: > > > To optimize page placement in a memory tiering system with NUMA > > balancing, the hot pages in the slow memory node need to be > > identified. Essentially, the original NUMA balancing > > implementation > > selects and promote the mostly recently accessed (MRU) pages. But > > the > > recently accessed pages may be cold. > > Wait. What. How on earth could the most recently accessed page be > considered "cold"? > > Changelog needs work, please. Sorry for confusing. What I really want to say is, The pages with low accessing frequency should be kept in slow memory node. But these pages will still be accessed, so they may be selected and migrated to the fast memory node by the original NUMA balancing implementation upon accessing. Are the new words better? > > So in this patchset, we > > implement a new hot page identification algorithm based on the > > latency > > between NUMA balancing page table scanning and hint page fault. > > That sounds reasonable. > > > And the hot page promotion can incur some overhead in the system. > > To > > control the overhead a simple promotion rate limit mechanism is > > implemented. > > That sounds like a hack to fix a problem you just added? Maybe a > reasonable one.. Sorry again for confusing. Try to reword it as follows, If there are hot pages in slow memory node and cold pages in fast memory node, we need to promote/demote hot/cold pages between the fast and cold memory nodes. One choice is to promote/demote as fast as possible. But the CPU cycles and memory bandwidth consumed by the fast promoting/demoting may hurt the performance of the workload, e.g., the workload latency. One way to resolve this issue is to restrict the max promoting/demoting bandwidth. It will take longer to finish the promoting/demoting. But the workload latency may be better. This is implemented in this patchset as the page promotion rate limit mechanism. > > The hot threshold used to identify the hot pages is workload > > dependent > > usually. So we also implemented a hot threshold automatic > > adjustment > > algorithm. The basic idea is to increase/decrease the hot > > threshold > > to make the number of pages that pass the hot threshold (promote > > candidate) near the rate limit. > > hot threshold is tweakable via debugfs. So it's an unstable > interface, > undocumented, may be withdrawn at any time, etc. > > Please justify this. Is it intended that this interface be removed? > If yes, please describe the plan for this in the changelog. If no > then > this interface should be in sysfs, it should be fully documented in > the > kernel tree and it should be fully changelogged (preferably with > usage > examples) so that reviewers can best understand what is a permanent > extension to the kernel API which we must maintain for all time. Previously, I had added a hot threshold sysfs interface. But later, I thought that it's hard for the users to determine right hot threshold value. On the other hand, in patch [3/3], we implemented a hot threshold automatic adjustment algorithm, which make it not so important to set the hot threshold from the user space. So I moved the hot threshold setting interface to debugfs just for users to experiment. If it turns out that a formal interface is needed finally, I will add one. > Does this code build correctly if !LAST_CPUPID_NOT_IN_PAGE_FLAGS? > I have relied on 0-Day kernel build service to capture the build bug of more configurations and architectures. But I just tried i386 architcture, and found that NUMA balancing isn't supported there, so I cannot build test that for i386 too. Should I try more? > > TODO: Add ABI document for new sysctl knob. > > um, yes. > Sure. Will do that in the next version. > check_cpupid() is a poor function name. Check for what? Liver > damage? > A better identifier would be cpupid_valid(), perhaps. And perhaps it > should be implemented in mm.h. And even documented. Yes. cpupid_valid() is much better. I will add document for it in the next version. > Oddly, the proposed changes do a decent job of documenting > intra-function changes but a poor job of documenting newly added > functions. Please review every new function and decide whether it is > so simple and obvious that it can be added to Linux without any > inline > description at all. > > pgdat_free_space_enough() and numa_hint_fault_latency(). Sure. Will check all new functions and add document if necessary. > numa_migration_check_rate_limit() particularly. There's that "check" > word again. Check for what? > The original idea is to check whether the migration is ratelimited. Will change this to numa_migration_rate_limit(). It will return true if migration rate exceed the limit. > > The numa_balancing_rate_limit_mbps sysctl. I assume "b" means > "bytes". > That's better than "pages", but still. Machines vary a lot in how > many bytes they have and in the speed at which they process those > bytes. Is there some metric which we can use here which at least > partially insulates us from this variability? So that sysadmins > don't > need to set all their machines differently, based upon their size and > speed? Will change the sysctl to numb_balancing_promote_rate_limit_MBps to avoid possible confusing. The default rate limit is 64 GB/s, that is, no limitation. As Mel Gorman suggested. To reduce the possible disturbing to the workload, A rule of thumb is to set it to be 10% of the max slow memory bandwidth. But we don't know how to detect that automatically yet. > numa_threshold_ts is apparently in units of jiffies. This was not > documented at the definition site, and it should be. But jiffies can > vary between machines and configs. Why add this inter-machine and > inter-config uncertainty/variability when we have > include/linux/ktime.h? > We need to call cmpxchg() on it, so ktime_t may be not the best choice here. But I got your idea. I will use some more general time unit, e.g. milli-second. Best Regards, Huang, Ying