Baolin Wang <baolin.wang@xxxxxxxxxxxxxxxxx> writes:

> On 6/14/2022 4:16 PM, Huang Ying wrote:
>> To optimize page placement in a memory tiering system with NUMA
>> balancing, the hot pages in the slow memory nodes need to be
>> identified.  Essentially, the original NUMA balancing implementation
>> selects the most recently accessed (MRU) pages to promote.  But this
>> isn't a perfect algorithm to identify the hot pages, because pages
>> with quite low access frequency may still be accessed eventually,
>> given that the NUMA balancing page table scanning period can be quite
>> long (e.g. 60 seconds).  So in this patchset, we implement a new hot
>> page identification algorithm based on the latency between NUMA
>> balancing page table scanning and the hint page fault, which is a
>> kind of most frequently accessed (MFU) algorithm.
>>
>> In NUMA balancing memory tiering mode, if there are hot pages in the
>> slow memory node and cold pages in the fast memory node, we need to
>> promote/demote hot/cold pages between the fast and slow memory nodes.
>>
>> One choice is to promote/demote as fast as possible.  But the CPU
>> cycles and memory bandwidth consumed by high promoting/demoting
>> throughput will hurt the latency of some workloads because of access
>> inflation and slow memory bandwidth contention.
>>
>> A way to resolve this issue is to restrict the max promoting/demoting
>> throughput.  It will take longer to finish the promoting/demoting,
>> but the workload latency will be better.  This is implemented in this
>> patchset as the page promotion rate limit mechanism.
>>
>> The promotion hot threshold is workload and system configuration
>> dependent.  So in this patchset, a method to adjust the hot threshold
>> automatically is implemented.  The basic idea is to control the number
>> of candidate promotion pages to match the promotion rate limit.
>>
>> We used the pmbench memory accessing benchmark to test the patchset
>> on a 2-socket server system with DRAM and PMEM installed.  The test
>> results are as follows,
>>
>>                     pmbench score      promote rate
>>                     (accesses/s)       (MB/s)
>>                     -------------      ------------
>> base                146887704.1        725.6
>> hot selection       165695601.2        544.0
>> rate limit          162814569.8        165.2
>> auto adjustment     170495294.0        136.9
>>
>> From the results above,
>>
>> With hot page selection patch [1/3], the pmbench score increases
>> about 12.8%, and the promote rate (overhead) decreases about 25.0%,
>> compared with the base kernel.
>>
>> With rate limit patch [2/3], the pmbench score decreases about 1.7%,
>> and the promote rate decreases about 69.6%, compared with the hot
>> page selection patch.
>>
>> With threshold auto adjustment patch [3/3], the pmbench score
>> increases about 4.7%, and the promote rate decreases about 17.1%,
>> compared with the rate limit patch.
>
> I did some simple testing with mysql on my machine, which contains 1 DRAM
> node (30G) and 1 PMEM node (126G).
>
> sysbench /usr/share/sysbench/oltp_read_write.lua \
> ......
> --tables=200 \
> --table-size=1000000 \
> --report-interval=10 \
> --threads=16 \
> --time=120
>
> The tps improves by about 5% according to the data below, and I think
> this is a good start to optimize the promotion.  So for this series,
> please feel free to add:
>
> Reviewed-by: Baolin Wang <baolin.wang@xxxxxxxxxxxxxxxxx>
> Tested-by: Baolin Wang <baolin.wang@xxxxxxxxxxxxxxxxx>
>
> Without this patchset:
> transactions: 2080188 (3466.48 per sec.)
>
> With this patch set:
> transactions: 2174296 (3623.40 per sec.)

Thanks a lot!

Best Regards,
Huang, Ying
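
The relative changes quoted above can be rechecked from the raw figures.
A minimal sketch in Python follows; the helper names pct_change and
compare are illustrative only and are not part of the patchset, pmbench,
or sysbench. The numbers are copied from the table and from the sysbench
output in this thread.

```python
def pct_change(old, new):
    """Relative change from old to new, in percent (positive = increase)."""
    return (new - old) / old * 100.0


# (pmbench score in accesses/s, promote rate in MB/s), copied from the table.
results = {
    "base": (146887704.1, 725.6),
    "hot selection": (165695601.2, 544.0),
    "rate limit": (162814569.8, 165.2),
    "auto adjustment": (170495294.0, 136.9),
}


def compare(prev, cur):
    """Return (score change %, promote rate change %), rounded to 0.1."""
    (score0, rate0), (score1, rate1) = results[prev], results[cur]
    return (round(pct_change(score0, score1), 1),
            round(pct_change(rate0, rate1), 1))


# Each configuration is compared with the one before it, as in the thread.
steps = [
    ("base", "hot selection"),
    ("hot selection", "rate limit"),
    ("rate limit", "auto adjustment"),
]
for prev, cur in steps:
    score, rate = compare(prev, cur)
    print(f"{cur} vs {prev}: score {score:+.1f}%, promote rate {rate:+.1f}%")

# sysbench transactions per second, without and with the patchset.
tps_change = round(pct_change(3466.48, 3623.40), 1)
print(f"sysbench tps: {tps_change:+.1f}%")
```

The three comparisons reproduce the percentages given in the cover
letter, and the sysbench figures work out to roughly +4.5% tps, which
the message above rounds to "about 5%".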