On Fri, May 13, 2022 at 12:04 AM ying.huang@xxxxxxxxx
<ying.huang@xxxxxxxxx> wrote:
>
> On Thu, 2022-05-12 at 23:36 -0700, Wei Xu wrote:
> > On Thu, May 12, 2022 at 8:25 PM ying.huang@xxxxxxxxx
> > <ying.huang@xxxxxxxxx> wrote:
> > >
> > > On Wed, 2022-05-11 at 23:22 -0700, Wei Xu wrote:
> > > >
> > > > Memory Allocation for Demotion
> > > > ==============================
> > > >
> > > > To allocate a new page as the demotion target for a page, the kernel
> > > > calls the allocation function (__alloc_pages_nodemask) with the
> > > > source page node as the preferred node and the union of all lower
> > > > tier nodes as the allowed nodemask.  The actual target node selection
> > > > then follows the allocation fallback order that the kernel has
> > > > already defined.
> > > >
> > > > The pseudo code looks like:
> > > >
> > > >     targets = NODE_MASK_NONE;
> > > >     src_nid = page_to_nid(page);
> > > >     src_tier = node_tier_map[src_nid];
> > > >     for (i = src_tier + 1; i < MAX_MEMORY_TIERS; i++)
> > > >             nodes_or(targets, targets, memory_tiers[i]);
> > > >     new_page = __alloc_pages_nodemask(gfp, order, src_nid, targets);
> > > >
> > > > The mempolicy of the cpuset, vma and owner task of the source page
> > > > can be set to refine the demotion target nodemask, e.g. to prevent
> > > > demotion or to select a particular allowed node as the demotion
> > > > target.
> > >
> > > Consider a system with 3 tiers.  If we want to demote some pages from
> > > tier 0, the desired behavior is:
> > >
> > > - Allocate pages from tier 1.
> > > - If there are not enough free pages in tier 1, wake up the kswapd of
> > >   tier 1 so that it demotes some pages from tier 1 to tier 2.
> > > - If there are still not enough free pages in tier 1, allocate pages
> > >   from tier 2.
> > >
> > > In this way, tier 0 will have the hottest pages, while tier 1 will
> > > have the coldest pages.
> >
> > When we are already in the allocation path for the demotion of a page
> > from tier 0, I think we'd better not block this allocation to wait for
> > kswapd to demote pages from tier 1 to tier 2.  Instead, we should
> > directly allocate from tier 2.  Meanwhile, this demotion can wake up
> > kswapd to demote from tier 1 to tier 2 in the background.
>
> Yes.  That's what I want too.  My original words may be misleading.
>
> > > With your proposed method, the behavior when demoting from tier 0 is:
> > >
> > > - Allocate pages from tier 1.
> > > - If there are not enough free pages in tier 1, allocate pages from
> > >   tier 2.
> > >
> > > The kswapd of tier 1 will not be woken up until there are not enough
> > > free pages in tier 2.  For quite a long time, there will not be much
> > > hot/cold differentiation between tier 1 and tier 2.
> >
> > This is true with the current allocation code.  But I think we can make
> > some changes for demotion allocations.  For example, we can add a
> > GFP_DEMOTE flag and update the allocation function to wake up kswapd
> > when this flag is set and we need to fall back to another node.
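
To sketch that idea concretely (note that the __GFP_DEMOTE flag and the
fallback-time kswapd wakeup below are hypothetical, not existing kernel
API; node_tier_map[], memory_tiers[] and MAX_MEMORY_TIERS are the
structures from the pseudo code quoted above):

    /*
     * Rough sketch only.  __GFP_DEMOTE and the fallback-time kswapd
     * wakeup are hypothetical; node_tier_map[], memory_tiers[] and
     * MAX_MEMORY_TIERS come from the pseudo code above.
     */
    static struct page *alloc_demote_page(struct page *page, gfp_t gfp,
                                          unsigned int order)
    {
            nodemask_t targets = NODE_MASK_NONE;
            int src_nid = page_to_nid(page);
            int src_tier = node_tier_map[src_nid];
            int i;

            /* Allow every node in every tier below the source tier. */
            for (i = src_tier + 1; i < MAX_MEMORY_TIERS; i++)
                    nodes_or(targets, targets, memory_tiers[i]);

            /*
             * On seeing __GFP_DEMOTE, the allocator would wake the
             * kswapd of any node it falls back past, so that tier 1
             * starts demoting to tier 2 in the background instead of
             * waiting until tier 2 is also under pressure.
             */
            return __alloc_pages_nodemask(gfp | __GFP_DEMOTE, order,
                                          src_nid, &targets);
    }
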
> >
> > > This isn't hard to fix: just call __alloc_pages_nodemask() for each
> > > tier one by one, considering the page allocation fallback order.
> >
> > That would have worked, except that there is an example earlier in
> > which it is actually preferred for some nodes to demote to their
> > tier + 2, not their tier + 1.
> >
> > More specifically, the example is:
> >
> >                      20
> >    Node 0 (DRAM)  ------  Node 1 (DRAM)
> >     |    |                 |  |
> >     |    |  30        120  |  |
> >     |    v                 v  | 100
> > 100 |    Node 2 (PMEM)        |
> >     |         |               |
> >     |         | 100           |
> >      \        v               v
> >           -> Node 3 (Large Mem)
> >
> > Node distances:
> >   node   0    1    2    3
> >      0  10   20   30  100
> >      1  20   10  120  100
> >      2  30  120   10  100
> >      3 100  100  100   10
> >
> > 3 memory tiers are defined:
> >   tier 0: 0-1
> >   tier 1: 2
> >   tier 2: 3
> >
> > The demotion fallback order is:
> >   node 0: 2, 3
> >   node 1: 3, 2
> >   node 2: 3
> >   node 3: empty
> >
> > Note that even though node 3 is in tier 2 and node 2 is in tier 1,
> > node 1 (tier 0) still prefers node 3 as its first demotion target,
> > not node 2.
>
> Yes.  I understand that we need to support this use case.  We can use
> the tier order in the allocation fallback list instead of going from
> small to large.  That is, for node 1, the tier order for demotion is
> tier 2, then tier 1.

That could work, too, though I feel it might be simpler and more
efficient (no repeated calls to __alloc_pages for the same allocation)
to modify __alloc_pages() itself.

Anyway, we can discuss this more when it comes to the implementation of
this demotion allocation function.  I believe this should not affect
the general memory tiering interfaces proposed here.

> Best Regards,
> Huang, Ying
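
P.S. For comparison, here is a rough sketch of the tier-by-tier
fallback that Ying described.  The demotion_tier_order[][] table below
is made up for illustration (it would record, per node, the preferred
tier order, e.g. tier 2 before tier 1 for node 1 in the example above);
it is not an existing kernel structure.

    /*
     * Rough sketch only.  demotion_tier_order[][] is a hypothetical
     * per-node table of lower tiers in preference order.
     */
    static struct page *alloc_demote_page_tiered(struct page *page,
                                                 gfp_t gfp,
                                                 unsigned int order)
    {
            int src_nid = page_to_nid(page);
            int src_tier = node_tier_map[src_nid];
            struct page *new_page;
            int i;

            /* Try each lower tier in this node's demotion order. */
            for (i = 0; i < MAX_MEMORY_TIERS - src_tier - 1; i++) {
                    int tier = demotion_tier_order[src_nid][i];
                    nodemask_t targets = memory_tiers[tier];

                    /*
                     * The gfp flags would need tuning here, e.g.
                     * clearing __GFP_DIRECT_RECLAIM for all but the
                     * last tier, so that a nearly-full tier falls
                     * through to the next one quickly.
                     */
                    new_page = __alloc_pages_nodemask(gfp, order,
                                                      src_nid, &targets);
                    if (new_page)
                            return new_page;
            }
            return NULL;
    }

The repeated __alloc_pages_nodemask() calls in the loop are the extra
cost mentioned above; folding the tier order into __alloc_pages()
itself would avoid them.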