Re: [RFC -V2 3/8] autonuma, memory tiering: Use kswapd to demote cold pages to PMEM

"Huang\, Ying" <ying.huang@xxxxxxxxx> · Wed, 19 Feb 2020 14:05:09 +0800

Mel Gorman <mgorman@xxxxxxx> writes:

> On Tue, Feb 18, 2020 at 04:26:29PM +0800, Huang, Ying wrote:
>> From: Huang Ying <ying.huang@xxxxxxxxx>
>> 
>> In a memory tiering system, if the memory size of the workloads is
>> smaller than that of the faster memory (e.g. DRAM) nodes, all pages of
>> the workloads should be put in the faster memory nodes.  But this
>> makes it unnecessary to use slower memory (e.g. PMEM) at all.
>> 
>> So in common cases, the memory size of the workload should be larger
>> than that of the faster memory nodes.  And to optimize the
>> performance, the hot pages should be promoted to the faster memory
>> nodes while the cold pages should be demoted to the slower memory
>> nodes.  To achieve that, we have two choices,
>> 
>> a. Promote the hot pages from the slower memory node to the faster
>>    memory node.  This will create some memory pressure in the faster
>>    memory node, thus trigger the memory reclaiming, where the cold
>>    pages will be demoted to the slower memory node.
>> 
>> b. Demote the cold pages from faster memory node to the slower memory
>>    node.  This will create some free memory space in the faster memory
>>    node, and the hot pages in the slower memory node could be promoted
>>    to the faster memory node.
>> 
>> The choice "a" will create the memory pressure in the faster memory
>> node.  If the memory pressure of the workload is high too, the memory
>> pressure may become so high that the memory allocation latency of the
>> workload is influenced, e.g. the direct reclaiming may be triggered.
>> 
>> The choice "b" works much better at this aspect.  If the memory
>> pressure of the workload is high, it will consume the free memory and
>> the hot pages promotion will stop earlier if its allocation watermark
>> is higher than that of the normal memory allocation.
>> 
>> In this patch, choice "b" is implemented.  If memory tiering NUMA
>> balancing mode is enabled, the node isn't the slowest node, and the
>> free memory size of the node is below the high watermark, the kswapd
>> of the node will be waken up to free some memory until the free memory
>> size is above the high watermark + autonuma promotion rate limit.  If
>> the free memory size is below the high watermark, autonuma promotion
>> will stop working.  This avoids to create too much memory pressure to
>> the system.
>> 
>> Signed-off-by: "Huang, Ying" <ying.huang@xxxxxxxxx>
>
> Unfortunately I stopped reading at this point. It depends on another series
> entirely and they really need to be presented together instead of relying
> on searching mail archives to find other patches to try assemble the full
> picture :(. Ideally each stage would have supporting data showing roughly
> how it behaves at each major stage. I know this will be a pain but the
> original NUMA balancing had the same problem and ultimately started with
> one large series that got the basics right followed by other series that
> improved it in stages. That process is *still* ongoing today.

Sorry for inconvenience, we will post a new patchset including both
series and add supporting data at each major stage when possible.

Best Regards,
Huang, Ying