[RFC PATCH 0/4] Introduce PMC(PER-MEMCG-CACHE)

This patchset would like to discuss an idea called PMC (PER-MEMCG-CACHE).

Background
===

Modern computer systems always have performance gaps between hardware
components, such as the performance differences between CPU, memory, and disk.
Data access follows the principle of locality of reference:

  1. Programs often access data that they have accessed before
  2. Programs tend to access the next piece of data after accessing a
     particular piece of data

As a result:
  1. CPU caches are used to speed up access to already-accessed data
     in memory
  2. Disk prefetching techniques are used to prepare the next set of data
     to be accessed in advance (to avoid direct disk access)

This basic use of locality greatly enhances computer performance.

PMC (per-MEMCG cache) follows the same idea, exploiting locality to improve
application performance.

In modern computers, and especially on smartphones, services are provided to
users on a per-application basis (such as Camera, Chat, etc.), where an
application is composed of multiple processes working together to provide
the service.

The basic unit of resource management in a computer is the process, which in
turn uses threads to share memory and accomplish its tasks: memory is shared
among threads within a process, but not across processes.

However, modern computers have the following locality deficiencies:

  1. Memory exists in different forms that are not interconnected (anonymous
     pages, file pages, special memory such as DMA-BUF, various kernel-mode
     allocations, etc.)
  2. Memory is isolated between processes; apart from explicitly shared
     memory, processes do not exchange memory with each other.
  3. When an application switches functionality, one process usually releases
     memory while another process requests memory, and during this transition
     the memory has to be obtained from the lowest level through competition.

Take the camera application as an example:

Camera applications typically provide photo capture services as well as photo
preview services.
The photo capture process usually utilizes DMA-BUF to facilitate the sharing
of image data between the CPU and DMA devices.
When it comes to image preview, multiple algorithm processes are typically
involved in processing the image data, which may also involve heap memory
and other resources.

During the switch between photo capture and preview, the application typically
needs to release DMA-BUF memory and then the algorithms need to allocate
heap memory. The flow of system memory during this process is managed by
the PCP-BUDDY system.

However, the PCP and buddy systems are shared system-wide, so the memory that
was just released may already have been taken for other purposes (such as
file reading) by the time it is requested again, requiring a competitive
process (memory reclamation) to obtain it.

So, if the released memory could be re-allocated with high priority within the
same application, this would meet the locality requirement, improve
performance, and avoid unnecessary memory reclaim.

The PMC approach is similar to PCP in that both establish cache pools
according to certain rules.

Why base on MEMCG?
===

MEMCG allows selected processes to be placed into a memory cgroup according to
a grouping strategy (typical examples include grouping by app or by UID).
Processes within the same MEMCG can then be used for statistics, upper-limit
restriction, and reclamation control.

All processes within a MEMCG are treated as a single memory unit, sharing
memory among themselves. As a result, when one process releases memory,
another process within the same group can obtain it with the highest
priority, fully exploiting the locality of memory allocation within the
MEMCG (such as an app group).

In addition, MEMCG provides feature interfaces that can be toggled dynamically
and are fully under policy control. This gives greater flexibility and has no
performance impact when the feature is not enabled (controlled through a
static key).


About the PMC implementation
===
A cache switch is provided for each MEMCG (not for the root).
When the user enables the cache, processes within the MEMCG share memory
through this cache.

The cache pool sits in front of the PCP lists. All order-0 pages released by
processes in the MEMCG are released to the cache pool first, and when memory
is requested it is also obtained from the cache pool with priority, as
sketched below.
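
To make the intended ordering concrete, here is a minimal sketch of the
fast-path hooks. All function names below (pmc_cache_free_page(),
pmc_cache_alloc_page() and the PCP/buddy wrappers) are hypothetical and only
show where the cache sits relative to PCP; they are not the code in the
patches:

  /* Free path: an order-0 page freed by a task in a PMC-enabled MEMCG is
   * offered to the per-MEMCG cache before it would reach the PCP list. */
  static void pmc_aware_free_page(struct page *page, unsigned int order)
  {
      if (order == 0 && pmc_cache_free_page(page))
          return;                             /* absorbed by the MEMCG cache */
      free_to_pcp_or_buddy(page, order);      /* existing fast path */
  }

  /* Alloc path: an order-0 request from the same MEMCG is served from the
   * cache first and only falls back to PCP/buddy when the cache is empty. */
  static struct page *pmc_aware_alloc_page(gfp_t gfp, unsigned int order)
  {
      struct page *page = NULL;

      if (order == 0)
          page = pmc_cache_alloc_page();
      if (!page)
          page = alloc_from_pcp_or_buddy(gfp, order);  /* existing fast path */
      return page;
  }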

`memory.cache` is the sole entry point for controlling PMC. It accepts the
following nested keys:
  1. "enable=[y|n]" enables or disables the targeted MEMCG's cache.
  2. "keys=nid=%d,watermark=%u,reaper_time=%u,limit=%u" controls an already
     enabled PMC's behavior:
    a) `nid` targets a single node whose keys should be changed; if omitted,
       all nodes are changed.
    b) `watermark` controls when caching happens: a released page is cached
       only if the zone's free pages exceed the zone's high watermark plus
       this value (see the sketch below). (unit: bytes, default 50MB,
       min 10MB per-node-all-zone)
    c) `reaper_time` controls the reaper interval; when it expires, all cache
       in this MEMCG is reaped. (unit: us, default 5s, 0 disables the reaper)
    d) `limit` caps the maximum memory used by the cache pool. (unit: bytes,
       default 100MB, max 500MB per-node-all-zone)
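
Roughly, these keys map onto per-(MEMCG, node) cache state. The struct and
helper below are only a hypothetical sketch of what each key controls (values
converted from bytes/us to pages/jiffies); they are not the layout used in the
patches:

  /* Hypothetical per-MEMCG, per-node cache state driven by memory.cache */
  struct pmc_node_cache {
      struct list_head pages;        /* cached order-0 pages */
      unsigned long    nr_pages;     /* current cache size, in pages */
      unsigned long    limit;        /* 'limit' key, in pages */
      unsigned long    watermark;    /* 'watermark' key: extra headroom above
                                      * the zone high watermark, in pages */
      unsigned long    reaper_gap;   /* 'reaper_time' key, in jiffies;
                                      * 0 means the reaper is disabled */
  };

  /* Cache a freed page only while the zone keeps enough free headroom
   * and the per-node limit has not been reached. */
  static bool pmc_may_cache(struct zone *zone, struct pmc_node_cache *pnc)
  {
      return zone_page_state(zone, NR_FREE_PAGES) >
                 high_wmark_pages(zone) + pnc->watermark &&
             pnc->nr_pages < pnc->limit;
  }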

Performance
===
PMC is based on MEMCG, and measuring its benefit requires complex workloads
whose processes share memory within an application.
Therefore, at the moment, we are unable to provide a better testing
methodology for this patchset.

Here are the internal test results we can share, using the camera
application as an example (1 node, 1 zone, 8GB RAM).

Test Case: Capture in rear portrait HDR mode
1. Test mode: rear portrait HDR mode. This scene needs more than 800MB of RAM,
   with memory types including DMA-BUF (470MB), PSS (150MB), and APU (200MB).
2. Test steps: take a photo, then click the thumbnail to view the full image.

The overall latency from clicking the shutter button to showing the whole
image improves by 500ms, and the total slowpath cost across all camera threads
is reduced from 958ms to 495ms.
Especially for shot2shot in this mode, the preview delay of each frame shows
a significant improvement.

Some questions
===
1. The current patchset ignores the migratetype, because the original
   requirement is to share memory between DMA-BUF and heap allocations.
   However, this behavior will cause serious system fragmentation,
   so is there a better solution?

2. The current patchset only supports order-0 pages and uses a reaper to
   reclaim the cache. It may be better to adapt it to drain work and
   high-order pages.

3. Actually, the internal test above placed the cache pool before PCP on the
   free path, but behind buddy on the allocation path (sketched after this
   list). That way, tasks consume common memory first, and the cache is only
   used in emergency situations, just before entering the slowpath. This
   results in better performance, but it may impact the rest of the system,
   even if the cache is enabled only during application startup. So, which
   ordering is better?

4. The current patchset is kept simple for discussion; some structures may
   need refcounts/locks to fix racy accesses.
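
Regarding question 3: with the same hypothetical helpers as in the earlier
sketch, the ordering used in the internal test keeps the free path unchanged
(cache before PCP) but moves the cache behind the normal PCP/buddy fast path
on allocation, so it is only tapped right before the slowpath:

  static struct page *pmc_aware_alloc_page_alt(gfp_t gfp, unsigned int order)
  {
      struct page *page;

      page = alloc_from_pcp_or_buddy(gfp, order);  /* common memory first */
      if (!page && order == 0)
          page = pmc_cache_alloc_page();           /* emergency fallback,
                                                    * just before the slowpath */
      return page;
  }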

Huan Yang (4):
  mm: memcg: pmc framework
  mm: memcg: pmc support change attribute
  mm: memcg: pmc: support reaper
  mm: memcg: pmc: support oom release

 include/linux/memcontrol.h |  41 ++++
 include/linux/mmzone.h     |  34 +++
 include/linux/swap.h       |   1 +
 mm/memcontrol.c            | 481 +++++++++++++++++++++++++++++++++++++
 mm/page_alloc.c            | 147 ++++++++++++
 5 files changed, 704 insertions(+)


base-commit: 727900b675b749c40ba1f6669c7ae5eb7eb8e837
-- 
2.45.2




