On Wed, Jul 3, 2024 at 7:29 PM Huan Yang <link@xxxxxxxx> wrote:
>
>
> On 2024/7/4 6:59, T.J. Mercier wrote:
> > On Tue, Jul 2, 2024 at 7:23 PM Huan Yang <link@xxxxxxxx> wrote:
> >>
> >> On 2024/7/3 3:27, Roman Gushchin wrote:
> >>> On Tue, Jul 02, 2024 at 04:44:03PM +0800, Huan Yang wrote:
> >>>> This patchset would like to discuss an idea: PMC (PER-MEMCG-CACHE).
> >>>>
> >>>> Background
> >>>> ===
> >>>>
> >>>> Modern computer systems always have performance gaps between pieces
> >>>> of hardware, such as the performance differences between CPU,
> >>>> memory, and disk. Data access follows the principle of locality of
> >>>> reference:
> >>>>
> >>>> Programs often access data that has been accessed before.
> >>>> Programs access the next piece of data after accessing a given
> >>>> piece of data.
> >>>>
> >>>> As a result:
> >>>> 1. CPU caches are used to speed up access to already-accessed data
> >>>>    in memory.
> >>>> 2. Disk prefetching techniques are used to prepare the next set of
> >>>>    data to be accessed in advance (to avoid direct disk access).
> >>>>
> >>>> This basic use of locality greatly enhances computer performance.
> >>>>
> >>>> PMC (per-MEMCG-cache) is similar: it exploits a principle of
> >>>> locality to enhance program performance.
> >>>>
> >>>> In modern computers, and especially in smartphones, services are
> >>>> provided to users on a per-application basis (such as Camera, Chat,
> >>>> etc.), where an application is composed of multiple processes
> >>>> working together to provide the service.
> >>>>
> >>>> The basic unit for managing resources in a computer is the process,
> >>>> which in turn uses threads to share memory and accomplish tasks;
> >>>> memory is shared among the threads within a process.
> >>>>
> >>>> However, modern computers have the following locality deficiencies:
> >>>>
> >>>> 1. Different forms of memory exist and are not interconnected
> >>>>    (anonymous pages, file pages, special memory such as DMA-BUF,
> >>>>    various kernel-mode allocations, etc.).
> >>>> 2. Memory is isolated between processes; apart from explicitly
> >>>>    shared memory, processes do not exchange memory with each other.
> >>>> 3. When functionality shifts within an application, one process
> >>>>    usually releases memory while another process requests it, and
> >>>>    the requesting process has to obtain that memory from the lowest
> >>>>    level through competition.
> >>>>
> >>>> Take the camera application as an example:
> >>>>
> >>>> Camera applications typically provide a photo capture service as
> >>>> well as a photo preview service.
> >>>> The photo capture process usually uses DMA-BUF to share image data
> >>>> between the CPU and DMA devices.
> >>>> For image preview, multiple algorithm processes are typically
> >>>> involved in processing the image data, which may also involve heap
> >>>> memory and other resources.
> >>>>
> >>>> During the switch between photo capture and preview, the
> >>>> application typically needs to release DMA-BUF memory, and then the
> >>>> algorithms need to allocate heap memory. The flow of system memory
> >>>> during this process is managed by the PCP-BUDDY system.
> >>>>
> >>>> However, the PCP and BUDDY pools are shared system-wide, so memory
> >>>> requested later may no longer be available because the previously
> >>>> released memory has already been used elsewhere (such as for file
> >>>> reading), requiring a competitive process (memory reclamation) to
> >>>> obtain it.
> >>>>
> >>>> So, if the released memory could be allocated with high priority
> >>>> within the same application, this would satisfy the locality
> >>>> requirement, improve performance, and avoid unnecessary memory
> >>>> reclaim.
> >>>>
> >>>> PMC is similar to PCP in this respect: both establish cache pools
> >>>> according to certain rules.
> >>>>
> >>>> Why base it on MEMCG?
> >>>> ===
> >>>>
> >>>> A MEMCG container can hold selected processes according to some
> >>>> grouping strategy (typical examples include grouping by app or by
> >>>> UID). Processes within the same MEMCG can then be subjected to
> >>>> statistics, upper-limit restrictions, and reclamation control.
> >>>>
> >>>> All processes within a MEMCG are treated as a single memory unit,
> >>>> sharing memory among themselves. As a result, when one process
> >>>> releases memory, another process within the same group can obtain
> >>>> it with the highest priority, fully exploiting the locality of
> >>>> memory allocation within the MEMCG (such as an APP grouping).
> >>>>
> >>>> In addition, MEMCG provides feature interfaces that can be toggled
> >>>> dynamically and are fully controllable by policy. This gives
> >>>> greater flexibility and has no performance impact when the feature
> >>>> is not enabled (it is controlled through a static key; a rough
> >>>> sketch of the free-path hook follows the key list below).
> >>>>
> >>>> About the PMC implementation
> >>>> ===
> >>>>
> >>>> A cache switch is provided for each MEMCG (not on root). When the
> >>>> user enables the cache, processes within the MEMCG share memory
> >>>> through this cache.
> >>>>
> >>>> The cache pool sits in front of the PCP. All order-0 pages released
> >>>> by processes in the MEMCG are released into the cache pool first,
> >>>> and allocation requests are likewise satisfied from the cache pool
> >>>> with priority.
> >>>>
> >>>> `memory.cache` is the sole entry point for controlling PMC. It
> >>>> accepts the following nested keys (a usage example follows the
> >>>> list):
> >>>> 1. "enable=[y|n]" enables or disables the targeted MEMCG's cache.
> >>>> 2. "keys=nid=%d,watermark=%u,reaper_time=%u,limit=%u" controls an
> >>>>    already enabled PMC's behavior:
> >>>>    a) `nid` targets a single node whose keys should change;
> >>>>       otherwise all nodes are affected.
> >>>>    b) `watermark` controls when caching happens: on release, pages
> >>>>       are cached only while the zone's free pages exceed the zone's
> >>>>       high watermark plus this watermark. (Unit: bytes; default
> >>>>       50MB; minimum 10MB per-node-all-zone.)
> >>>>    c) `reaper_time` controls the reaper interval; each time it
> >>>>       elapses, all cache in this MEMCG is reaped. (Unit: us;
> >>>>       default 5s; 0 disables the reaper.)
> >>>>    d) `limit` caps the maximum memory used by the cache pool.
> >>>>       (Unit: bytes; default 100MB; maximum 500MB per-node-all-zone.)
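> >>>>
> >>>> For illustration, usage could look like the following (the cgroup2
> >>>> mount point and the memcg name are only assumptions here; the key
> >>>> syntax is as described above):
> >>>>
> >>>>   # enable the cache for the app's memcg
> >>>>   echo "enable=y" > /sys/fs/cgroup/camera/memory.cache
> >>>>
> >>>>   # node 0: 64MB watermark, 10s reaper interval, 200MB cache limit
> >>>>   echo "keys=nid=0,watermark=67108864,reaper_time=10000000,limit=209715200" \
> >>>>       > /sys/fs/cgroup/camera/memory.cache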
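> >>>>
> >>>> The free-path hook has roughly the following shape. This is only a
> >>>> minimal sketch, not the patch itself: pmc_extra_watermark() and
> >>>> pmc_cache_add() are hypothetical helpers standing in for the real
> >>>> per-memcg bookkeeping.
> >>>>
> >>>>   #include <linux/jump_label.h>
> >>>>   #include <linux/memcontrol.h>
> >>>>   #include <linux/mmzone.h>
> >>>>   #include <linux/vmstat.h>
> >>>>
> >>>>   /* Off by default; enabling a memcg's cache flips the key, so the
> >>>>    * disabled case costs only a patched-out branch. */
> >>>>   DEFINE_STATIC_KEY_FALSE(pmc_enabled_key);
> >>>>
> >>>>   /*
> >>>>    * Called before an order-0 page would be freed to the PCP.
> >>>>    * Returns true if the current task's memcg absorbed the page;
> >>>>    * false means fall through to the normal PCP free.
> >>>>    */
> >>>>   static bool pmc_try_cache_page(struct page *page, struct zone *zone,
> >>>>                                  unsigned int order)
> >>>>   {
> >>>>           struct mem_cgroup *memcg;
> >>>>           bool cached = false;
> >>>>
> >>>>           if (!static_branch_unlikely(&pmc_enabled_key) || order > 0)
> >>>>                   return false;
> >>>>
> >>>>           /* Cache only while the zone stays above its high
> >>>>            * watermark plus the configured extra (here assumed to be
> >>>>            * returned in pages by the hypothetical helper). */
> >>>>           if (zone_page_state(zone, NR_FREE_PAGES) <=
> >>>>               high_wmark_pages(zone) + pmc_extra_watermark(zone))
> >>>>                   return false;
> >>>>
> >>>>           memcg = get_mem_cgroup_from_mm(current->mm);
> >>>>           if (!mem_cgroup_is_root(memcg))
> >>>>                   cached = pmc_cache_add(memcg, page);
> >>>>           mem_cgroup_put(memcg);
> >>>>           return cached;
> >>>>   }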
> >>>>
> >>>> Performance
> >>>> ===
> >>>>
> >>>> PMC is based on MEMCG and needs to be measured on complex workloads
> >>>> shared between application processes. Therefore, at the moment, we
> >>>> are unable to provide a better testing methodology for this
> >>>> patchset.
> >>>>
> >>>> Here is the internal testing data we can provide, using the camera
> >>>> application as an example (1-NODE-1-ZONE-8G-RAM):
> >>>>
> >>>> Test case: capture in rear portrait HDR mode
> >>>> 1. Test mode: rear portrait HDR mode. This scene needs more than
> >>>>    800M of RAM, with memory types including dmabuf (470M), PSS
> >>>>    (150M), and APU (200M).
> >>>> 2. Test steps: take a photo, then click the thumbnail to view the
> >>>>    full image.
> >>>>
> >>>> The overall latency from clicking the shutter button to showing the
> >>>> whole image improves by 500ms, and the total slow-path cost of all
> >>>> camera threads is reduced from 958ms to 495ms.
> >>>> Especially for shot-to-shot in this mode, the preview delay of each
> >>>> frame improves significantly.
> >>>
> >>> Hello Huan,
> >>>
> >>> thank you for sharing your work.
> >> thanks
> >>> Some high-level thoughts:
> >>> 1) Naming is hard, but it took me quite a while to realize that
> >>> you're talking
> >> Haha, sorry for my poor English.
> >>> about free memory. Cache is obviously an overloaded term, but
> >>> per-memcg-cache can mean absolutely anything (pagecache? cpu
> >>> cache? ...), so maybe it's not
> >> Currently, my idea is that all memory released by processes under a
> >> memcg goes into the `cache`, its original attributes are ignored, and
> >> it can be freely requested by processes under that memcg
> >> (so dma-buf, page cache, heap, driver memory, and so on). Maybe the
> >> name PMP would be friendlier? :)
> >>> the best choice.
> >>> 2) Overall an idea to have a per-memcg free memory pool makes sense
> >>> to me, especially if we talk 2MB or 1GB pages (or order > 0 in
> >>> general).
> >> I like it too :)
> >>> 3) You absolutely have to integrate the reclaim mechanism with a
> >>> generic memory reclaim mechanism, which is driven by the memory
> >>> pressure.
> >> Yes, I've been thinking about that too.
> >>> 4) You claim a ~50% performance win in your workload, which is a
> >>> lot. It's not clear to me where it's coming from. It's hard to
> >>> believe the page allocation/release paths are taking 50% of the cpu
> >>> time. Please, clarify.
> >> Let me describe it more specifically. In our test scenario, we have
> >> 8GB of RAM, and our camera application has a complex set of
> >> algorithms, with a peak memory requirement of up to 3GB.
> >>
> >> Therefore, with multiple applications in the background, starting the
> >> camera and taking photos creates very high memory pressure. In this
> >> scenario, any released memory is quickly used by other processes
> >> (for file pages, for example).
> >>
> >> So, during the switch from camera capture to preview, DMA-BUF memory
> >> is released while the memory for the preview algorithm is
> >> simultaneously requested.
> >>
> >> We have to take many slow-path routes to obtain enough memory for the
> >> preview algorithm, and the just-released DMA-BUF memory does not seem
> >> to help much.
> >>
> > Hi Huan,
> Hi T.J.
> >
> > I find this part surprising. Assuming the dmabuf memory doesn't first
> > go into a page pool (used for some buffers, not all) and actually does
> Actually, when PMC is enabled, we make freed pages bypass the page pool.
> > get freed synchronously with fput, this would mean it gets sucked up
> > by other supposedly background processes before it can be allocated by
> > the preview process. I thought the preview process was the one most
> > desperate for memory? You mention file pages, but where is this
> > newly-freed memory actually going if not to the preview process? My
> This was discovered through our meminfo observation program.
> When the dma-buf is released, there is a noticeable increase in cache.
>
> This may be triggered by page cache when loading the algorithm model.
>
> Additionally, the algorithm heap memory cannot benefit from the release
> of the dma-buf.
> I believe this is related to the migratetype. The stack/heap cannot
> obtain priority access to the dma-buf memory released by the kernel
> (HIGHUSER_MOVABLE).
>
> So PMC breaks that and shares every kind of memory, even if that is
> incorrect :) (If my understanding of the fragmentation issue is wrong,
> please correct me.)
>
Oh, that would make sense, but then the memory *is* going to your
preview process, just not in the form you were hoping for. So model
loading and your heap allocations were fighting for memory, probably
thrashing the file pages? I guess getting the heap allocations done
first matters more for your app's performance, and I think I can
understand how PMC would give a sort of priority to those over the
file pages during the preview transition. Ok. Sorry, I don't have an
opinion on this part yet if that's what's happening.

> > initial reaction was the same as Roman's: that the PMC should be
> > hooked up to reclaim instead of depending on the reaper. But I think
> > this might suggest that wouldn't work, because the system is under
> > such high memory pressure that reclaim would likely have emptied the
> > PMCs before the preview process could use them.
> The point you raised is indeed very likely to happen, as there is
> immense memory pressure.
> Currently, we only open the PMC when the application is in the
> foreground, and close it when it goes to the background.
> It is indeed unnecessary to drain the PMC while the application is in
> the foreground, and a longer reaper timeout would be more useful.
> (Thanks for the flexibility provided by memcg.)
> >
> > One more thing I find odd is that for this to work a significant
> > portion of your dmabuf pages would have to be order 0, but we're
> > talking about a ~500M buffer. Does whatever exports this buffer not
> > try to use higher order pages like here?
> Yes, our heap is actually configured with orders 8, 4, and 0, but in
> practical use and observation it is often hard to satisfy the
> high-order allocations, so falling back to order 0 is the most common
> case. Therefore, for our MID_ORDER allocations we use LOW_ORDER_GFP.
> Just like the testing scenario I mentioned earlier, with 8GB of RAM and
> the camera peaking at around 3GB, fragmentation at that point causes
> most of the DMA-BUF allocations to fall back to order 0.
> PMC is meant for real-world, high-load applications; I don't think it
> is very practical for regular applications.

Got it, thanks.

> Thanks
> HY
>
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/dma-buf/heaps/system_heap.c?h=v6.9#n54
> >
> > Thanks!
> > -T.J.
> >
> >> But using PMC (let's call it that for now), we are able to quickly
> >> meet the memory needs of the subsequent preview process with the
> >> just-released DMA-BUF memory, without having to go through the slow
> >> path, resulting in a significant performance improvement.
> >>
> >> (Of course, breaking the migrate type may not be good.)
> >>
> >>> There are a lot of other questions, and you highlighted some of them
> >>> below (and these are indeed the right questions to ask), but let's
> >>> start with something.
> >>>
> >>> Thanks
> >> Thanks
> >>