Hi Srivatsa, On 04/10/2013 05:45 AM, Srivatsa S. Bhat wrote:
[I know, this cover letter is a little too long, but I wanted to clearly explain the overall goals and the high-level design of this patchset in detail. I hope this helps more than it annoys, and makes it easier for reviewers to relate to the background and the goals of this patchset.] Overview of Memory Power Management and its implications to the Linux MM ======================================================================== Today, we are increasingly seeing computer systems sporting larger and larger amounts of RAM, in order to meet workload demands. However, memory consumes a significant amount of power, potentially upto more than a third of total system power on server systems. So naturally, memory becomes the next big target for power management - on embedded systems and smartphones, and all the way upto large server systems. Power-management capabilities in modern memory hardware: ------------------------------------------------------- Modern memory hardware such as DDR3 support a number of power management capabilities - for instance, the memory controller can automatically put
memory controller is integrated in cpu in NUMA system and mount on PCI-E in UMA, correct? How can memory controller know which memory DIMMs/banks it will control?
memory DIMMs/banks into content-preserving low-power states, if it detects that that *entire* memory DIMM/bank has not been referenced for a threshold amount of time, thus reducing the energy consumption of the memory hardware. We term these power-manageable chunks of memory as "Memory Regions". Exporting memory region info of the platform to the OS: ------------------------------------------------------ The OS needs to know about the granularity at which the hardware can perform automatic power-management of the memory banks (i.e., the address boundaries of the memory regions). On ARM platforms, the bootloader can be modified to pass on this info to the kernel via the device-tree. On x86 platforms, the new ACPI 5.0 spec has added support for exporting the power-management capabilities of the memory hardware to the OS in a standard way[5]. Estimate of power-savings from power-aware Linux MM: --------------------------------------------------- Once the firmware/bootloader exports the required info to the OS, it is upto the kernel's MM subsystem to make the best use of these capabilities and manage memory power-efficiently. It had been demonstrated on a Samsung Exynos board (with 2 GB RAM) that upto 6 percent of total system power can be saved by making the Linux kernel MM subsystem power-aware[4]. (More savings can be expected on systems with larger amounts of memory, and perhaps improved further using better MM designs).
How to know there are 6 percent of total system power can be saved by making the Linux kernel MM subsystem power-aware?
Role of the Linux MM in enhancing memory power savings: ------------------------------------------------------ Often, this simply translates to having the Linux MM understand the granularity at which RAM modules can be power-managed, and keeping the memory allocations and references consolidated to a minimum no. of these power-manageable "memory regions". It is of particular interest to note that most of these memory hardware have the intelligence to automatically save power, by putting memory banks into (content-preserving) low-power states when not referenced for a
How to know DIMM/bank is not referenced?
threshold amount of time. All that the kernel has to do, is avoid wrecking the power-savings logic by scattering its allocations and references all over the system memory. (The kernel/MM doesn't have to perform the actual power-state transitions; its mostly done in the hardware automatically, and this is OK because these are *content-preserving* low-power states). So we can summarize the goals for the Linux MM as: o Consolidate memory allocations and/or references such that they are not spread across the entire memory address space. Basically the area of memory that is not being referenced can reside in low power state. o Support light-weight targetted memory compaction/reclaim, to evacuate lightly-filled memory regions. This helps avoid memory references to those regions, thereby allowing them to reside in low power states. Assumptions and goals of this patchset: -------------------------------------- In this patchset, we don't handle the part of getting the region boundary info from the firmware/bootloader and populating it in the kernel data-structures. The aim of this patchset is to propose and brainstorm on a power-aware design of the Linux MM which can *use* the region boundary info to influence the MM at various places such as page allocation, reclamation/compaction etc, thereby contributing to memory power savings. (This patchset is very much an RFC at the moment and is not intended for mainline-inclusion yet). So, in this patchset, we assume a simple model in which each 512MB chunk of memory can be independently power-managed, and hard-code this into the patchset. As mentioned, the focus of this patchset is not so much on how we get this info from the firmware or how exactly we handle a variety of configurations, but rather on discussing the power-savings/performance impact of the MM algorithms that *act* upon this info in order to save memory power. That said, its not very far-fetched to try this out with actual region boundary info to get the actual power savings numbers. For example, on ARM platforms, we can make the bootloader export this info to the OS via device-tree and then run this patchset. (This was the method used to get the power-numbers in [4]). But even without doing that, we can very well evaluate the effectiveness of this patchset in contributing to power-savings, by analyzing the free page statistics per-memory-region; and we can observe the performance impact by running benchmarks - this is the approach currently used to evaluate this patchset. Brief overview of the design/approach used in this patchset: ----------------------------------------------------------- This patchset implements the 'Sorted-buddy design' for Memory Power Management, in which the buddy (page) allocator is altered to keep the buddy freelists region-sorted, which helps influence the page allocation paths to keep the
If this will impact normal zone based buddy freelists?
allocations consolidated to a minimum no. of memory regions. This patchset also includes a light-weight targetted compaction/reclaim algorithm that works hand-in-hand with the page-allocator, to evacuate lightly-filled memory regions when memory gets fragmented, in order to further enhance memory power savings. This Sorted-buddy design was developed based on some of the suggestions received[1] during the review of the earlier patchset on Memory Power Management written by Ankita Garg ('Hierarchy design')[2]. One of the key aspects of this Sorted-buddy design is that it avoids the zone-fragmentation problem that was present in the earlier design[3]. Design of sorted buddy allocator and light-weight targetted region compaction: ============================================================================= Sorted buddy allocator: ---------------------- In this design, the memory region boundaries are captured in a data structure parallel to zones, instead of fitting regions between nodes and zones in the hierarchy. Further, the buddy allocator is altered, such that we maintain the zones' freelists in region-sorted-order and thus do page allocation in the order of increasing memory regions. (The freelists need not be fully address-sorted, they just need to be region-sorted). The idea is to do page allocation in increasing order of memory regions (within a zone) and perform region-compaction in the reverse order, as illustrated below. ---------------------------- Increasing region number----------------------> Direction of allocation---> <---Direction of region-compaction The sorting logic (to maintain freelist pageblocks in region-sorted-order) lies in the page-free path and hence the critical page-allocation paths remain fast. Also, the sorting logic is optimized to be O(log n). Advantages of this design: -------------------------- 1. No zone-fragmentation (IOW, we don't create more zones than necessary) and hence we avoid its associated problems (like too many zones, extra kswapd activity, question of choosing watermarks etc). [This is an advantage over the 'Hierarchy' design] 2. Performance overhead is expected to be low: Since we retain the simplicity of the algorithm in the page allocation path, page allocation can potentially remain as fast as it would be without memory regions. The overhead is pushed to the page-freeing paths which are not that critical. Light-weight targetted region compaction: ---------------------------------------- Over time, due to multiple alloc()s and free()s in random order, memory gets fragmented, which means the memory allocations will no longer be consolidated to a minimum no. of memory regions. In such cases we need a light-weight mechanism to opportunistically compact memory to evacuate lightly-filled memory regions, thereby enhancing the power-savings. Noting that CMA (Contiguous Memory Allocator) does targetted compaction to achieve its goals, this patchset generalizes the targetted compaction code and reuses it to evacuate memory regions. The region evacuation is triggered by the page allocator : when it notices the first page allocation in a new region, it sets up a worker function to perform compaction and evacuate that region in the future, if possible. There are handshakes between the alloc and the free paths in the page allocator to help do this smartly, which are explained in detail in the patches. This patchset has been hosted in the below git tree. It applies cleanly on v3.9-rc5. git://github.com/srivatsabhat/linux.git mem-power-mgmt-v2 Changes in this v2: ================== * Fixed a bug in the NUMA case. * Added a new optimized O(log n) sorting algorithm to speed up region-sorting of the buddy freelists (patch 9). The efficiency of this new algorithm and its design allows us to support large amounts of RAM quite easily. * Added light-weight targetted compaction/reclaim support for memory power management (patches 10-14). * Revamped the cover-letter to better explain the idea behind memory power management and this patchset. Experimental Results: ==================== Test setup: ---------- x86 dual-socket quad core HT-enabled machine booted with mem=8G Memory region size = 512 MB Functional testing: ------------------ Ran pagetest, a simple C program that allocates and touches a required number of pages. Below is the statistics from the regions within ZONE_NORMAL, at various sizes of allocations from pagetest. Present pages | Free pages at various allocation sizes | | start | 512 MB | 1024 MB | 2048 MB | Region 0 1 | 0 | 0 | 0 | 0 | Region 1 131072 | 41537 | 13858 | 13790 | 13334 | Region 2 131072 | 131072 | 26839 | 82 | 122 | Region 3 131072 | 131072 | 131072 | 26624 | 0 | Region 4 131072 | 131072 | 131072 | 131072 | 0 | Region 5 131072 | 131072 | 131072 | 131072 | 26624 | Region 6 131072 | 131072 | 131072 | 131072 | 131072 | Region 7 131072 | 131072 | 131072 | 131072 | 131072 | Region 8 131071 | 72704 | 72704 | 72704 | 72704 | This shows that page allocation occurs in the order of increasing region numbers, as intended in this design. Performance impact: ------------------- Kernbench results didn't show any noticeable performance degradation with this patchset as compared to vanilla 3.9-rc5. Todos and ideas for enhancing the design further: ================================================ 1. Add support for making this work with sparsemem, memcg etc. 2. Mel Gorman pointed out that regular compaction algorithm would work against the sorted-buddy allocation strategy, since it creates free space at lower pfns. For now, I have not handled this because regular compaction triggers only when the memory pressure is very high, and hence memory power management is pointless in those situations. Besides, it is immaterial whether memory allocations are consolidated towards lower or higher pfns, because it saves power either way, and hence the regular compaction algorithm doesn't actually work against memory power management. 3. Add more optimizations to the targetted region compaction algorithm in order to enhance its benefits and reduce the overhead, such as: a. Migrate only active pages during region evacuation, because, strictly speaking we only want to avoid _references_ to the region. So inactive pages can be kept around, thus reducing the page-migration overhead. b. Reduce the search-space for region evacuation, by having the page-allocator note down the highest allocated pfn within that region. 4. Have stronger influence over how freepages from different migratetypes are exchanged, so that unmovable and non-reclaimable allocations are contained within least no. of memory regions. 5. Influence the refill of per-cpu pagesets and perhaps even heavily used slab caches, such that they all get their memory from least no. of memory regions. This is to avoid frequent fragmentation of memory regions. 6. Don't perform region evacuation at situations of high memory utilization. Also, never use freepages from MIGRATE_RESERVE for the purpose of region-evacuation. 7. Add more tracing/debug info to enable better evaluation of the effectiveness and benefits of this patchset over vanilla kernel. 8. Add a higher level policy to control the aggressiveness of memory power management. References: ---------- [1]. Review comments suggesting modifying the buddy allocator to be aware of memory regions: http://article.gmane.org/gmane.linux.power-management.general/24862 http://article.gmane.org/gmane.linux.power-management.general/25061 http://article.gmane.org/gmane.linux.kernel.mm/64689 [2]. Patch series that implemented the node-region-zone hierarchy design: http://lwn.net/Articles/445045/ http://thread.gmane.org/gmane.linux.kernel.mm/63840 Summary of the discussion on that patchset: http://article.gmane.org/gmane.linux.power-management.general/25061 Forward-port of that patchset to 3.7-rc3 (minimal x86 config) http://thread.gmane.org/gmane.linux.kernel.mm/89202 [3]. Disadvantages of having memory regions in the hierarchy between nodes and zones: http://article.gmane.org/gmane.linux.kernel.mm/63849 [4]. Estimate of potential power savings on Samsung exynos board http://article.gmane.org/gmane.linux.kernel.mm/65935 [5]. ACPI 5.0 and MPST support http://www.acpi.info/spec.htm Section 5.2.21 Memory Power State Table (MPST) [6]. v1 of Sorted-buddy memory power management patchset: http://thread.gmane.org/gmane.linux.power-management.general/28498 Srivatsa S. Bhat (15): mm: Introduce memory regions data-structure to capture region boundaries within nodes mm: Initialize node memory regions during boot mm: Introduce and initialize zone memory regions mm: Add helpers to retrieve node region and zone region for a given page mm: Add data-structures to describe memory regions within the zones' freelists mm: Demarcate and maintain pageblocks in region-order in the zones' freelists mm: Add an optimized version of del_from_freelist to keep page allocation fast bitops: Document the difference in indexing between fls() and __fls() mm: A new optimized O(log n) sorting algo to speed up buddy-sorting mm: Add support to accurately track per-memory-region allocation mm: Restructure the compaction part of CMA for wider use mm: Add infrastructure to evacuate memory regions using compaction mm: Implement the worker function for memory region compaction mm: Add alloc-free handshake to trigger memory region compaction mm: Print memory region statistics to understand the buddy allocator behavior arch/x86/include/asm/bitops.h | 4 include/asm-generic/bitops/__fls.h | 5 include/linux/compaction.h | 7 include/linux/gfp.h | 2 include/linux/migrate.h | 3 include/linux/mm.h | 62 ++++ include/linux/mmzone.h | 78 ++++- include/trace/events/migrate.h | 3 mm/compaction.c | 149 +++++++++ mm/internal.h | 40 ++ mm/page_alloc.c | 617 ++++++++++++++++++++++++++++++++---- mm/vmstat.c | 36 ++ 12 files changed, 935 insertions(+), 71 deletions(-) Regards, Srivatsa S. Bhat IBM Linux Technology Center -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>
-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>