Hi Mike,

This code is targeted at application usage. Developers load an appropriate environment module, then compile and link (see attached man page).

As can be seen in that document, the features were initially developed to support proprietary Cray high speed networks (Gemini and Aries) and the associated PGAS and SHMEM programming models. In this respect, it can be regarded as legacy code that has not kept pace with recent developments.

It has been challenging to determine the relevance of this feature to current hardware and applications. My requests to developers and benchmark specialists for information about the benefits it provides have not revealed much specific data. There is a general impression that it would be useful for GPU applications and MPI in large HPC systems, but I suspect that it would require some very advanced knowledge of memory management for a developer to know precisely how and when to apply it.

This is the first time I have heard of the folio abstraction as the future for memory management. When you mention that future hugetlb work will be based on that concept, it seems unlikely that there would be interest in code that is not consistent with those developments. I also doubt that there would be a justification to 'update' the code to be consistent with future kernel developments.

I am therefore forming the impression that this idea may not be of interest to the Linux kernel community; however, I do not have the detailed technical depth of the development team. Do you have some more information about this folio abstraction plan?

Des

-----Original Message-----
From: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
Sent: Friday, July 22, 2022 11:12 AM
To: Albert, Des <des.albert@xxxxxxx>
Cc: songmuchun@xxxxxxxxxxxxx; linux-mm@xxxxxxxxx
Subject: Re: Additional Huge Pages

On 07/22/22 17:20, Albert, Des wrote:
> Hi
>
> I am the Product Manager for the HPE Cray Operating System (formerly
> Cray Linux Environment).
>
> One of the features of this product is a component known as additional
> huge pages. This is kernel code that enables the selection of
> 'non-standard' huge page sizes. For example, the current implementation
> allows for selection of huge page sizes of 2, 4, 8, 16, 32, 64, 128,
> 256 and 512 MB as well as 1 and 2 GB.

Interesting. Are these non-standard huge page sizes targeted at application usage, or internal kernel APIs? If applications, what API is used? Is it similar to, or the same as, hugetlb?

Within the kernel, support for 'arbitrary page sizes' is provided by the folio abstraction. hugetlb code will be moving to that in the future. Any new code such as this would be based on folios.

> We are currently evaluating the concept of providing this code to
> kernel.org. I realize that this would require dedication of technical
> resources to work with maintainers.
>
> I would like to know if there is interest in this suggestion. I realize
> that Transparent Huge Pages may be regarded as a more general approach
> to this requirement.

I guess interest would depend on the use cases and potential advantages of this feature. You should be able to speak to this based on your current usage.
--
Mike Kravetz
intro_hugepages(1)          General Commands Manual          intro_hugepages(1)

NAME
     intro_hugepages - Introduction to using huge pages

IMPLEMENTATION
     Cray Linux Environment (CLE)

DESCRIPTION
     Huge pages are virtual memory pages which are bigger than the default
     base page size of 4 Kbytes. Huge pages can improve memory performance
     for common access patterns on large data sets. Huge pages also increase
     the maximum size of data and text in a program accessible by the high
     speed network.

     Access to huge pages is provided through a virtual file system called
     hugetlbfs. Every file on this file system is backed by huge pages and is
     directly accessed with mmap() or read().

     The libhugetlbfs library allows an application to use huge pages more
     easily than it could by directly accessing the hugetlbfs filesystem. A
     user may use libhugetlbfs to back application text and data segments.

     For definitions of terms used in this man page, see Terms.
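     As a concrete illustration of direct hugetlbfs access, the following
     minimal C sketch maps one default-sized (2M) huge page from a file on a
     hugetlbfs mount. This example is generic and not part of the Cray
     toolchain; the mount point and file name are assumptions, so substitute
     the hugetlbfs mount point configured on your system.

         #include <fcntl.h>
         #include <stdio.h>
         #include <string.h>
         #include <sys/mman.h>
         #include <unistd.h>

         #define LENGTH (2UL * 1024 * 1024)   /* one 2M huge page */

         int main(void)
         {
             /* Hypothetical path: use your system's hugetlbfs mount point. */
             const char *path = "/var/lib/hugetlbfs/example";
             int fd = open(path, O_CREAT | O_RDWR, 0600);
             if (fd < 0) { perror("open"); return 1; }

             /* The mapping length must be a multiple of the huge page size. */
             char *p = mmap(NULL, LENGTH, PROT_READ | PROT_WRITE,
                            MAP_SHARED, fd, 0);
             if (p == MAP_FAILED) { perror("mmap"); return 1; }

             memset(p, 0, LENGTH);   /* touching the page faults in a huge page */

             munmap(p, LENGTH);
             close(fd);
             unlink(path);           /* remove the backing file when done */
             return 0;
         }

     On Cray systems hugetlbfs is normally already mounted (see NOTES); on
     other systems an administrator may need to mount it first.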
   Module Support
     Module files set the necessary link options and run time environment
     variables to facilitate the use of the huge page size indicated by the
     module name.

     Gemini systems: craype-hugepages128K, craype-hugepages512K,
     craype-hugepages2M, craype-hugepages8M, craype-hugepages16M,
     craype-hugepages64M.

     Aries systems: craype-hugepages2M, craype-hugepages4M,
     craype-hugepages8M, craype-hugepages16M, craype-hugepages32M,
     craype-hugepages64M, craype-hugepages128M, craype-hugepages256M,
     craype-hugepages512M, craype-hugepages1G, and craype-hugepages2G.

     To compile a Unified Parallel C application that uses 2M huge pages:

         module load PrgEnv-cray
         module load craype-hugepages2M
         cc -h upc -c array_upc.c
         cc -h upc -o array_upc.x array_upc.o

     To see the link options and run time environment variables set by these
     modules:

         module show module_name

     Note that the value of HUGETLB_DEFAULT_PAGE_SIZE varies between
     craype-hugepages modules. Also note that the name of the
     HUGETLB<size>_POST_LINK_OPTS variable varies between modules, but its
     value is the same.

         setenv HUGETLB_DEFAULT_PAGE_SIZE <size>
         setenv HUGETLB_MORECORE yes
         setenv HUGETLB_ELFMAP W
         setenv HUGETLB_FORCE_ELFMAP yes+
         setenv HUGETLB<size>_POST_LINK_OPTS "-Wl,\
         --whole-archive,-lhugetlbfs,--no-whole-archive \
         -Wl,-Ttext-segment=address,-zmax-page-size=size"

     The HUGETLB<size>_POST_LINK_OPTS value is relevant to the creation of
     the executable, while the others are run time environment variables. A
     user may choose to run an application with a different craype-hugepages
     module than was used at compile and link time. To make the most
     efficient use of available memory, use the smallest huge page size
     necessary for the application.

     The link options -Wl,-Ttext-segment=address,-zmax-page-size=size enforce
     the alignment and starting addresses of segments so that there are
     separate read-execute (text) and read-write (data and bss) segments for
     all page sizes up to the maximum of 64M for Gemini and 512M for Aries.
     This causes libhugetlbfs to avoid overlapping read-execute text with
     read-write data/bss on huge pages, which would make a segment both
     writable and executable.

     Note: The current versions of all the hugepages modules use a 512M
     alignment and max-page-size so that a statically linked executable may
     run with a variety of HUGETLB_DEFAULT_PAGE_SIZE values without having to
     relink; however, this may not be appropriate for certain situations.
     Specifically, suppose the statically linked application allocates a
     large amount of static data (greater than 2 GiB) in the form of
     initialized arrays, and the 32M hugepage module sets
     -Ttext-segment=0x20000000,-zmax-page-size=0x20000000 (512M alignment).
     The combined static memory requirement (text+data), plus the memory
     padding added by the linker for 512M alignment, may cause relocation
     addresses to exceed 4 GiB. If this occurs, the user will see "relocation
     truncated to fit" errors.

     To remedy this, select the smallest craype-hugepages module needed by
     the job, and then reset the alignment by resetting the
     HUGETLB<size>_POST_LINK_OPTS environment variable before linking. For
     example, if an 8M page size is sufficiently large for the application,
     load the craype-hugepages8M module and then set the text-segment and
     max-page-size to 8M before compiling and linking:

         module load craype-hugepages8M
         setenv HUGETLB8M_POST_LINK_OPTS "-Wl,--whole-archive,-lhugetlbfs,--no-whole-archive \
         -Wl,-Ttext-segment=0x800000,-zmax-page-size=0x800000"

     --------------------------------------------------------------
     Page Size    text-segment/max-page-size settings
     --------------------------------------------------------------
     2M           -Ttext-segment=0x200000,-zmax-page-size=0x200000
     4M           -Ttext-segment=0x400000,-zmax-page-size=0x400000
     8M           -Ttext-segment=0x800000,-zmax-page-size=0x800000
     16M          -Ttext-segment=0x1000000,-zmax-page-size=0x1000000
     --------------------------------------------------------------

     Note: The run time environment variables set by these modules are
     relevant on compute nodes, not on service nodes. If the user is running
     the application on a service node instead of a compute node, they should
     unload the hugepages module before execution.

   When to Use Huge Pages
     ·  For SHMEM applications, map the static data and/or private heap onto
        huge pages (see the sketch following this list).

     ·  For applications written in Unified Parallel C, Coarray Fortran, and
        other languages based on the PGAS programming model, map the static
        data and/or private heap onto huge pages.

     ·  For MPI applications, map the static data and/or heap onto huge
        pages.

     ·  For an application which uses shared memory that needs to be
        concurrently registered with the high speed network drivers for
        remote communication.

     ·  For an application doing heavy I/O.

     ·  To improve memory performance for common access patterns on large
        data sets.
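     Several of the items above amount to placing dynamically allocated
     memory on huge pages. As a minimal sketch, the libhugetlbfs allocation
     API can request such memory directly. This generic example assumes the
     libhugetlbfs development header is installed; it is an illustration, not
     a fragment of the Cray toolchain:

         #include <stdio.h>
         #include <hugetlbfs.h>   /* libhugetlbfs allocation API */

         int main(void)
         {
             long hpage = gethugepagesize();   /* default huge page size, bytes */
             if (hpage < 0) {
                 fprintf(stderr, "no huge page support available\n");
                 return 1;
             }

             /* Length must be a multiple of the huge page size. */
             void *buf = get_huge_pages((size_t)hpage, GHP_DEFAULT);
             if (buf == NULL) { perror("get_huge_pages"); return 1; }

             printf("got %ld bytes backed by huge pages at %p\n", hpage, buf);
             free_huge_pages(buf);
             return 0;
         }

     Link with -lhugetlbfs; per Module Support above, the craype-hugepages
     modules already add this library through HUGETLB<size>_POST_LINK_OPTS.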
   When to Avoid Using Huge Pages
     Applications sometimes consist of many steering programs in addition to
     the core application. Applying huge page behavior to all processes would
     not provide any benefit and would consume huge pages that would
     otherwise benefit the core application. See HUGETLB_RESTRICT_EXE,
     described in ENVIRONMENT VARIABLES.

ENVIRONMENT VARIABLES
     The following variables affect huge pages:

     XT_SYMMETRIC_HEAP_SIZE
          The symmetric heap always uses huge pages, regardless of whether or
          not a hugepages module is loaded.

          For PGAS applications using UPC or Coarray Fortran, if
          XT_SYMMETRIC_HEAP_SIZE is not set, the default symmetric heap per
          PE is 64M. Therefore, if a Coarray Fortran application requires
          1000M per PE and the user does not set XT_SYMMETRIC_HEAP_SIZE, one
          of the coarray allocate statements will fail to find enough memory.
          The symmetric heap is reserved at program launch and its size does
          not change.

          For applications using SHMEM, either XT_SYMMETRIC_HEAP_SIZE or
          SMA_SYMMETRIC_SIZE should be used to set the size of the symmetric
          heap. Cray XC series systems support a growable symmetric heap, so
          if XT_SYMMETRIC_HEAP_SIZE or SMA_SYMMETRIC_SIZE is not set, the
          symmetric heap grows dynamically as needed to a maximum of 2GB per
          PE. (Cray XE and Cray XK series systems do not support a growable
          symmetric heap and have no default symmetric heap size.)

          The aprun -m option does not change the size of the symmetric heap
          allocated by UPC or Fortran applications at startup. The -m option
          refers to the total amount of memory available to a PE, which
          includes all memory, not just the symmetric heap. Use the -m option
          only if necessary.

     The following variables affect libhugetlbfs:

     HUGETLB_DEFAULT_PAGE_SIZE
          Overrides the system default huge page size for all uses except the
          hugetlbfs-backed symmetric heap used by the SHMEM and PGAS
          programming models. The default huge page size is 2M. Additionally
          supported on Gemini systems: 128K, 512K, 8M, 16M, 64M. Additionally
          supported on Aries systems: 4M, 8M, 16M, 32M, 64M, 128M, 256M,
          512M, 1GB, 2GB.

     HUGETLB_ELFMAP
          Set to W to map the read-write sections (writable static data, bss)
          onto huge pages. Set to R to map the read-execute segment (text,
          read-only static data) onto huge pages. Set to RW to map both onto
          huge pages.

     HUGETLB_FORCE_ELFMAP
          If set to yes, and LD_PRELOAD contains libhugetlbfs.so, then
          libhugetlbfs maps all parts of the text, data, and bss that fall on
          huge page boundaries onto huge pages. The parts of the text, data,
          and bss sections that do not fall into whole huge pages (the
          "edges") are left on 4K pages.

          If set to yes+ (a Cray extension), then all of the text and/or data
          and bss (per the direction of HUGETLB_ELFMAP) are mapped onto huge
          pages, including the "edges". The Cray extension works for both
          static and dynamic executables and does not depend on LD_PRELOAD
          containing libhugetlbfs.so. If the read-execute and read-write
          sections overlap, a new mapping for the overlap is made with
          combined permissions (i.e. RWX). Using the link options specified
          in the craype-hugepages modules avoids this overlap.

     HUGETLB_MORECORE
          Set to yes to map the heap (including the private heap in SHMEM
          applications) onto huge pages. This enables malloc() to use memory
          backed by huge pages automatically.

     HUGETLB_RESTRICT_EXE=exe1[:exe2:exe3:...]
          Selectively enables libhugetlbfs to map only the named executables
          onto huge pages. The executables are named by the last component of
          the pathname; use a colon to separate the names of multiple
          executables. For example, if your executable is
          /lus/home/user/bin/mytest.x, specify:

              HUGETLB_RESTRICT_EXE=mytest.x

     HUGETLB_VERBOSE
          Accepts values from 0 to 99. Setting a nonzero value causes
          libhugetlbfs to print informational messages; a value of 99 prints
          all available information.
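     HUGETLB_DEFAULT_PAGE_SIZE selects a page size through libhugetlbfs. For
     comparison, upstream Linux kernels (3.8 and later) also let a program
     request a specific huge page size directly with mmap() flags, with no
     hugetlbfs file and no libhugetlbfs involvement. Whether the additional
     Cray page sizes are selectable this way is an assumption, so treat this
     as a generic sketch of the upstream mechanism:

         #define _GNU_SOURCE
         #include <stdio.h>
         #include <string.h>
         #include <sys/mman.h>

         #ifndef MAP_HUGETLB
         #define MAP_HUGETLB     0x40000
         #endif
         #ifndef MAP_HUGE_SHIFT
         #define MAP_HUGE_SHIFT  26
         #endif
         #ifndef MAP_HUGE_2MB
         #define MAP_HUGE_2MB    (21 << MAP_HUGE_SHIFT)   /* log2(2M) = 21 */
         #endif

         #define LENGTH (4UL * 1024 * 1024)               /* two 2M huge pages */

         int main(void)
         {
             /* Anonymous mapping explicitly backed by 2M huge pages. */
             void *p = mmap(NULL, LENGTH, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB |
                            MAP_HUGE_2MB, -1, 0);
             if (p == MAP_FAILED) {
                 perror("mmap");   /* e.g. no free 2M pages in the pool */
                 return 1;
             }
             memset(p, 0, LENGTH); /* touch the pages */
             munmap(p, LENGTH);
             return 0;
         }

     The requested huge pages must be available in the per-node pool at that
     size (see ISSUES); otherwise mmap() fails with ENOMEM.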
NOTES
   Gemini NIC
     There are two hardware mechanisms used by the Gemini NIC to translate
     virtual to physical memory references on Cray XE and Cray XK systems.
     GNI and DMAPP are low-level libraries which provide communication
     services to user-level software and implement a logically shared,
     distributed memory programming model.

     ·  GART is a feature of many AMD64 processors that allows the system to
        access virtually contiguous user pages that are backed by
        non-contiguous physical pages. The GART aggregates the Linux standard
        4 Kbyte pages into larger virtually contiguous memory regions. The
        contiguous pages exist in a portion of the physical address space
        known as the Graphics Aperture. The graphics aperture size is 2 GiB;
        therefore, the total memory which can be referenced through the GART
        is limited to 2 GiB per compute node.

     ·  The Memory Relocation Table (MRT) on the Gemini NIC maps the memory
        references contained in incoming network packets to physical memory
        on the local node. Memory references through the MRT map to a much
        larger address range than they do through the GART. Each NIC has its
        own MRT. MRT page sizes range from 128 Kbytes to 1 Gbyte, but all the
        entries on a given node must have the same page size. The MRT entries
        are created by kGNI in response to requests from the application,
        usually via the uGNI library. There are 16K MRT entries. The default
        MRT page size is 2 Mbytes, which maps 32 Gbytes (16K * 2M).
        HUGETLB_DEFAULT_PAGE_SIZE sets the MRT page size.

     Depending on the size of the allocated memory region and other default
     behavior, the memory registration function (of GNI/DMAPP) asks the
     kernel to create either GART entries on the AMD processor or, in the
     case of huge pages, entries in the Memory Relocation Table (MRT) on the
     NIC, to span the allocated memory region. User virtual memory that is to
     be read or written across nodes generally must first be registered on
     the node: its physical locations and extents are loaded into the Gemini
     Memory Descriptor Table (MDD) and either the Opteron GART or the Gemini
     MRT.

     Required GART address translation: Lustre I/O uses the GART. The Lustre
     Network Driver (LND) uses 1 Mbyte buffers, constructed out of smaller
     pages using the GART. DVS also uses the GART.

     Required MRT address translation: User virtual memory mapped by huge
     pages (via a hugetlbfs file system) is registered in the MRT. DMAPP
     mmaps the symmetric heap directly to the hugetlbfs file system,
     regardless of its size, if that file system is mounted, which it
     normally is on Cray XE systems. So any application using DMAPP (e.g. the
     SHMEM and PGAS programming models) uses the MRT for memory references
     within the symmetric heap. The symmetric heap always uses huge pages,
     regardless of whether a hugepages module is loaded. Note that the
     libhugetlbfs library is not used in this case: the value of
     HUGETLB_DEFAULT_PAGE_SIZE determines the page size for the symmetric
     heap, but the other HUGETLB environment variables have no effect.

     When an application's memory requirements (specifically, memory which is
     mapped through the HSN) exceed the GART aperture size (2 GiB) on a
     single node, the application must be linked with the libhugetlbfs
     library to use the larger address range available with huge pages.

   Default Behavior If Not Using craype-hugepages Modules
     If no craype-hugepages module is loaded and none of the HUGETLB
     environment variables are set, by default the symmetric heap (in the
     case of the SHMEM or PGAS programming models) is mapped onto huge pages,
     but most other memory is mapped onto base pages, which use the GART.
     Considering the 2 GiB per-node GART limit, which is shared between
     application PEs on a node, Lustre, and DVS, it is advisable to map the
     static data section and private heap onto huge pages. This can be
     selectively changed by using the proper link options and setting the
     environment variables HUGETLB_ELFMAP=W and HUGETLB_MORECORE=yes.
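     One way to verify which page size actually backs a given mapping at run
     time is the KernelPageSize field that the kernel reports in
     /proc/<pid>/smaps (present in upstream Linux since 2.6.29). The
     following sketch scans the current process's own mappings and prints any
     region backed by pages larger than the 4K base page; it is a generic
     illustration, not part of the Cray toolchain:

         #include <stdio.h>
         #include <string.h>

         int main(void)
         {
             FILE *f = fopen("/proc/self/smaps", "r");
             char line[256], region[256] = "";
             if (f == NULL) { perror("fopen"); return 1; }

             while (fgets(line, sizeof line, f)) {
                 /* Heuristic: mapping header lines contain an address
                  * range such as "7f00..-7f00.. rw-s ...". */
                 if (strchr(line, '-') && strchr(line, ' '))
                     strncpy(region, line, sizeof region - 1);

                 unsigned long kb;
                 if (sscanf(line, "KernelPageSize: %lu kB", &kb) == 1
                     && kb > 4)
                     printf("%lu kB pages: %s", kb, region);
             }
             fclose(f);
             return 0;
         }

     Called from within an application after its allocations are made, this
     shows whether the heap, static data, or symmetric heap ended up on huge
     pages.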
   Aries NIC
     In Cray systems which have the Aries NIC, the Aries I/O Memory
     Management Unit (IOMMU) provides hardware support for memory protection
     and address translation. The Aries IOMMU uses an entirely different
     memory translation mechanism than Gemini uses:

     ·  The IOMMU is divided into 16 translation context registers (TCRs).
        Each translation context (TC) supports a single page size. The TCRs
        can independently address different page sizes and present them to
        the network as a contiguous memory domain.

     ·  The TCR entries are used to set and clear the page table entries
        (PTEs) used by GNI. PTE entries are cached in Aries NIC memory in a
        page table. Up to 512 PTEs can be used by applications. 512 MiB (the
        largest huge page size) * 512 PTEs = 256 GiB of addressable memory
        per node on Aries systems.

   Other Notes on Memory Usage
     Huge pages benefit applications which have a large working set size
     (hundreds of Mbytes to many Gbytes and above), since such a working set
     would otherwise require many virtual-to-physical address translations
     using the default 4K pages. Using huge pages decreases the number of
     required address translations, which benefits application performance by
     reducing the time spent filling the TLB with translation data.

     Larger pages increase memory reach but may also exhaust available memory
     more quickly, so the optimal page size may vary from application to
     application. With huge pages, an application is still limited by the
     total memory on the node, and memory fragmentation can decrease
     available memory. See ISSUES.

     The /proc/meminfo file does not give a complete picture of huge page
     usage and is deprecated for this purpose.
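     On standard Linux kernels, per-size huge page counters are instead
     exposed under /sys/kernel/mm/hugepages. The following sketch prints the
     free and total counts for every huge page size the kernel knows about;
     whether every CLE-specific page size appears there is an assumption:

         #include <dirent.h>
         #include <stdio.h>
         #include <string.h>

         int main(void)
         {
             const char *base = "/sys/kernel/mm/hugepages";
             DIR *d = opendir(base);
             struct dirent *e;
             if (d == NULL) { perror("opendir"); return 1; }

             /* One subdirectory per page size, e.g. "hugepages-2048kB". */
             while ((e = readdir(d)) != NULL) {
                 if (strncmp(e->d_name, "hugepages-", 10) != 0)
                     continue;

                 char path[512];
                 unsigned long nr = 0, free_pages = 0;
                 FILE *f;

                 snprintf(path, sizeof path, "%s/%s/nr_hugepages",
                          base, e->d_name);
                 if ((f = fopen(path, "r")) != NULL) {
                     if (fscanf(f, "%lu", &nr) != 1) nr = 0;
                     fclose(f);
                 }

                 snprintf(path, sizeof path, "%s/%s/free_hugepages",
                          base, e->d_name);
                 if ((f = fopen(path, "r")) != NULL) {
                     if (fscanf(f, "%lu", &free_pages) != 1) free_pages = 0;
                     fclose(f);
                 }

                 printf("%s: %lu free of %lu\n", e->d_name, free_pages, nr);
             }
             closedir(d);
             return 0;
         }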
   Running Independent Software Vendor (ISV) Applications
     To enable a dynamically linked executable that was not originally linked
     with libhugetlbfs to use Cray's libhugetlbfs library at run time, first
     load a hugepages module and set the LD_PRELOAD environment variable so
     that it contains the libhugetlbfs pathname:

         module load craype-hugepages2M
         export LD_PRELOAD=/usr/lib64/libhugetlbfs.so

     If an ISV application already uses LD_PRELOAD to set dynamic library
     dependencies, use a white-space-separated list. For example:

         export LD_PRELOAD="/usr/lib64/libhugetlbfs.so /directory_name/lib.so"

     To confirm the usage of huge pages, set HUGETLB_VERBOSE to 3 or higher:

         export HUGETLB_VERBOSE=3

     Statically linked executables can only use Cray's libhugetlbfs if they
     are linked with it. Statically linked executables do not process
     LD_PRELOAD; therefore, statically linked ISV applications must be
     relinked with libhugetlbfs. See Module Support for compiling and
     linking. The nm and ldd commands are useful for determining the contents
     and dynamic dependencies of executables.

   Selective Mapping
     ISV applications sometimes consist of scripts which run several
     executables, only some of which need to run with huge pages. The
     HUGETLB_RESTRICT_EXE environment variable enables the libhugetlbfs
     library to selectively map only the named executables onto huge pages.

   Terms
     Text Segment - contains the actual instructions to be executed.

     Data Segment - contains the program's data, which is further divided
     into data, bss, and heap sections:

     ·  Data - global and static initialized data.

     ·  BSS - global and static uninitialized data.

     ·  Heap - dynamically allocated memory.

     Stack - used for local variables and stack frames.

     Symmetric Heap - contains dynamically allocated memory for a PE which is
     kept in sync by the programming model (e.g. SHMEM) with that of other
     PEs. See the intro_shmem(3) man page for additional information. The
     private heap contains dynamically allocated memory which is specific to
     a PE.

     GART - Graphics Aperture Relocation Table

     HSN - High Speed Network

     IOMMU - I/O Memory Management Unit

     ISV - Independent Software Vendor

     MRT - Memory Relocation Table

     TLB - Translation Lookaside Buffer; the memory management hardware used
     to translate virtual addresses into physical addresses.

ISSUES
     Huge pages are a per-node resource, not a per-job or per-process
     resource. There is no guarantee that the requested number of huge pages
     will be available on the compute nodes. If the memory pool becomes
     fragmented, which it can over time, the number of free blocks that are
     equal to or larger than the huge page size can drop below the number
     needed to service the request, even though there may be enough free
     memory in the pool when summing free blocks of all sizes. For this
     reason, use huge page sizes no larger than needed.

     If the heap is mapped to huge pages (by setting HUGETLB_MORECORE to
     yes), a malloc call requires that the heap be extended, and there are
     not enough free blocks in the memory pool large enough to supply the
     required number of huge pages, libhugetlbfs issues the following WARNING
     message and glibc falls back to allocating base pages:

         libhugetlbfs [nid000xx:xxxxx]: WARNING: New heap segment map at
         0x10000000 failed: Cannot allocate memory

     Since this is a warning, jobs are able to continue running after this
     message occurs. But because the allocated base pages use GART entries,
     and, as described in the NOTES section, there are a limited number of
     GART entries, future memory requests may fail altogether due to a lack
     of available GART entries.

     With craype-hugepages modules loaded, it is no longer necessary to
     include -lhugetlbfs on the link line. Doing so will result in messages
     indicating multiple definitions, such as:

         //usr/lib64/libhugetlbfs.a(elflink.o): In function
         `__libhugetlbfs_do_remap_segments':
         /usr/src/packages/BUILD/cray-libhugetlbfs-2.11/elflink.c:2012:
         multiple definition of `__libhugetlbfs_do_remap_segments'
         //usr/lib64/libhugetlbfs.a(elflink.o):/usr/src/packages/BUILD/
         cray-libhugetlbfs-2.11/elflink.c:2012: first defined here

     Adjust makefiles or build scripts accordingly.

SEE ALSO
     hugeadm(8), cc(1), CC(1), ftn(1), aprun(1), intro_mpi(3),
     intro_shmem(3), libhugetlbfs(7)

     /usr/share/doc/libhugetlbfs/HOWTO

                                 03-08-2019                 intro_hugepages(1)