Hi Mike,

This code is targeted at application usage. Developers load an appropriate environment module, then compile and link (see attached man page).

As can be seen in that document, the features were initially developed to support proprietary Cray high speed networks (Gemini and Aries) and the associated PGAS and SHMEM programming models. In this respect, it can be regarded as legacy code that has not kept pace with recent developments.

It has been challenging to determine the relevance of this feature to current hardware and applications. My requests to developers and benchmark specialists for information about the benefits it provides have not revealed much specific data. There is a general impression that it would be useful for GPU applications and MPI in large HPC systems, but I suspect that it would require some very advanced knowledge of memory management for a developer to know precisely how and when to apply it.

This is the first time I have heard of the folio abstraction as the future for memory management. When you mention that future hugetlb work will be based on that concept, it seems unlikely that there would be interest in code that is not consistent with those developments. I also doubt that there would be a justification to 'update' the code to be consistent with future kernel developments.

I am therefore forming the impression that this idea may not be of interest to the Linux kernel community; however, I do not have the detailed technical depth of the development team. Do you have some more information about this folio abstraction plan?

Des

-----Original Message-----
From: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
Sent: Friday, July 22, 2022 11:12 AM
To: Albert, Des <des.albert@xxxxxxx>
Cc: songmuchun@xxxxxxxxxxxxx; linux-mm@xxxxxxxxx
Subject: Re: Additional Huge Pages

On 07/22/22 17:20, Albert, Des wrote:
> Hi
>
> I am the Product Manager for the HPE Cray Operating System (formerly
> Cray Linux Environment).
>
> One of the features of this product is a component known as additional
> huge pages. This is kernel code that enables the selection of
> 'non-standard' huge page sizes. For example, the current implementation
> allows for selection of huge page sizes of 2, 4, 8, 16, 32, 64, 128,
> 256 and 512 MB as well as 1 and 2 GB.

Interesting. Are these non-standard huge page sizes targeted at application usage, or internal kernel APIs? If applications, what API is used? Is it similar to, or the same as, hugetlb?

Within the kernel, support for 'arbitrary page sizes' is provided by the folio abstraction. hugetlb code will be moving to that in the future. Any new code such as this would be based on folios.

> We are currently evaluating the concept of providing this code to
> kernel.org. I realize that this would require dedication of technical
> resources to work with maintainers.
>
> I would like to know if there is interest in this suggestion. I realize
> that Transparent Huge Pages may be regarded as a more general approach
> to this requirement.

I guess interest would depend on the use cases and potential advantages of this feature. You should be able to speak to this based on your current usage.
--
Mike Kravetz
intro_hugepages(1)          General Commands Manual          intro_hugepages(1)

NAME
     intro_hugepages - Introduction to using huge pages

IMPLEMENTATION
     Cray Linux Environment (CLE)

DESCRIPTION
     Huge pages are virtual memory pages which are bigger than the default
     base page size of 4 Kbytes. Huge pages can improve memory performance
     for common access patterns on large data sets. Huge pages also increase
     the maximum size of data and text in a program accessible by the high
     speed network.

     Access to huge pages is provided through a virtual file system called
     hugetlbfs. Every file on this file system is backed by huge pages and is
     directly accessed with mmap() or read().

     The libhugetlbfs library allows an application to use huge pages more
     easily than it could by directly accessing the hugetlbfs filesystem. A
     user may use libhugetlbfs to back application text and data segments.

     For definitions of terms used in this man page, see Terms.
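     As a concrete illustration of direct hugetlbfs access, the following
     minimal C sketch maps one default-sized (2M) huge page from a file on a
     hugetlbfs mount. This example is generic and not part of the Cray
     toolchain; the mount point and file name are assumptions, so substitute
     the hugetlbfs mount point configured on your system.

         #include <fcntl.h>
         #include <stdio.h>
         #include <string.h>
         #include <sys/mman.h>
         #include <unistd.h>

         #define LENGTH (2UL * 1024 * 1024)   /* one 2M huge page */

         int main(void)
         {
             /* Hypothetical path: use your system's hugetlbfs mount point. */
             const char *path = "/var/lib/hugetlbfs/example";
             int fd = open(path, O_CREAT | O_RDWR, 0600);
             if (fd < 0) { perror("open"); return 1; }

             /* The mapping length must be a multiple of the huge page size. */
             char *p = mmap(NULL, LENGTH, PROT_READ | PROT_WRITE,
                            MAP_SHARED, fd, 0);
             if (p == MAP_FAILED) { perror("mmap"); return 1; }

             memset(p, 0, LENGTH);   /* touching the page faults in a huge page */

             munmap(p, LENGTH);
             close(fd);
             unlink(path);           /* remove the backing file when done */
             return 0;
         }

     On Cray systems hugetlbfs is normally already mounted (see NOTES); on
     other systems an administrator may need to mount it first.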
   Module Support
     Module files set the necessary link options and run time environment
     variables to facilitate the use of the huge page size indicated by the
     module name.

     Gemini systems: craype-hugepages128K, craype-hugepages512K,
     craype-hugepages2M, craype-hugepages8M, craype-hugepages16M,
     craype-hugepages64M.

     Aries systems: craype-hugepages2M, craype-hugepages4M,
     craype-hugepages8M, craype-hugepages16M, craype-hugepages32M,
     craype-hugepages64M, craype-hugepages128M, craype-hugepages256M,
     craype-hugepages512M, craype-hugepages1G, and craype-hugepages2G.

     To compile a Unified Parallel C application that uses 2M huge pages:

         module load PrgEnv-cray
         module load craype-hugepages2M
         cc -h upc -c array_upc.c
         cc -h upc -o array_upc.x array_upc.o

     To see the link options and run time environment variables set by these
     modules:

         module show module_name

     Note that the value of HUGETLB_DEFAULT_PAGE_SIZE varies between
     craype-hugepages modules. Also note that the name of the
     HUGETLB<size>_POST_LINK_OPTS variable varies between modules, but its
     value is the same.

         setenv HUGETLB_DEFAULT_PAGE_SIZE <size>
         setenv HUGETLB_MORECORE yes
         setenv HUGETLB_ELFMAP W
         setenv HUGETLB_FORCE_ELFMAP yes+
         setenv HUGETLB<size>_POST_LINK_OPTS "-Wl,\
         --whole-archive,-lhugetlbfs,--no-whole-archive \
         -Wl,-Ttext-segment=address,-zmax-page-size=size"

     The HUGETLB<size>_POST_LINK_OPTS value is relevant to the creation of
     the executable, while the others are run time environment variables. A
     user may choose to run an application with a different craype-hugepages
     module than was used at compile and link time. To make the most
     efficient use of available memory, use the smallest huge page size
     necessary for the application.

     The link options -Wl,-Ttext-segment=address,-zmax-page-size=size enforce
     the alignment and starting addresses of segments so that there are
     separate read-execute (text) and read-write (data and bss) segments for
     all page sizes up to the maximum of 64M for Gemini and 512M for Aries.
     This causes libhugetlbfs to avoid overlapping read-execute text with
     read-write data/bss on huge pages, which would make a segment both
     writable and executable.

     Note: The current versions of all the hugepages modules use a 512M
     alignment and max-page-size so that a statically linked executable may
     run with a variety of HUGETLB_DEFAULT_PAGE_SIZE values without having to
     relink; however, this may not be appropriate for certain situations.
     Specifically, suppose the statically linked application allocates a
     large amount of static data (greater than 2 GiB) in the form of
     initialized arrays, and the 32M hugepage module sets
     -Ttext-segment=0x20000000,-zmax-page-size=0x20000000 (512M alignment).
     The combined static memory requirement (text+data), plus the memory
     padding added by the linker for 512M alignment, may cause relocation
     addresses to exceed 4 GiB. If this occurs, the user will see "relocation
     truncated to fit" errors.

     To remedy this, select the smallest craype-hugepages module needed by
     the job, and then reset the alignment by resetting the
     HUGETLB<size>_POST_LINK_OPTS environment variable before linking. For
     example, if an 8M page size is sufficiently large for the application,
     load the craype-hugepages8M module and then set the text-segment and
     max-page-size to 8M before compiling and linking:

         module load craype-hugepages8M
         setenv HUGETLB8M_POST_LINK_OPTS "-Wl,--whole-archive,-lhugetlbfs,--no-whole-archive \
         -Wl,-Ttext-segment=0x800000,-zmax-page-size=0x800000"

     --------------------------------------------------------------
     Page Size    text-segment/max-page-size settings
     --------------------------------------------------------------
     2M           -Ttext-segment=0x200000,-zmax-page-size=0x200000
     4M           -Ttext-segment=0x400000,-zmax-page-size=0x400000
     8M           -Ttext-segment=0x800000,-zmax-page-size=0x800000
     16M          -Ttext-segment=0x1000000,-zmax-page-size=0x1000000
     --------------------------------------------------------------

     Note: The run time environment variables set by these modules are
     relevant on compute nodes, not on service nodes. If the user is running
     the application on a service node instead of a compute node, they should
     unload the hugepages module before execution.

   When to Use Huge Pages
     ·  For SHMEM applications, map the static data and/or private heap onto
        huge pages (see the sketch following this list).

     ·  For applications written in Unified Parallel C, Coarray Fortran, and
        other languages based on the PGAS programming model, map the static
        data and/or private heap onto huge pages.

     ·  For MPI applications, map the static data and/or heap onto huge
        pages.

     ·  For an application which uses shared memory that needs to be
        concurrently registered with the high speed network drivers for
        remote communication.

     ·  For an application doing heavy I/O.

     ·  To improve memory performance for common access patterns on large
        data sets.
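     Several of the items above amount to placing dynamically allocated
     memory on huge pages. As a minimal sketch, the libhugetlbfs allocation
     API can request such memory directly. This generic example assumes the
     libhugetlbfs development header is installed; it is an illustration, not
     a fragment of the Cray toolchain:

         #include <stdio.h>
         #include <hugetlbfs.h>   /* libhugetlbfs allocation API */

         int main(void)
         {
             long hpage = gethugepagesize();   /* default huge page size, bytes */
             if (hpage < 0) {
                 fprintf(stderr, "no huge page support available\n");
                 return 1;
             }

             /* Length must be a multiple of the huge page size. */
             void *buf = get_huge_pages((size_t)hpage, GHP_DEFAULT);
             if (buf == NULL) { perror("get_huge_pages"); return 1; }

             printf("got %ld bytes backed by huge pages at %p\n", hpage, buf);
             free_huge_pages(buf);
             return 0;
         }

     Link with -lhugetlbfs; per Module Support above, the craype-hugepages
     modules already add this library through HUGETLB<size>_POST_LINK_OPTS.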
   When to Avoid Using Huge Pages
     Applications sometimes consist of many steering programs in addition to
     the core application. Applying huge page behavior to all processes would
     not provide any benefit and would consume huge pages that would
     otherwise benefit the core application. See HUGETLB_RESTRICT_EXE,
     described in ENVIRONMENT VARIABLES.

ENVIRONMENT VARIABLES
     The following variables affect huge pages:

     XT_SYMMETRIC_HEAP_SIZE
          The symmetric heap always uses huge pages, regardless of whether or
          not a hugepages module is loaded.

          For PGAS applications using UPC or Coarray Fortran, if
          XT_SYMMETRIC_HEAP_SIZE is not set, the default symmetric heap per
          PE is 64M. Therefore, if a Coarray Fortran application requires
          1000M per PE and the user does not set XT_SYMMETRIC_HEAP_SIZE, one
          of the coarray allocate statements will fail to find enough memory.
          The symmetric heap is reserved at program launch and its size does
          not change.

          For applications using SHMEM, either XT_SYMMETRIC_HEAP_SIZE or
          SMA_SYMMETRIC_SIZE should be used to set the size of the symmetric
          heap. Cray XC series systems support a growable symmetric heap, so
          if XT_SYMMETRIC_HEAP_SIZE or SMA_SYMMETRIC_SIZE is not set, the
          symmetric heap grows dynamically as needed to a maximum of 2GB per
          PE. (Cray XE and Cray XK series systems do not support a growable
          symmetric heap and have no default symmetric heap size.)

          The aprun -m option does not change the size of the symmetric heap
          allocated by UPC or Fortran applications at startup. The -m option
          refers to the total amount of memory available to a PE, which
          includes all memory, not just the symmetric heap. Use the -m option
          only if necessary.

     The following variables affect libhugetlbfs:

     HUGETLB_DEFAULT_PAGE_SIZE
          Overrides the system default huge page size for all uses except the
          hugetlbfs-backed symmetric heap used by the SHMEM and PGAS
          programming models. The default huge page size is 2M. Additionally
          supported on Gemini systems: 128K, 512K, 8M, 16M, 64M. Additionally
          supported on Aries systems: 4M, 8M, 16M, 32M, 64M, 128M, 256M,
          512M, 1GB, 2GB.

     HUGETLB_ELFMAP
          Set to W to map the read-write sections (writable static data, bss)
          onto huge pages. Set to R to map the read-execute segment (text,
          read-only static data) onto huge pages. Set to RW to map both onto
          huge pages.

     HUGETLB_FORCE_ELFMAP
          If set to yes, and LD_PRELOAD contains libhugetlbfs.so, then
          libhugetlbfs maps all parts of the text, data, and bss that fall on
          huge page boundaries onto huge pages. The parts of the text, data,
          and bss sections that do not fall into whole huge pages (the
          "edges") are left on 4K pages.

          If set to yes+ (a Cray extension), then all of the text and/or data
          and bss (per the direction of HUGETLB_ELFMAP) are mapped onto huge
          pages, including the "edges". The Cray extension works for both
          static and dynamic executables and does not depend on LD_PRELOAD
          containing libhugetlbfs.so. If the read-execute and read-write
          sections overlap, a new mapping for the overlap is made with
          combined permissions (i.e. RWX). Using the link options specified
          in the craype-hugepages modules avoids this overlap.

     HUGETLB_MORECORE
          Set to yes to map the heap (including the private heap in SHMEM
          applications) onto huge pages. This enables malloc() to use memory
          backed by huge pages automatically.

     HUGETLB_RESTRICT_EXE=exe1[:exe2:exe3:...]
          Selectively enables libhugetlbfs to map only the named executables
          onto huge pages. The executables are named by the last component of
          the pathname; use a colon to separate the names of multiple
          executables. For example, if your executable is
          /lus/home/user/bin/mytest.x, specify:

              HUGETLB_RESTRICT_EXE=mytest.x

     HUGETLB_VERBOSE
          Accepts values from 0 to 99. Setting a nonzero value causes
          libhugetlbfs to print informational messages; a value of 99 prints
          all available information.
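     HUGETLB_DEFAULT_PAGE_SIZE selects a page size through libhugetlbfs. For
     comparison, upstream Linux kernels (3.8 and later) also let a program
     request a specific huge page size directly with mmap() flags, with no
     hugetlbfs file and no libhugetlbfs involvement. Whether the additional
     Cray page sizes are selectable this way is an assumption, so treat this
     as a generic sketch of the upstream mechanism:

         #define _GNU_SOURCE
         #include <stdio.h>
         #include <string.h>
         #include <sys/mman.h>

         #ifndef MAP_HUGETLB
         #define MAP_HUGETLB     0x40000
         #endif
         #ifndef MAP_HUGE_SHIFT
         #define MAP_HUGE_SHIFT  26
         #endif
         #ifndef MAP_HUGE_2MB
         #define MAP_HUGE_2MB    (21 << MAP_HUGE_SHIFT)   /* log2(2M) = 21 */
         #endif

         #define LENGTH (4UL * 1024 * 1024)               /* two 2M huge pages */

         int main(void)
         {
             /* Anonymous mapping explicitly backed by 2M huge pages. */
             void *p = mmap(NULL, LENGTH, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB |
                            MAP_HUGE_2MB, -1, 0);
             if (p == MAP_FAILED) {
                 perror("mmap");   /* e.g. no free 2M pages in the pool */
                 return 1;
             }
             memset(p, 0, LENGTH); /* touch the pages */
             munmap(p, LENGTH);
             return 0;
         }

     The requested huge pages must be available in the per-node pool at that
     size (see ISSUES); otherwise mmap() fails with ENOMEM.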
NOTES
   Gemini NIC
     There are two hardware mechanisms used by the Gemini NIC to translate
     virtual to physical memory references on Cray XE and Cray XK systems.
     GNI and DMAPP are low-level libraries which provide communication
     services to user-level software and implement a logically shared,
     distributed memory programming model.

     ·  GART is a feature of many AMD64 processors that allows the system to
        access virtually contiguous user pages that are backed by
        non-contiguous physical pages. The GART aggregates the Linux standard
        4 Kbyte pages into larger virtually contiguous memory regions. The
        contiguous pages exist in a portion of the physical address space
        known as the Graphics Aperture. The graphics aperture size is 2 GiB;
        therefore, the total memory which can be referenced through the GART
        is limited to 2 GiB per compute node.

     ·  The Memory Relocation Table (MRT) on the Gemini NIC maps the memory
        references contained in incoming network packets to physical memory
        on the local node. Memory references through the MRT map to a much
        larger address range than they do through the GART. Each NIC has its
        own MRT. MRT page sizes range from 128 Kbytes to 1 Gbyte, but all the
        entries on a given node must have the same page size. The MRT entries
        are created by kGNI in response to requests from the application,
        usually via the uGNI library. There are 16K MRT entries. The default
        MRT page size is 2 Mbytes, which maps 32 Gbytes (16K * 2M).
        HUGETLB_DEFAULT_PAGE_SIZE sets the MRT page size.

     Depending on the size of the allocated memory region and other default
     behavior, the memory registration function (of GNI/DMAPP) asks the
     kernel to create either GART entries on the AMD processor or, in the
     case of huge pages, entries in the Memory Relocation Table (MRT) on the
     NIC, to span the allocated memory region. User virtual memory that is to
     be read or written across nodes generally must first be registered on
     the node: its physical locations and extents are loaded into the Gemini
     Memory Descriptor Table (MDD) and either the Opteron GART or the Gemini
     MRT.

     Required GART address translation: Lustre I/O uses the GART. The Lustre
     Network Driver (LND) uses 1 Mbyte buffers, constructed out of smaller
     pages using the GART. DVS also uses the GART.

     Required MRT address translation: User virtual memory mapped by huge
     pages (via a hugetlbfs file system) is registered in the MRT. DMAPP
     mmaps the symmetric heap directly to the hugetlbfs file system,
     regardless of its size, if that file system is mounted, which it
     normally is on Cray XE systems. So any application using DMAPP (e.g. the
     SHMEM and PGAS programming models) uses the MRT for memory references
     within the symmetric heap. The symmetric heap always uses huge pages,
     regardless of whether a hugepages module is loaded. Note that the
     libhugetlbfs library is not used in this case: the value of
     HUGETLB_DEFAULT_PAGE_SIZE determines the page size for the symmetric
     heap, but the other HUGETLB environment variables have no effect.

     When an application's memory requirements (specifically, memory which is
     mapped through the HSN) exceed the GART aperture size (2 GiB) on a
     single node, the application must be linked with the libhugetlbfs
     library to use the larger address range available with huge pages.

   Default Behavior If Not Using craype-hugepages Modules
     If no craype-hugepages module is loaded and none of the HUGETLB
     environment variables are set, by default the symmetric heap (in the
     case of the SHMEM or PGAS programming models) is mapped onto huge pages,
     but most other memory is mapped onto base pages, which use the GART.
     Considering the 2 GiB per-node GART limit, which is shared between
     application PEs on a node, Lustre, and DVS, it is advisable to map the
     static data section and private heap onto huge pages. This can be
     selectively changed by using the proper link options and setting the
     environment variables HUGETLB_ELFMAP=W and HUGETLB_MORECORE=yes.
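     One way to verify which page size actually backs a given mapping at run
     time is the KernelPageSize field that the kernel reports in
     /proc/<pid>/smaps (present in upstream Linux since 2.6.29). The
     following sketch scans the current process's own mappings and prints any
     region backed by pages larger than the 4K base page; it is a generic
     illustration, not part of the Cray toolchain:

         #include <stdio.h>
         #include <string.h>

         int main(void)
         {
             FILE *f = fopen("/proc/self/smaps", "r");
             char line[256], region[256] = "";
             if (f == NULL) { perror("fopen"); return 1; }

             while (fgets(line, sizeof line, f)) {
                 /* Heuristic: mapping header lines contain an address
                  * range such as "7f00..-7f00.. rw-s ...". */
                 if (strchr(line, '-') && strchr(line, ' '))
                     strncpy(region, line, sizeof region - 1);

                 unsigned long kb;
                 if (sscanf(line, "KernelPageSize: %lu kB", &kb) == 1
                     && kb > 4)
                     printf("%lu kB pages: %s", kb, region);
             }
             fclose(f);
             return 0;
         }

     Called from within an application after its allocations are made, this
     shows whether the heap, static data, or symmetric heap ended up on huge
     pages.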
   Aries NIC
     In Cray systems which have the Aries NIC, the Aries I/O Memory
     Management Unit (IOMMU) provides hardware support for memory protection
     and address translation. The Aries IOMMU uses an entirely different
     memory translation mechanism than Gemini uses:

     ·  The IOMMU is divided into 16 translation context registers (TCRs).
        Each translation context (TC) supports a single page size. The TCRs
        can independently address different page sizes and present them to
        the network as a contiguous memory domain.

     ·  The TCR entries are used to set and clear the page table entries
        (PTEs) used by GNI. PTE entries are cached in Aries NIC memory in a
        page table. Up to 512 PTEs can be used by applications. 512 MiB (the
        largest huge page size) * 512 PTEs = 256 GiB of addressable memory
        per node on Aries systems.

   Other Notes on Memory Usage
     Huge pages benefit applications which have a large working set size
     (hundreds of Mbytes to many Gbytes and above), since such a working set
     would otherwise require many virtual-to-physical address translations
     using the default 4K pages. Using huge pages decreases the number of
     required address translations, which benefits application performance by
     reducing the time spent filling the TLB with translation data.

     Larger pages increase memory reach but may also exhaust available memory
     more quickly, so the optimal page size may vary from application to
     application. With huge pages, an application is still limited by the
     total memory on the node, and memory fragmentation can decrease
     available memory. See ISSUES.

     The /proc/meminfo file does not give a complete picture of huge page
     usage and is deprecated for this purpose.
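     On standard Linux kernels, per-size huge page counters are instead
     exposed under /sys/kernel/mm/hugepages. The following sketch prints the
     free and total counts for every huge page size the kernel knows about;
     whether every CLE-specific page size appears there is an assumption:

         #include <dirent.h>
         #include <stdio.h>
         #include <string.h>

         int main(void)
         {
             const char *base = "/sys/kernel/mm/hugepages";
             DIR *d = opendir(base);
             struct dirent *e;
             if (d == NULL) { perror("opendir"); return 1; }

             /* One subdirectory per page size, e.g. "hugepages-2048kB". */
             while ((e = readdir(d)) != NULL) {
                 if (strncmp(e->d_name, "hugepages-", 10) != 0)
                     continue;

                 char path[512];
                 unsigned long nr = 0, free_pages = 0;
                 FILE *f;

                 snprintf(path, sizeof path, "%s/%s/nr_hugepages",
                          base, e->d_name);
                 if ((f = fopen(path, "r")) != NULL) {
                     if (fscanf(f, "%lu", &nr) != 1) nr = 0;
                     fclose(f);
                 }

                 snprintf(path, sizeof path, "%s/%s/free_hugepages",
                          base, e->d_name);
                 if ((f = fopen(path, "r")) != NULL) {
                     if (fscanf(f, "%lu", &free_pages) != 1) free_pages = 0;
                     fclose(f);
                 }

                 printf("%s: %lu free of %lu\n", e->d_name, free_pages, nr);
             }
             closedir(d);
             return 0;
         }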
   Running Independent Software Vendor (ISV) Applications
     To enable a dynamically linked executable that was not originally linked
     with libhugetlbfs to use Cray's libhugetlbfs library at run time, first
     load a hugepages module and set the LD_PRELOAD environment variable so
     that it contains the libhugetlbfs pathname:

         module load craype-hugepages2M
         export LD_PRELOAD=/usr/lib64/libhugetlbfs.so

     If an ISV application already uses LD_PRELOAD to set dynamic library
     dependencies, use a white-space-separated list. For example:

         export LD_PRELOAD="/usr/lib64/libhugetlbfs.so /directory_name/lib.so"

     To confirm the usage of huge pages, set HUGETLB_VERBOSE to 3 or higher:

         export HUGETLB_VERBOSE=3

     Statically linked executables can only use Cray's libhugetlbfs if they
     are linked with it. Statically linked executables do not process
     LD_PRELOAD; therefore, statically linked ISV applications must be
     relinked with libhugetlbfs. See Module Support for compiling and
     linking. The nm and ldd commands are useful for determining the contents
     and dynamic dependencies of executables.

   Selective Mapping
     ISV applications sometimes consist of scripts which run several
     executables, only some of which need to run with huge pages. The
     HUGETLB_RESTRICT_EXE environment variable enables the libhugetlbfs
     library to selectively map only the named executables onto huge pages.

   Terms
     Text Segment - contains the actual instructions to be executed.

     Data Segment - contains the program's data, which is further divided
     into data, bss, and heap sections:

     ·  Data - global and static initialized data.

     ·  BSS - global and static uninitialized data.

     ·  Heap - dynamically allocated memory.

     Stack - used for local variables and stack frames.

     Symmetric Heap - contains dynamically allocated memory for a PE which is
     kept in sync by the programming model (e.g. SHMEM) with that of other
     PEs. See the intro_shmem(3) man page for additional information. The
     private heap contains dynamically allocated memory which is specific to
     a PE.

     GART - Graphics Aperture Relocation Table

     HSN - High Speed Network

     IOMMU - I/O Memory Management Unit

     ISV - Independent Software Vendor

     MRT - Memory Relocation Table

     TLB - Translation Lookaside Buffer; the memory management hardware used
     to translate virtual addresses into physical addresses.

ISSUES
     Huge pages are a per-node resource, not a per-job or per-process
     resource. There is no guarantee that the requested number of huge pages
     will be available on the compute nodes. If the memory pool becomes
     fragmented, which it can over time, the number of free blocks that are
     equal to or larger than the huge page size can drop below the number
     needed to service the request, even though there may be enough free
     memory in the pool when summing free blocks of all sizes. For this
     reason, use huge page sizes no larger than needed.

     If the heap is mapped to huge pages (by setting HUGETLB_MORECORE to
     yes), a malloc call requires that the heap be extended, and there are
     not enough free blocks in the memory pool large enough to supply the
     required number of huge pages, libhugetlbfs issues the following WARNING
     message and glibc falls back to allocating base pages:

         libhugetlbfs [nid000xx:xxxxx]: WARNING: New heap segment map at
         0x10000000 failed: Cannot allocate memory

     Since this is a warning, jobs are able to continue running after this
     message occurs. But because the allocated base pages use GART entries,
     and, as described in the NOTES section, there are a limited number of
     GART entries, future memory requests may fail altogether due to a lack
     of available GART entries.

     With craype-hugepages modules loaded, it is no longer necessary to
     include -lhugetlbfs on the link line. Doing so will result in messages
     indicating multiple definitions, such as:

         //usr/lib64/libhugetlbfs.a(elflink.o): In function
         `__libhugetlbfs_do_remap_segments':
         /usr/src/packages/BUILD/cray-libhugetlbfs-2.11/elflink.c:2012:
         multiple definition of `__libhugetlbfs_do_remap_segments'
         //usr/lib64/libhugetlbfs.a(elflink.o):/usr/src/packages/BUILD/
         cray-libhugetlbfs-2.11/elflink.c:2012: first defined here

     Adjust makefiles or build scripts accordingly.

SEE ALSO
     hugeadm(8), cc(1), CC(1), ftn(1), aprun(1), intro_mpi(3),
     intro_shmem(3), libhugetlbfs(7)

     /usr/share/doc/libhugetlbfs/HOWTO

                                 03-08-2019                 intro_hugepages(1)