On 8/16/21 4:23 PM, Andrew Morton wrote:
> On Mon, 16 Aug 2021 15:49:45 -0700 Mike Kravetz <mike.kravetz@xxxxxxxxxx> wrote:
>
>> This is a resend of PATCHes sent here [4]. There was some discussion
>> and interest when the RFC [5] was sent, but little after that. The
>> resend is just a rebase of [4] to next-20210816 with a few typos in
>> commit messages fixed.
>>
>> Original Cover Letter
>> ---------------------
>> The concurrent use of multiple hugetlb page sizes on a single system
>> is becoming more common. One of the reasons is better TLB support for
>> gigantic page sizes on x86 hardware. In addition, hugetlb pages are
>> being used to back VMs in hosting environments.
>>
>> When using hugetlb pages to back VMs in such environments, it is
>> sometimes desirable to preallocate hugetlb pools. This avoids the delay
>> and uncertainty of allocating hugetlb pages at VM startup. In addition,
>> preallocating huge pages minimizes the issue of memory fragmentation that
>> increases the longer the system is up and running.
>>
>> In such environments, a combination of larger and smaller hugetlb pages
>> are preallocated in anticipation of backing VMs of various sizes. Over
>> time, the preallocated pool of smaller hugetlb pages may become
>> depleted while larger hugetlb pages still remain. In such situations,
>> it may be desirable to convert larger hugetlb pages to smaller hugetlb
>> pages.
>>
>> Converting larger to smaller hugetlb pages can be accomplished today by
>> first freeing the larger page to the buddy allocator and then allocating
>> the smaller pages. However, there are two issues with this approach:
>> 1) This process can take quite some time, especially if allocation of
>>    the smaller pages is not immediate and requires migration/compaction.
>> 2) There is no guarantee that the total size of smaller pages allocated
>>    will match the size of the larger page which was freed.
>>    This is because the area freed by the larger page could quickly be
>>    fragmented.
>>
>> To address these issues, introduce the concept of hugetlb page demotion.
>> Demotion provides a means of 'in place' splitting a hugetlb page to
>> pages of a smaller size. For example, on x86 one 1G page can be
>> demoted to 512 2M pages. Page demotion is controlled via sysfs files.
>> - demote_size   Read only target page size for demotion
>
> Should this be "write only"? If not, I'm confused.
>
> If "yes" then "write only" would be a misnomer - clearly this file is
> readable (looks at demote_size_show()).
>

It is read only and is there mostly as information for the user. When
they demote a page, this is the size to which the page will be demoted.

For example,
# pwd
/sys/kernel/mm/hugepages/hugepages-1048576kB
# cat demote_size
2048kB
# pwd
/sys/kernel/mm/hugepages/hugepages-2048kB
# cat demote_size
4kB

The "demote size" is not user configurable. Although, that is
something brought up by Oscar previously. I did not directly address
this in the RFC. My bad. However, I do not like the idea of making
demote_size writable/selectable. My concern would be someone changing
the value and not resetting. It certainly is something that can be done
with minor code changes.

>> - demote        Writable number of hugetlb pages to be demoted
>
> So how does this interface work? Write the target size to
> `demote_size', write the number of to-be-demoted larger pages to
> `demote' and then the operation happens?
>
> If so, how does one select which size pages should be selected for
> the demotion?

The location in the sysfs directory tells you what size pages will be
demoted. For example,

echo 5 > /sys/kernel/mm/hugepages/hugepages-1048576kB/demote

says to demote 5 1GB pages.

demote files are also in node specific directories so you can even pick
huge pages from a specific node.

echo 5 > /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/demote

>
> And how does one know the operation has completed so the sysfs files
> can be reloaded for another operation?
>

When the write to the file is complete, the operation has completed.
Not exactly sure what you mean by reloading the sysfs files for
another operation?

>> Only hugetlb pages which are free at the time of the request can be demoted.
>> Demotion does not add to the complexity surplus pages. Demotion also honors
>> reserved huge pages. Therefore, when a value is written to the sysfs demote
>> file, that value is only the maximum number of pages which will be demoted.
>> It is possible fewer will actually be demoted.
>>
>> If demote_size is PAGESIZE, demote will simply free pages to the buddy
>> allocator.
>>
>> Real world use cases
>> --------------------
>> There are groups today using hugetlb pages to back VMs on x86. Their
>> use case is as described above. They have experienced the issues with
>> performance and not necessarily getting the excepted number smaller huge
>
> ("expected")

yes, will fix typo

>
>> pages after free/allocate cycle.
>>
>
> It seems odd to add the interfaces in patch 1 then document them in
> patch 5. Why not add-and-document in a single patch?
>

Yes, makes sense. Will combine these.

-- 
Mike Kravetz
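
For reference, the workflow described in this reply can be sketched as a
small shell helper. This is only an illustration under stated assumptions,
not part of the series: the function name demote_pages and the HUGEPAGES_DIR
variable are made up here, and it assumes that nr_hugepages in the same
directory drops by the number of pages actually demoted.

```shell
#!/bin/sh
# Sketch only: demote up to COUNT free huge pages of one size and report
# how many were actually demoted. HUGEPAGES_DIR and demote_pages are
# illustrative names; the default is the global sysfs path from the mail.
HUGEPAGES_DIR=${HUGEPAGES_DIR:-/sys/kernel/mm/hugepages}

demote_pages() {
    size_kb=$1    # size of the pages to demote, in kB (1048576 for 1G)
    count=$2      # maximum number of pages to demote
    dir="$HUGEPAGES_DIR/hugepages-${size_kb}kB"

    # The demote file only exists on kernels carrying this series.
    [ -w "$dir/demote" ] || { echo "no demote file in $dir" >&2; return 1; }

    # Assumption: the pool count of this size shrinks by the pages demoted.
    before=$(cat "$dir/nr_hugepages")

    # demote_size is read only: it reports the size pages are demoted to.
    echo "target size: $(cat "$dir/demote_size")"

    # The operation has completed once this write returns.
    echo "$count" > "$dir/demote" || return 1

    after=$(cat "$dir/nr_hugepages")
    echo "demoted $((before - after)) of $count requested pages"
}
```

With the series applied, "demote_pages 1048576 5" would request demotion of
up to five 1G pages. Setting
HUGEPAGES_DIR=/sys/devices/system/node/node1/hugepages restricts the request
to node 1. Since only free, unreserved pages can be demoted, the reported
count may be lower than the requested one.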