On Tue, Mar 30, 2021 at 01:24:14PM -0400, Zi Yan wrote:
> Hi Roman,
>
> On 4 Mar 2021, at 11:45, Roman Gushchin wrote:
>
> > On Thu, Mar 04, 2021 at 11:26:03AM -0500, Zi Yan wrote:
> >> On 1 Mar 2021, at 20:59, Roman Gushchin wrote:
> >>
> >>> On Wed, Feb 24, 2021 at 05:35:36PM -0500, Zi Yan wrote:
> >>>> From: Zi Yan <ziy@xxxxxxxxxx>
> >>>>
> >>>> Hi all,
> >>>>
> >>>> I have rebased my 1GB PUD THP support patches on v5.11-mmotm-2021-02-18-18-29
> >>>> and the code is available at
> >>>> https://github.com/x-y-z/linux-1gb-thp/tree/1gb_thp_v5.11-mmotm-2021-02-18-18-29
> >>>> if you want to give it a try. The actual 49 patches are not sent out with this
> >>>> cover letter. :)
> >>>>
> >>>> Instead of asking for code review, I would like to discuss the concerns I got
> >>>> from previous RFCs. I think there are two major ones:
> >>>>
> >>>> 1. 1GB page allocation. The current implementation allocates 1GB pages from
> >>>>    CMA regions that are reserved at boot time, like hugetlbfs. The concern
> >>>>    with using CMA is that an educated guess is needed to avoid depleting
> >>>>    kernel memory in case the CMA regions are set too large. Recently David
> >>>>    Rientjes proposed using process_madvise() for hugepage collapse, which is
> >>>>    an alternative [1] but might not work for 1GB pages, since there is no way
> >>>>    of _allocating_ a 1GB page into which to collapse pages. I proposed a
> >>>>    similar approach at LSF/MM 2019, generating physically contiguous memory
> >>>>    after pages are allocated [2], which is usable for 1GB THPs. This approach
> >>>>    does in-place huge page promotion and thus does not require page allocation.
> >>>
> >>> Well, I don't think there is an alternative to CMA right now. Once the memory
> >>> has been almost filled at least once, any subsequent activity leading to
> >>> substantial slab allocations (e.g. running git gc) will fragment the memory,
> >>> so that there are virtually no chances to find a contiguous GB.
> >>>
> >>> It's possible in theory to reduce the fragmentation on the 1GB scale by
> >>> grouping non-movable pageblocks, but that seems like a separate project.
> >>
> >> My experiments showed that finding contiguous GBs is possible, but I agree that
> >> CMA is more reliable and 1GB-scale defragmentation should be a separate project.
> >
> > I actually ran a large-scale experiment (on tens of thousands of machines) over
> > the last several months. It was about hugetlbfs 1GB pages, but the allocation
> > mechanism is the same.
>
> Thanks for the information. I finally have time to come back to this. Do you mind
> sharing the total memory of these machines? I want to have some idea of the scale
> of this issue to make sure I reproduce it on a proper machine. Are you trying to
> get <20% of 10s of GBs, 100s of GBs, or TBs of memory?

There are different configurations, but in general they are in the 100s of GB range
or smaller.

>
> > My goal was to allocate a relatively small number of 1GB pages (<20% of the
> > total memory). Without CMA the chances reach 0% very fast after reboot, and
> > even manual manipulations like shutting down all workloads, dropping caches,
> > calling sync, compaction, etc. do not help much. Sometimes you can allocate
> > maybe 1-2 pages, but that's about it.
>
> Is there a way of replicating such an environment with publicly available
> software? I really want to understand the root cause and am willing to find a
> possible solution. It would be much easier if I can reproduce this locally.
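(For reference, the allocation side of the experiment is nothing special: each attempt
is roughly a single 1GB hugetlb mapping. Below is a minimal, untested sketch, not
taken from the patchset or from any of the tests mentioned above; it assumes 1GB pages
were made available beforehand, e.g. with hugetlb_cma=<size> on the kernel command
line or via /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages.)

/*
 * Minimal sketch: request a single 1GB hugetlb page from userspace.
 * Assumes 1GB pages were reserved beforehand (hugetlb_cma= or nr_hugepages).
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT 26
#endif
#ifndef MAP_HUGE_1GB
#define MAP_HUGE_1GB (30 << MAP_HUGE_SHIFT)
#endif

int main(void)
{
	size_t len = 1UL << 30;		/* one 1GB page */
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB,
		       -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap(1GB hugetlb)");	/* typically ENOMEM when no 1GB page is free */
		return 1;
	}
	memset(p, 0, len);		/* touch it so the page is actually faulted in */
	munmap(p, len);
	return 0;
}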
There is nothing fb-specific: once the memory is filled with anon/pagecache, any
subsequent allocations of non-movable memory (slabs, percpu, etc.) will fragment the
memory. There is a pageblock mechanism which prevents fragmentation on the 2MB scale,
but nothing prevents fragmentation on the 1GB scale. It's just a matter of runtime
(and of the number of mm operations).

>
> > Even with CMA we had to fix a number of additional problems (like sub-optimal
> > placement of CMA areas, 2MB THP migration, some ext4 and btrfs page migration
> > issues) to get a reasonable success rate of about ~95-99%. And it's not 100%
> > anyway.
> >
> > The problem with artificial tests is that you're likely experimenting on a
> > freshly rebooted machine which isn't/wasn't doing much. It's a bad model of
> > the real memory state of a production server.
>
> Yes, I agree that my experiment is not representative. Can you provide more
> information on what application behavior(s) lead to this memory fragmentation?
> I guess it is because non-movable pages spread across the entire physical memory
> space. Is there a quick reproducer for that?

I have a simple C program which is able to fragment the memory; you can play with
it: https://github.com/rgushchin/fragm . But as I said, basically any load which
actively uses the whole memory will fragment it.

Thanks!
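P.S. If you would rather roll your own, here is a rough, untested sketch of that kind
of workload (it is not the fragm program above; the 80% fraction, the 2MB/64MB sizes
and the /usr/bin path are arbitrary): fill most of the RAM with anonymous memory,
punch small holes in it, and then grow unmovable slab objects (e.g. negative dentries)
so they end up scattered across physical memory.

/*
 * Rough sketch of a fragmentation workload (NOT the fragm program above).
 * Fill ~80% of RAM with anon memory, punch a 2MB hole every 64MB, then grow
 * unmovable slab objects so they land in those scattered holes.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

int main(void)
{
	size_t hole = 2UL << 20;	/* 2MB: one pageblock */
	size_t stride = 64UL << 20;	/* one hole every 64MB */
	size_t total = (size_t)sysconf(_SC_PHYS_PAGES) *
		       (size_t)sysconf(_SC_PAGE_SIZE) / 10 * 8;	/* ~80% of RAM */
	total &= ~(stride - 1);

	char *p = mmap(NULL, total, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	memset(p, 1, total);		/* fault everything in; expect heavy reclaim */

	/* Punch holes so the remaining free memory is spread over physmem. */
	for (size_t off = 0; off < total; off += stride)
		munmap(p + off, hole);

	/* Grow an unmovable slab cache: stat() on missing files creates negative dentries. */
	for (int i = 0; i < 1000000; i++) {
		char path[64];
		struct stat st;

		snprintf(path, sizeof(path), "/usr/bin/no-such-file-%d", i);
		stat(path, &st);
	}

	pause();	/* keep the anon chunks mapped while testing 1GB allocations */
	return 0;
}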