On Thu, 2020-10-22 at 19:54 -0700, Hugh Dickins wrote:
> On Thu, 22 Oct 2020, Rik van Riel wrote:
> 
> > The allocation flags of anonymous transparent huge pages can be
> > controlled through the files in /sys/kernel/mm/transparent_hugepage/defrag,
> > which can keep the system from getting bogged down in the page reclaim
> > and compaction code when many THPs are getting allocated simultaneously.
> > 
> > However, the gfp_mask for shmem THP allocations was not limited by those
> > configuration settings, and some workloads ended up with all CPUs stuck
> > on the LRU lock in the page reclaim code, trying to allocate dozens of
> > THPs simultaneously.
> > 
> > This patch applies the same configured limitation of THPs to shmem
> > hugepage allocations, to prevent that from happening.
> > 
> > This way a THP defrag setting of "never" or "defer+madvise" will result
> > in quick allocation failures without direct reclaim when no 2MB free
> > pages are available.
> > 
> > Signed-off-by: Rik van Riel <riel@xxxxxxxxxxx>
> 
> NAK in its present untested form: see below.

Oops. That issue is easy to fix, but indeed let's figure out
what the desired behavior is.

> I'm open to change here, particularly to Yu Xu's point (in other mail)
> about direct reclaim - we avoid that here in Google too: though it's
> not so much to avoid the direct reclaim, as to avoid the latencies of
> direct compaction, which __GFP_DIRECT_RECLAIM allows as a side-effect.
> 
> > @@ -1887,7 +1888,8 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
> >  	}
> >  
> >  alloc_huge:
> > -	page = shmem_alloc_and_acct_page(gfp, inode, index, true);
> > +	huge_gfp = alloc_hugepage_direct_gfpmask(vma);
> 
> Still looks nice: but what about the crash when vma is NULL?

That's a one line fix (a rough sketch is further down in this mail),
but I suppose we should get the discussion on what the code behavior
should be out of the way first :)

> Michal is right to remember pushback before, because tmpfs is a
> filesystem, and "huge=" is a mount option: in using a huge=always
> filesystem, the user has already declared a preference for huge pages.
> Whereas the original anon THP had to deduce that preference from sys
> tunables and vma madvice.

...

> But it's likely that they have accumulated some defrag wisdom, which
> tmpfs can take on board - but please accept that in using a huge mount,
> the preference for huge has already been expressed, so I don't expect
> anon THP alloc_hugepage_direct_gfpmask() choices will map one to one.

In my mind, the huge= mount options for tmpfs corresponded to the
"enabled" anon THP options, denoting a desired end state, not
necessarily how much we will stall allocations to get there
immediately.

The underlying allocation behavior has been changed repeatedly,
with changes to the direct reclaim code and the compaction
deferral code.

The shmem THP gfp_mask never tried really hard anyway, with
__GFP_NORETRY being the default, which matches what is used for
non-VM_HUGEPAGE anon VMAs. Likewise, the direct reclaim done for
opportunistic THP allocations by the shmem code limits itself to
reclaiming 32 4kB pages per THP allocation.

In other words, mounting with huge=always has never behaved the
same as the more aggressive allocations done for MADV_HUGEPAGE VMAs.

This patch would leave shmem THP allocations for non-MADV_HUGEPAGE
mapped files opportunistic like today, and make shmem THP allocations
for files mapped with MADV_HUGEPAGE more aggressive than today.
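To make the NULL vma point above concrete, here is a minimal sketch of
the kind of one-line guard I have in mind for shmem_getpage_gfp(). This
is illustrative only, not the final patch: falling back to the limited
mapping gfp_mask for callers that come in without a faulting vma is an
assumption on my part, and the exact fallback is part of what we should
discuss.

alloc_huge:
	/*
	 * Sketch only: use the vma's THP defrag policy when we have a
	 * faulting vma, and fall back to the limited mapping gfp_mask
	 * for callers without one, which keeps today's opportunistic
	 * behavior on those paths.
	 */
	huge_gfp = vma ? alloc_hugepage_direct_gfpmask(vma) : gfp;
	page = shmem_alloc_and_acct_page(huge_gfp, inode, index, true);

That way only allocations that actually have THP policy attached to a
vma would change behavior; everything else would keep allocating the
way it does today.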
However, I would like to know what people think the shmem huge= mount
options should do, and how things should behave when memory gets low,
before pushing in a patch just because it makes the system run smoother
"without changing current behavior too much".

What do people want tmpfs THP allocations to do?

-- 
All Rights Reversed.