On Mon, Nov 25, 2019 at 10:24:38AM -0800, Yang Shi wrote: > > > On 11/25/19 1:36 AM, Kirill A. Shutemov wrote: > > On Sat, Nov 23, 2019 at 09:05:32AM +0800, Yang Shi wrote: > > > Currently when truncating shmem file, if the range is partial of THP > > > (start or end is in the middle of THP), the pages actually will just get > > > cleared rather than being freed unless the range cover the whole THP. > > > Even though all the subpages are truncated (randomly or sequentially), > > > the THP may still be kept in page cache. This might be fine for some > > > usecases which prefer preserving THP. > > > > > > But, when doing balloon inflation in QEMU, QEMU actually does hole punch > > > or MADV_DONTNEED in base page size granulairty if hugetlbfs is not used. > > > So, when using shmem THP as memory backend QEMU inflation actually doesn't > > > work as expected since it doesn't free memory. But, the inflation > > > usecase really needs get the memory freed. Anonymous THP will not get > > > freed right away too but it will be freed eventually when all subpages are > > > unmapped, but shmem THP would still stay in page cache. > > > > > > To protect the usecases which may prefer preserving THP, introduce a > > > new fallocate mode: FALLOC_FL_SPLIT_HPAGE, which means spltting THP is > > > preferred behavior if truncating partial THP. This mode just makes > > > sense to tmpfs for the time being. > > We need to clarify interaction with khugepaged. This implementation > > doesn't do anything to prevent khugepaged from collapsing the range back > > to THP just after the split. > > Yes, it doesn't. Will clarify this in the commit log. Okay, but I'm not sure that documention alone will be enough. We need proper design. > > > @@ -976,8 +1022,31 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend, > > > } > > > unlock_page(page); > > > } > > > +rescan_split: > > > pagevec_remove_exceptionals(&pvec); > > > pagevec_release(&pvec); > > > + > > > + if (split && PageTransCompound(page)) { > > > + /* The THP may get freed under us */ > > > + if (!get_page_unless_zero(compound_head(page))) > > > + goto rescan_out; > > > + > > > + lock_page(page); > > > + > > > + /* > > > + * The extra pins from page cache lookup have been > > > + * released by pagevec_release(). > > > + */ > > > + if (!split_huge_page(page)) { > > > + unlock_page(page); > > > + put_page(page); > > > + /* Re-look up page cache from current index */ > > > + goto again; > > > + } > > > + unlock_page(page); > > > + put_page(page); > > > + } > > > +rescan_out: > > > index++; > > > } > > Doing get_page_unless_zero() just after you've dropped the pin for the > > page looks very suboptimal. > > If I don't drop the pins the THP can't be split. And, there might be more > than one pins from find_get_entries() if I read the code correctly. For > example, truncate 8K length in the middle of THP, the THP's refcount would > get bumpped twice since two sub pages would be returned. Pin the page before pagevec_release() and avoid get_page_unless_zero(). Current code is buggy. You need to check that the page is still belong to the file after speculative lookup. -- Kirill A. Shutemov