On Thu, Feb 07, 2019 at 10:50:55AM -0800, Mike Kravetz wrote:
> On 1/30/19 1:14 PM, Mike Kravetz wrote:
> > Files can be created and mapped in an explicitly mounted hugetlbfs
> > filesystem.  If pages in such files are migrated, the filesystem
> > usage will not be decremented for the associated pages.  This can
> > result in mmap or page allocation failures as it appears there are
> > fewer pages in the filesystem than there should be.
> 
> Does anyone have a little time to take a look at this?
> 
> While migration of hugetlb pages 'should' not be a common issue, we
> have seen it happen via soft memory errors/page poisoning in production
> environments.  Didn't see a leak in that case as it was with pages in a
> Sys V shared mem segment.  However, our DB code is starting to make use
> of files in explicitly mounted hugetlbfs filesystems.  Therefore, we are
> more likely to hit this bug in the field.

Hi Mike,

Thank you for finding/reporting the problem.
# sorry for my late response.

> > 
> > For example, a test program which hole punches, faults and migrates
> > pages in such a file (1G in size) will eventually fail because it
> > can not allocate a page.  Reported counts and usage at time of failure:
> > 
> > node0
> > 537	free_hugepages
> > 1024	nr_hugepages
> > 0	surplus_hugepages
> > node1
> > 1000	free_hugepages
> > 1024	nr_hugepages
> > 0	surplus_hugepages
> > 
> > Filesystem            Size  Used Avail Use% Mounted on
> > nodev                 4.0G  4.0G     0 100% /var/opt/hugepool
> > 
> > Note that the filesystem shows 4G of pages used, while actual usage is
> > 511 pages (just under 1G).  Failed trying to allocate page 512.
> > 
> > If a hugetlb page is associated with an explicitly mounted filesystem,
> > this information is contained in the page_private field.  At migration
> > time, this information is not preserved.  To fix, simply transfer
> > page_private from old to new page at migration time if necessary.  Also,
> > migrate_page_states() unconditionally clears page_private and PagePrivate
> > of the old page.  It is unlikely, but possible that these fields could
> > be non-NULL and are needed at hugetlb free page time.  So, do not touch
> > these fields for hugetlb pages.
> > 
> > Cc: <stable@xxxxxxxxxxxxxxx>
> > Fixes: 290408d4a250 ("hugetlb: hugepage migration core")
> > Signed-off-by: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
> > ---
> >  fs/hugetlbfs/inode.c | 10 ++++++++++
> >  mm/migrate.c         | 10 ++++++++--
> >  2 files changed, 18 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> > index 32920a10100e..fb6de1db8806 100644
> > --- a/fs/hugetlbfs/inode.c
> > +++ b/fs/hugetlbfs/inode.c
> > @@ -859,6 +859,16 @@ static int hugetlbfs_migrate_page(struct address_space *mapping,
> >  	rc = migrate_huge_page_move_mapping(mapping, newpage, page);
> >  	if (rc != MIGRATEPAGE_SUCCESS)
> >  		return rc;
> > +
> > +	/*
> > +	 * page_private is subpool pointer in hugetlb pages, transfer
> > +	 * if needed.
> > +	 */
> > +	if (page_private(page) && !page_private(newpage)) {
> > +		set_page_private(newpage, page_private(page));
> > +		set_page_private(page, 0);

Don't you have to copy the PagePrivate flag as well?
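For example (untested, and only meant to illustrate the question; whether
hugetlb pages actually want the flag moved here is for you to confirm),
the transfer could look something like:

	if (page_private(page) && !page_private(newpage)) {
		set_page_private(newpage, page_private(page));
		/* keep the flag in sync with the pointer on both pages */
		SetPagePrivate(newpage);
		ClearPagePrivate(page);
		set_page_private(page, 0);
	}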
> > +	}
> > +
> >  	if (mode != MIGRATE_SYNC_NO_COPY)
> >  		migrate_page_copy(newpage, page);
> >  	else
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index f7e4bfdc13b7..0d9708803553 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -703,8 +703,14 @@ void migrate_page_states(struct page *newpage, struct page *page)
> >  	 */
> >  	if (PageSwapCache(page))
> >  		ClearPageSwapCache(page);
> > -	ClearPagePrivate(page);
> > -	set_page_private(page, 0);
> > +	/*
> > +	 * Unlikely, but PagePrivate and page_private could potentially
> > +	 * contain information needed at hugetlb free page time.
> > +	 */
> > +	if (!PageHuge(page)) {
> > +		ClearPagePrivate(page);
> > +		set_page_private(page, 0);
> > +	}

# This argument is mainly about the existing code ...

According to the comment on migrate_page():

/*
 * Common logic to directly migrate a single LRU page suitable for
 * pages that do not use PagePrivate/PagePrivate2.
 *
 * Pages are locked upon entry and exit.
 */
int migrate_page(struct address_space *mapping,
	...

So this common logic assumes that page_private is not used; why, then, do
we explicitly clear page_private in migrate_page_states()?
buffer_migrate_page(), which is commonly used for the case when
page_private is used, does that clearing outside migrate_page_states().
So I thought that hugetlbfs_migrate_page() could do it in a similar
manner (a rough sketch of this direction is appended at the end of this
mail).  IOW, migrate_page_states() should not do anything with
PagePrivate.  But there are a few other .migratepage callbacks, and I'm
not sure all of them are safe for that change, so this approach might
not fit for a small fix.

# BTW, there seems to be a typo in $SUBJECT.

Thanks,
Naoya Horiguchi
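To make the above concrete, here is a rough, untested sketch of that
direction; the context lines are taken from the hunk quoted above, and
whether every existing .migratepage callback really tolerates it is
exactly the open question:

 	if (PageSwapCache(page))
 		ClearPageSwapCache(page);
-	ClearPagePrivate(page);
-	set_page_private(page, 0);

with any .migratepage callback that actually uses page_private
(buffer_migrate_page(), and hugetlbfs_migrate_page() with the transfer
added in this patch) moving and clearing it itself before the common
copy runs.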