Hi Mike,

On Thu, Feb 21, 2019 at 11:11:06AM -0800, Mike Kravetz wrote:
> On 2/20/19 10:09 PM, Andrew Morton wrote:
> > On Tue, 12 Feb 2019 14:14:00 -0800 Mike Kravetz <mike.kravetz@xxxxxxxxxx> wrote:
> >> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> >> index a80832487981..f859e319e3eb 100644
> >> --- a/mm/hugetlb.c
> >> +++ b/mm/hugetlb.c
...
> >> @@ -3863,6 +3862,10 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
> >>  	}
> >> 
> >>  	spin_unlock(ptl);
> >> +
> >> +	/* May already be set if not newly allocated page */
> >> +	set_page_huge_active(page);
> >> +
> 
> This is wrong. We need to only set_page_huge_active() for newly allocated
> pages. Why? We could have got the page from the pagecache, and it could
> be that the page is !page_huge_active() because it has been isolated for
> migration. Therefore, we do not want to set it active here.
> 
> I have also found another race with migration when removing a page from
> a file. When a huge page is removed from the pagecache, the page_mapping()
> field is cleared yet page_private continues to point to the subpool until
> the page is actually freed by free_huge_page(). free_huge_page is what
> adjusts the counts for the subpool. A page could be migrated while in this
> state. However, since page_mapping() is not set, the hugetlbfs specific
> routine to transfer page_private is not called and we leak the page count
> in the filesystem. To fix, check for this condition before migrating a huge
> page. If the condition is detected, return EBUSY for the page.
> 
> Both issues are addressed in the updated patch below.
> 
> Sorry for the churn. As I find and fix one issue I seem to discover another.
> There is still at least one more issue with private pages when COW comes into
> play. I continue to work that. I wanted to send this patch earlier as it
> is pretty easy to hit the bugs if you try. If you would prefer another
> approach, let me know.
> 
> From: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
> Date: Thu, 21 Feb 2019 11:01:04 -0800
> Subject: [PATCH] huegtlbfs: fix races and page leaks during migration

Subject still contains a typo.

> 
> hugetlb pages should only be migrated if they are 'active'. The routines
> set/clear_page_huge_active() modify the active state of hugetlb pages.
> When a new hugetlb page is allocated at fault time, set_page_huge_active
> is called before the page is locked. Therefore, another thread could
> race and migrate the page while it is being added to page table by the
> fault code. This race is somewhat hard to trigger, but can be seen by
> strategically adding udelay to simulate worst case scheduling behavior.
> Depending on 'how' the code races, various BUG()s could be triggered.
> 
> To address this issue, simply delay the set_page_huge_active call until
> after the page is successfully added to the page table.
> 
> Hugetlb pages can also be leaked at migration time if the pages are
> associated with a file in an explicitly mounted hugetlbfs filesystem.
> For example, consider a two node system with 4GB worth of huge pages
> available. A program mmaps a 2G file in a hugetlbfs filesystem. It
> then migrates the pages associated with the file from one node to
> another. When the program exits, huge page counts are as follows:
> 
> node0
> 1024    free_hugepages
> 1024    nr_hugepages
> 
> node1
> 0       free_hugepages
> 1024    nr_hugepages
> 
> Filesystem                Size  Used Avail Use% Mounted on
> nodev                     4.0G  2.0G  2.0G  50% /var/opt/hugepool
> 
> That is as expected.
> 2G of huge pages are taken from the free_hugepages
> counts, and 2G is the size of the file in the explicitly mounted filesystem.
> If the file is then removed, the counts become:
> 
> node0
> 1024    free_hugepages
> 1024    nr_hugepages
> 
> node1
> 1024    free_hugepages
> 1024    nr_hugepages
> 
> Filesystem                Size  Used Avail Use% Mounted on
> nodev                     4.0G  2.0G  2.0G  50% /var/opt/hugepool
> 
> Note that the filesystem still shows 2G of pages used, while there
> actually are no huge pages in use. The only way to 'fix' the
> filesystem accounting is to unmount the filesystem.
> 
> If a hugetlb page is associated with an explicitly mounted filesystem,
> this information is contained in the page_private field. At migration
> time, this information is not preserved. To fix, simply transfer
> page_private from old to new page at migration time if necessary.
> 
> There is a related race with removing a huge page from a file and migration.
> When a huge page is removed from the pagecache, the page_mapping() field
> is cleared yet page_private remains set until the page is actually freed
> by free_huge_page(). A page could be migrated while in this state.
> However, since page_mapping() is not set, the hugetlbfs specific routine
> to transfer page_private is not called and we leak the page count in the
> filesystem. To fix, check for this condition before migrating a huge
> page. If the condition is detected, return EBUSY for the page.
> 
> Cc: <stable@xxxxxxxxxxxxxxx>
> Fixes: bcc54222309c ("mm: hugetlb: introduce page_huge_active")
> Signed-off-by: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
> ---
>  fs/hugetlbfs/inode.c | 12 ++++++++++++
>  mm/hugetlb.c         | 12 +++++++++---
>  mm/migrate.c         | 11 +++++++++++
>  3 files changed, 32 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index 32920a10100e..a7fa037b876b 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -859,6 +859,18 @@ static int hugetlbfs_migrate_page(struct address_space *mapping,
>  	rc = migrate_huge_page_move_mapping(mapping, newpage, page);
>  	if (rc != MIGRATEPAGE_SUCCESS)
>  		return rc;
> +
> +	/*
> +	 * page_private is subpool pointer in hugetlb pages.  Transfer to
> +	 * new page.  PagePrivate is not associated with page_private for
> +	 * hugetlb pages and can not be set here as only page_huge_active
> +	 * pages can be migrated.
> +	 */
> +	if (page_private(page)) {
> +		set_page_private(newpage, page_private(page));
> +		set_page_private(page, 0);
> +	}
> +
>  	if (mode != MIGRATE_SYNC_NO_COPY)
>  		migrate_page_copy(newpage, page);
>  	else
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index a80832487981..e9c92e925b7e 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
...
> @@ -3863,6 +3864,11 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
>  	}
> 
>  	spin_unlock(ptl);
> +
> +	/* Make newly allocated pages active */

You already have a perfect explanation of why we need this "if",

> ... We could have got the page from the pagecache, and it could
> be that the page is !page_huge_active() because it has been isolated for
> migration.

so you could improve this comment with it.

Anyway, I agree with what/how you are fixing this.

Reviewed-by: Naoya Horiguchi <n-horiguchi@xxxxxxxxxxxxx>

Thanks,
Naoya Horiguchi
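
[Illustration only, not part of the quoted patch: the mm/hugetlb.c hunk above is
cut off right after the new comment, so the snippet below is merely a sketch of
the guarded call being discussed. It assumes a local boolean such as new_page in
hugetlb_no_page() that records whether the page was freshly allocated in this
fault rather than found in the page cache.]

	spin_unlock(ptl);

	/*
	 * Sketch: mark the page active only if it was newly allocated in
	 * this fault, and only after it has been wired into the page table
	 * and the page table lock dropped.  A page found in the page cache
	 * may be !page_huge_active() because it has already been isolated
	 * for migration, so it must not be re-marked active here.
	 */
	if (new_page)
		set_page_huge_active(page);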