On Mon, Mar 04, 2019 at 03:51:31PM -0800, Mike Kravetz wrote:
> On 3/2/19 12:12 AM, gregkh@xxxxxxxxxxxxxxxxxxx wrote:
> >
> > The patch below does not apply to the 4.14-stable tree.
> > If someone wants it applied there, or to any other stable or longterm
> > tree, then please email the backport, including the original git commit
> > id to <stable@xxxxxxxxxxxxxxx>.
>
> From: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
> Date: Mon, 4 Mar 2019 15:36:59 -0800
> Subject: [PATCH] hugetlbfs: fix races and page leaks during migration
>
> commit cb6acd01e2e43fd8bad11155752b7699c3d0fb76 upstream.
>
> hugetlb pages should only be migrated if they are 'active'.  The routines
> set/clear_page_huge_active() modify the active state of hugetlb pages.
> When a new hugetlb page is allocated at fault time, set_page_huge_active
> is called before the page is locked.  Therefore, another thread could
> race and migrate the page while it is being added to the page table by
> the fault code.  This race is somewhat hard to trigger, but can be seen
> by strategically adding udelay to simulate worst case scheduling behavior.
> Depending on 'how' the code races, various BUG()s could be triggered.
>
> To address this issue, simply delay the set_page_huge_active call until
> after the page is successfully added to the page table.
>
> Hugetlb pages can also be leaked at migration time if the pages are
> associated with a file in an explicitly mounted hugetlbfs filesystem.
> For example, consider a two node system with 4GB worth of huge pages
> available.  A program mmaps a 2G file in a hugetlbfs filesystem.  It
> then migrates the pages associated with the file from one node to
> another.  When the program exits, huge page counts are as follows:
>
> node0
> 1024    free_hugepages
> 1024    nr_hugepages
>
> node1
> 0       free_hugepages
> 1024    nr_hugepages
>
> Filesystem            Size  Used Avail Use% Mounted on
> nodev                 4.0G  2.0G  2.0G  50% /var/opt/hugepool
>
> That is as expected.  2G of huge pages are taken from the free_hugepages
> counts, and 2G is the size of the file in the explicitly mounted
> filesystem.  If the file is then removed, the counts become:
>
> node0
> 1024    free_hugepages
> 1024    nr_hugepages
>
> node1
> 1024    free_hugepages
> 1024    nr_hugepages
>
> Filesystem            Size  Used Avail Use% Mounted on
> nodev                 4.0G  2.0G  2.0G  50% /var/opt/hugepool
>
> Note that the filesystem still shows 2G of pages used, while there
> actually are no huge pages in use.  The only way to 'fix' the
> filesystem accounting is to unmount the filesystem.
>
> If a hugetlb page is associated with an explicitly mounted filesystem,
> this information is contained in the page_private field.  At migration
> time, this information is not preserved.  To fix, simply transfer
> page_private from old to new page at migration time if necessary.
>
> There is a related race with removing a huge page from a file and
> migration.  When a huge page is removed from the pagecache, the
> page_mapping() field is cleared, yet page_private remains set until the
> page is actually freed by free_huge_page().  A page could be migrated
> while in this state.  However, since page_mapping() is not set, the
> hugetlbfs specific routine to transfer page_private is not called and
> we leak the page count in the filesystem.  To fix, check for this
> condition before migrating a huge page.  If the condition is detected,
> return EBUSY for the page.
>
> Cc: <stable@xxxxxxxxxxxxxxx>
> Fixes: bcc54222309c ("mm: hugetlb: introduce page_huge_active")
> Signed-off-by: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
> ---
>  fs/hugetlbfs/inode.c | 12 ++++++++++++
>  mm/hugetlb.c         | 16 +++++++++++++---
>  mm/migrate.c         | 11 +++++++++++
>  3 files changed, 36 insertions(+), 3 deletions(-)

Thanks for all 4 of these, now queued up.

greg k-h
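
For anyone carrying this further back: the page_private handling described
above boils down to the hugetlbfs migratepage callback transferring the
subpool pointer to the new page, plus an early -EBUSY check in
unmap_and_move_huge_page() for pages whose page_mapping() has already been
cleared.  The sketch below is reconstructed from the changelog only, not
copied from the 4.14 backport, so the exact hunks may differ:

/*
 * Sketch, fs/hugetlbfs/inode.c -- reconstructed from the changelog above,
 * not the verbatim backport.
 */
static int hugetlbfs_migrate_page(struct address_space *mapping,
				  struct page *newpage, struct page *page,
				  enum migrate_mode mode)
{
	int rc;

	rc = migrate_huge_page_move_mapping(mapping, newpage, page);
	if (rc != MIGRATEPAGE_SUCCESS)
		return rc;

	/*
	 * page_private holds the hugetlbfs subpool pointer for hugetlb
	 * pages.  Move it to the new page so the reservation is released
	 * against the right subpool when the new page is eventually freed;
	 * otherwise the filesystem accounting leaks as shown above.
	 */
	if (page_private(page)) {
		set_page_private(newpage, page_private(page));
		set_page_private(page, 0);
	}

	migrate_page_copy(newpage, page);

	return MIGRATEPAGE_SUCCESS;
}

The mm/migrate.c side is just a guard before the move: if page_private(hpage)
is set but page_mapping(hpage) is NULL, the page is in the window between
pagecache removal and free_huge_page(), the callback above would never run,
so migration of that page is failed with -EBUSY instead.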