On 01/05/2017 07:15 AM, Michal Hocko wrote: > Hi, > we have a customer report on an older kernel (3.12) but I believe the > same issue is present in the current vanilla kernel. There is a race > between mmap trying to do a reservation which fails when racing with > truncate_hugepages. See the reproduced attached. > > It should go like this (analysis come from the customer and I hope I > haven't screwed their write up). > > : Task (T1) does mmap and calls into gather_surplus_pages(), looking for N > : pages. It determines it needs to allocate N pages, drops the lock, and > : does so. > : > : We will have: > : hstate->resv_huge_pages == N > : hstate->free_huge_pages == N > : > : That mapping is then munmap()ed by task T2, which truncates the file: > : > : truncate_hugepages() { > : for each page of the inode after lstart { > : truncate_huge_page(page) { > : hugetlb_unreserve_pages() { > : hugetlb_acct_memory() { > : return_unused_surplus_pages() { > : > : return_unused_surplus_pages() drops h->resv_huge_pages to 0, then > : begins calling free_pool_huge_page() N times: > : > : h->resv_huge_pages -= unused_resv_pages > : while (nr_pages--) { > : free_pool_huge_page(h, &node_states[N_MEMORY], 1) { > : h->free_huge_pages--; > : } > : cond_resched_lock(&hugetlb_lock); > : } > : > : But the cond_resched_lock() triggers, and it releases the lock with > : > : h->resv_huge_pages == 0 > : h->free_huge_pages == M << N > : > : T1 having completed its allocations with allocated == N now > : acquires the lock, and recomputes > : > : needed = (h->resv_huge_pages + delta) - (h->free_huge_pages + allocated); > : > : needed = N - (M + N) = -M > : > : Then > : > : needed += N = -M+N > : h->resv_huge_pages += N = N > : > : It frees N-M pages to the hugetlb pool via enqueue_huge_page(), > : > : list_for_each_entry_safe(page, tmp, &surplus_list, lru) { > : if ((--needed) < 0) > : break; > : /* > : * This page is now managed by the hugetlb allocator and has > : * no users -- drop the buddy allocator's reference. > : */ > : put_page_testzero(page); > : VM_BUG_ON(page_count(page)); > : enqueue_huge_page(h, page) { > : h->free_huge_pages++; > : } > : } > : > : h->resv_huge_pages == N > : h->free_huge_pages == N-M Are you sure about free_huge_page? When we entered the routine h->free_huge_pages == M << N After the above loop, I think h->free_huge_pages == M + (N-M) > : > : It releases the lock in order to free the remainder of surplus_list > : via put_page(). > : > : When it releases the lock, T1 reclaims it and returns from > : gather_surplus_pages(). > : > : But then hugetlb_acct_memory() checks > : > : if (delta > cpuset_mems_nr(h->free_huge_pages_node)) { > : return_unused_surplus_pages(h, delta); > : goto out; > : } > : > : and returns -ENOMEM. I'm wondering if this may have more to do with numa allocations of surplus pages. Do you know if customer uses any memory policy for allocations? There was a change after 3.12 for this (commit 099730d67417). > > The cond_resched has been added by 7848a4bf51b3 ("mm/hugetlb.c: add > cond_resched_lock() in return_unused_surplus_pages()") and it smells > fishy AFAICT. It leaves the inconsistent state of the hstate behind. > I guess we want to uncommit the reservation one page at the time, something like: > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index 3edb759c5c7d..e3a599146d7c 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -1783,12 +1783,13 @@ static void return_unused_surplus_pages(struct hstate *h, > { > unsigned long nr_pages; > > - /* Uncommit the reservation */ > - h->resv_huge_pages -= unused_resv_pages; > > /* Cannot return gigantic pages currently */ > - if (hstate_is_gigantic(h)) > + if (hstate_is_gigantic(h)) { > + /* Uncommit the reservation */ > + h->resv_huge_pages -= unused_resv_pages; > return; > + } > > nr_pages = min(unused_resv_pages, h->surplus_huge_pages); > > @@ -1803,6 +1804,7 @@ static void return_unused_surplus_pages(struct hstate *h, > while (nr_pages--) { > if (!free_pool_huge_page(h, &node_states[N_MEMORY], 1)) > break; > + h->resv_huge_pages--; > cond_resched_lock(&hugetlb_lock); > } > } > > but I am just not getting the nr_pages = min... part and the way thing > how we can have less surplus_huge_pages than unused_resv_pages.... Think about the case where there are pre-allocated huge pages in the mix. Suppose you want to reserve 5 pages via mmap. There are 3 pre-allocated free pages which can be used for the reservation. However, 2 additional surplus pages will need to be allocated to cover all the reservations. In this case, I believe the code above would have: unused_resv_pages = 5 h->surplus_huge_pages = 2 So, the loop would only decrement resv_huge_pages by 2 and leak 3 pages. > This > whole code is so confusing Yes, I wrote about 5 replies to this e-mail and deleted them before hitting send as I later realized they were incorrect. I'm going to add to 'hugetlb reservations' to your proposed LSF/MM topic of areas in need of attention. > whole code is so confusing that I would even rather go with a simple > revert of 7848a4bf51b3 which would be much easier for the stable backport. > > What do you guys think? Let me think about it some more. At first, I thought it certainly was a bad idea to drop the lock in return_unused_surplus_pages. But, the more I think about it, the more I think it is OK. There should not be a problem with dropping the reserve count all at once. The reserve map which corresponds to the global reserve count has already been cleared. -- Mike Kravetz -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>