On 11/17/2016 09:52 PM, Mike Kravetz wrote: > On 11/17/2016 04:05 PM, Andrea Arcangeli wrote: >> On Thu, Nov 17, 2016 at 11:26:17AM -0800, Mike Kravetz wrote: >>> On 11/17/2016 07:40 AM, Andrea Arcangeli wrote: >>>> On Wed, Nov 16, 2016 at 10:53:39AM -0800, Mike Kravetz wrote: >>>>> I was running some tests with error injection to exercise the error >>>>> path and noticed the reservation leaks as the system eventually ran >>>>> out of huge pages. I need to think about it some more, but we may >>>>> want to at least do something like the following before put_page (with >>>>> a BIG comment): >>>>> >>>>> if (unlikely(PagePrivate(page))) >>>>> ClearPagePrivate(page); >>>>> >>>>> That would at least keep the global reservation count from increasing. >>>>> Let me look into that. >>>> >>>> However what happens if the old vma got munmapped >>> >>> When the huge page was allocated, the reservation map associated with >>> the vma was marked to indicate the reservation was consumed. In addition >>> the global reservation count and subpool count were adjusted to account >>> for the page allocation. So, when the vma gets unmapped the reservation >>> map will be examined. Since the map indicates the reservation was consumed, >>> no adjustment will be made to the global or subpool reservation count. >> >> ClearPagePrivate before put_page, will simply avoid to run >> h->resv_huge_pages++? > > Correct > >> Not increasing resv_huge_pages means more non reserved allocations >> will pass. That is a global value though, how is it ok to leave it >> permanently lower? > > Recall that h->resv_huge_pages is incremented when a mapping/vma is > created to indicate that reservations exist. It is decremented when > a huge page is allocated. In addition, the reservation map is > initially set up to indicate that reservations exist for the > pages in the mapping. When a page is allocated the map is modified > to indicate a reservation was consumed. > > So path, resv_huge_pages was decremented as the page was allocated > and the reserve map indicates the reservation was consumed. At this > time, resv_huge_pages and the reserve map accurately reflect the > state of reservations in the vma and globally. > > In this path, we were unable instantiate the page in the mapping/vma > so we must free the page. The "ideal" solution would be to adjust the > reserve mat to indicate the reservation was not consumed, and increment > resv_huge_pages. However, since we dropped mmap_sema we can not adjust > the reserve map. So, the questions is what should be done with > resv_huge_pages? There are only two options. > 1) Leave it "as is" when the page is free'ed. > In this case, the reserve map will indicate the reservation was consumed > but there will not be an associated page in the mapping. Therefore, > it is possible that a susbequent fault can happen on the page. At that > time, it will appear as though the reservation for the page was already > consumed. Therefore, is there are not any available pages the fault will > fail (ENOMEM). Note that this only impacts subsequent faults of this > specific mapping/page. > 2) Increment it when the page is free'ed > As above, the reserve map will still indicate the reservation was > consumed > and subsequent faults can only be satisfied if there are available pages. > However, there is now a global reservation as indicated by > resv_huge_pages > that is not associated with any mapping. What this means is that there > is a reserved page and nobody can use. The reservation is being held for > a mapping that has already consumed the reservation. > >> If PagePrivate was set, it means alloc_huge_page already run this: >> >> SetPagePrivate(page); >> h->resv_huge_pages--; >> >> But it would also have set a reserve map on the vma for that range >> before that. >> >> When the vma is destroyed the reserve is flushed back to global, minus >> the consumed pages (reserve = (end - start) - region_count(resv, >> start, end)). > > Consider a very simply 1 page vma. Obviously, (end - start) = 1 and > as mentioned above the reserve map was adjusted to indicate the reservation > was consumed. So, region_count(resv, start, end) is also going to = 1 > and reserve will = 0. Therefore, no adjustment of resv_huge_pages will > be made when the the vma is destroyed. > > The same happens for the the single page within a larger vma. > >> Why should then we skip h->resv_huge_pages++ for the consumed pages by >> running ClearPagePrivate? >> >> It's not clear was wrong in the first place considering >> put_page->free_huge_page() acts on the global stuff only? > > See the description of 2) above. > >> void restore_reserve_on_error(struct hstate *h, struct vm_area_struct *vma, >> unsigned long address, struct page *page) >> { >> if (unlikely(PagePrivate(page))) { >> long rc = vma_needs_reservation(h, vma, address); >> >> if (unlikely(rc < 0)) { >> /* >> * Rare out of memory condition in reserve map >> * manipulation. Clear PagePrivate so that >> * global reserve count will not be incremented >> * by free_huge_page. This will make it appear >> * as though the reservation for this page was >> * consumed. This may prevent the task from >> * faulting in the page at a later time. This >> * is better than inconsistent global huge page >> * accounting of reserve counts. >> */ >> ClearPagePrivate(page); >> >> The ClearPagePrivate was run above because vma_needs_reservation run >> out of memory and couldn't be added? > > restore_reserve_on_error tries to do the "ideal" cleanup by adjusting > the reserve map to indicate the reservation was not consumed. If it > can do this, then it does not ClearPagePrivate(page) as the global > reservation count SHOULD be incremented as the reserve map now indicates > the reservation as not consumed. > > The only time it can not adjust the reserve map is if the reserve map > map manipulation routines fail. They would only fail if they can not > allocate a 32 byte structure used to track reservations. As noted in > the comments this is rare, and if we can't allocate 32 bytes I suspect > there are other issues. But, in this case ClearPagePrivate is run so > that when the page is free'ed the global count will be consistent with > the reserve map. This is essentially why I think we should do the same > in __mcopy_atomic_hugetlb. > >> So I suppose the vma reservation wasn't possible in the above case, in >> our allocation case alloc_huge_page succeeded at those reserve maps >> allocations: >> >> map_chg = gbl_chg = vma_needs_reservation(h, vma, addr); >> if (map_chg < 0) >> return ERR_PTR(-ENOMEM); >> [..] >> if (!avoid_reserve && vma_has_reserves(vma, gbl_chg)) { >> SetPagePrivate(page); >> h->resv_huge_pages--; >> } >> > > It is slightly different. In alloc_huge_page we are adjusting the reserve > map to indicate the reservation was consumed. In restore_reserve_on_error > we are adjusting the map to indicate the reservation was not consumed. > >>>> and a new compatible >>>> vma was instantiated and passes revalidation fine? The reserved page >>>> of the old vma goes to a different vma then? >>> >>> No, the new vma should get a new reservation. It can not use the old >>> reservation as it was associated with the old vma. This is at least >>> the case for private mappings where the reservation maps are associated >>> with the vma. >> >> You're not suggesting to call ClearPagePrivate in the second pass of >> the "retry" loop if all goes fine and second pass succeeds, but only if >> we end up in a error of revalidation at the second pass? > > That is correct. Only on the error path. We will only call put_page > on the error path and we only call ClearPagePrivate before put_page. > >> >> So the page with PagePrivate set could go to a different vma despite >> the vma reserve map was accounted for in the original vma? Is that ok? > > Yes, see the potential issue with subsequent faults. It is not the > "ideal" case that would be sone in restore_reserve_on_error, but I > think it is the beast alternative. And, I'm guessing this error > path is not something we will hit frequently? Or, do you think it > will be exercised often? > >>>> This reservation code is complex and has lots of special cases anyway, >>>> but the main concern at this point is the >>>> set_page_private(subpool_vma(vma)) released by >>>> hugetlb_vm_op_close->unlock_or_release_subpool. >>> >>> Do note that set_page_private(subpool_vma(vma)) just indicates which >>> subpool was used when the huge page was allocated. I do not believe >>> there is any connection made to the vma. The vma is only used to get >>> to the inode and superblock which contains subpool information. With >>> the subpool stored in page_private, the subpool count can be adjusted >>> at free_huge_page time. Also note that the subpool can not be free'ed >>> in unlock_or_release_subpool until put_page is complete for the page. >>> This is because the page is accounted for in spool->used_hpages. >> >> Yes I figured myself shortly later used_hpages. So there's no risk of >> use after free on the subpool pointed by the page at least. >> >> I also considered shutting down this accounting entirely by calling >> alloc_huge_page(allow_reserve = 0) in hugetlbfs mcopy atomic... Can't >> we start that way so we don't have to worry about the reservation >> accounting at all? > > Does "allow_reserve = 0" indicate alloc_huge_page(... avoid_reserve = 1)? > I think, that is what you are asking. > > Well we could do that. But, it could cause failures. Again consider > an overly simple case of a 1 page vma. Also, suppose there is only one > huge page in the system. When the vma is created/mapped the one huge > page is reserved. However, we call alloc_huge_page(...avoid_reserve = 1) > the allocation will fail as we indicated that reservations should not > be used. If there was another page in the system, or we were configured > to allocate surplus pages then it may succeed. > >>>> Aside the accounting, what about the page_private(page) subpool? It's >>>> used by huge_page_free which would get out of sync with vma/inode >>>> destruction if we release the mmap_sem. >>> >>> I do not think that is the case. Reservation and subpool adjustments >>> made at vma/inode destruction time are based on entries in the reservation >>> map. Those entries are created/destroyed when holding mmap_sem. >>> >>>> struct hugepage_subpool *spool = >>>> (struct hugepage_subpool *)page_private(page); >>>> >>>> I think in the revalidation code we need to check if >>>> page_private(page) still matches the subpool_vma(vma), if it doesn't >>>> and it's a stale pointer, we can't even call put_page before fixing up >>>> the page_private first. >>> >>> I do not think that is correct. page_private(page) points to the subpool >>> used when the page was allocated. Therefore, adjustments were made to that >>> subpool when the page was allocated. We need to adjust the same subpool >>> when calling put_page. I don't think there is any need to look at the >>> vma/subpool_vma(vma). If it doesn't match, we certainly do not want to >>> adjust counts in a potentially different subpool when calling page_put. >> >> Isn't the subpool different for every mountpoint of hugetlbfs? > > yes. > >> >> The old vma subpool can't be a stale pointer, because of the >> used_hpages but if there are two different hugetlbfs mounts the >> subpool seems to come from the superblock so it may change after we >> release the mmap_sem. >> >> Don't we have to add a check for the new vma subpool change against >> the page->private? >> >> Otherwise we'd be putting the page in some other subpool than the one >> it was allocated from, as long as they pass the vma_hpagesize != >> vma_kernel_pagesize(dst_vma) check. > > I don't think there is an issue. page->private will be set to point to > the subpool where the page was originally charged. That will not change > until the page_put, and the same subpool will be used to adjust the > count. The page can't go to another vma, without first passing through > free_huge_page and doing proper subpool accounting. If it does go to > another vma, then alloc_huge_page will set it to the proper subpool. > >>> As you said, this reservation code is complex. It might be good if >>> Hillf could comment as he understands this code. >>> >>> I still believe a simple call to ClearPagePrivate(page) may be all we >>> need to do in the error path. If this is the case, the only downside >>> is that it would appear the reservation was consumed for that page. >>> So, subsequent faults 'might' not get a huge page. >> >> I thought running out of hugepages is what you experienced already >> with the current code if using error injection. > > Yes, what I experienced is described in 2) of my first comment in this > e-mail. I would eventually end up with all huge pages being "reserved" > and unable to use/allocate them. So, none of the huge pages in the system > (or memory associated with them) could be used until reboot. :( I am not sure if you are convinced ClearPagePrivate is an acceptable solution to this issue. If you do, here is the simple patch to add it and an appropriate comment. From: Mike Kravetz <mike.kravetz@xxxxxxxxxx> Date: Mon, 21 Nov 2016 14:16:33 -0800 Subject: [PATCH] userfaultfd: hugetlbfs: reserve count on error in __mcopy_atomic_hugetlb If __mcopy_atomic_hugetlb exits with an error, put_page will be called if a huge page was allocated and needs to be freed. If a reservation was associated with the huge page, the PagePrivate flag will be set. Clear PagePrivate before calling put_page/free_huge_page so that the global reservation count is not incremented. Signed-off-by: Mike Kravetz <mike.kravetz@xxxxxxxxxx> --- mm/userfaultfd.c | 17 ++++++++++++++++- 1 file changed, 16 insertions(+), 1 deletion(-) diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index b565481..d56ba83 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -303,8 +303,23 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm, out_unlock: up_read(&dst_mm->mmap_sem); out: - if (page) + if (page) { + /* + * We encountered an error and are about to free a newly + * allocated huge page. It is possible that there was a + * reservation associated with the page that has been + * consumed. See the routine restore_reserve_on_error + * for details. Unfortunately, we can not call + * restore_reserve_on_error now as it would require holding + * mmap_sem. Clear the PagePrivate flag so that the global + * reserve count will not be incremented in free_huge_page. + * The reservation map will still indicate the reservation + * was consumed and possibly prevent later page allocation. + * This is better than leaking a global reservation. + */ + ClearPagePrivate(page); put_page(page); + } BUG_ON(copied < 0); BUG_ON(err > 0); BUG_ON(!copied && !err); -- 2.7.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>