On Fri, 11 Jun 2021 18:15:45 +0200 Jann Horn <jannh@xxxxxxxxxx> wrote:

> try_grab_compound_head() is used to grab a reference to a page from
> get_user_pages_fast(), which is only protected against concurrent
> freeing of page tables (via local_irq_save()), but not against
> concurrent TLB flushes, freeing of data pages, or splitting of compound
> pages.
>
> Because no reference is held to the page when try_grab_compound_head()
> is called, the page may have been freed and reallocated by the time its
> refcount has been elevated; therefore, once we're holding a stable
> reference to the page, the caller re-checks whether the PTE still points
> to the same page (with the same access rights).
>
> The problem is that try_grab_compound_head() has to grab a reference on
> the head page; but between the time we look up what the head page is and
> the time we actually grab a reference on the head page, the compound
> page may have been split up (either explicitly through split_huge_page()
> or by freeing the compound page to the buddy allocator and then
> allocating its individual order-0 pages).
> If that happens, get_user_pages_fast() may end up returning the right
> page but lifting the refcount on a now-unrelated page, leading to
> use-after-free of pages.
>
> To fix it:
> Re-check whether the pages still belong together after lifting the
> refcount on the head page.
> Move anything else that checks compound_head(page) below the refcount
> increment.
>
> This can't actually happen on bare-metal x86 (because there, disabling
> IRQs locks out remote TLB flushes), but it can happen on virtualized x86
> (e.g. under KVM) and probably also on arm64. The race window is pretty
> narrow, and constantly allocating and shattering hugepages isn't exactly
> fast; for now I've only managed to reproduce this in an x86 KVM guest with
> an artificially widened timing window (by adding a loop that repeatedly
> calls `inl(0x3f8 + 5)` in `try_get_compound_head()` to force VM exits,
> so that PV TLB flushes are used instead of IPIs).
>
> ...
>
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -43,8 +43,21 @@ static void hpage_pincount_sub(struct page *page, int refs)
>
>  	atomic_sub(refs, compound_pincount_ptr(page));
>  }
>
> +/* Equivalent to calling put_page() @refs times. */
> +static void put_page_refs(struct page *page, int refs)
> +{
> +	VM_BUG_ON_PAGE(page_ref_count(page) < refs, page);

I don't think there's a need to nuke the whole kernel in this case.
Can we warn then simply leak the page? That way we have a much better
chance of getting a good bug report.

> +	/*
> +	 * Calling put_page() for each ref is unnecessarily slow. Only the last
> +	 * ref needs a put_page().
> +	 */
> +	if (refs > 1)
> +		page_ref_sub(page, refs - 1);
> +	put_page(page);
> +}
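
For the VM_BUG_ON_PAGE() above, something along these lines, perhaps
(just a sketch, not even compile tested):

	/* Equivalent to calling put_page() @refs times. */
	static void put_page_refs(struct page *page, int refs)
	{
		/*
		 * A caller passing a bogus @refs count is a bug, but warning
		 * once and leaking the page gives us a backtrace to work with
		 * instead of taking the whole machine down.
		 */
		if (WARN_ON_ONCE(page_ref_count(page) < refs))
			return;

		/*
		 * Calling put_page() for each ref is unnecessarily slow. Only
		 * the last ref needs a put_page().
		 */
		if (refs > 1)
			page_ref_sub(page, refs - 1);
		put_page(page);
	}

That way the failure shows up in the logs once and the rest of the
system keeps running.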
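
Side note, mainly for the archives: as I read the changelog, the
re-check itself boils down to something like the below in
try_grab_compound_head(), reusing the new put_page_refs() helper.
This is paraphrased from the patch description rather than quoted
from the patch:

	struct page *head = compound_head(page);

	/* ... elevate the refcount (or pincount) on head ... */

	/*
	 * The compound page may have been split, or freed and reallocated,
	 * between the compound_head() lookup and the refcount bump, so
	 * re-check that @page and @head still belong together before
	 * reporting success.
	 */
	if (unlikely(compound_head(page) != head)) {
		put_page_refs(head, refs);
		return NULL;
	}
	return head;

If the compound page did get torn apart in that window, the refs are
dropped again and the caller bails out instead of returning a head
page that no longer has anything to do with @page.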