Hi,

this patch is an alternative implementation of the hugetlbfs directio optimization discussed earlier. Khalid and I looked into this last week. Khalid benchmarked an earlier version of this patch (fully equivalent as far as CPU overhead is concerned) and it didn't degrade performance compared to the PageHuge check in current upstream code, so we should be good.

The patch applies cleanly only after reverting 7cb2ef56e6a8b7b368b2e883a0a47d02fed66911. It's much easier to review in this form as it avoids all the alignment changes. After reviews I'll resend to Andrew against current upstream, squashing it with the revert.

I wanted to remove the _mapcount tail page refcounting for slab and hugetlbfs tails too. However, if the last put_page of a slab tail happens after the slab page isn't a slab page anymore (but is still compound, as it wasn't freed yet because of the tail pin), a VM_BUG_ON would trigger during the last (unpinning) put_page(slab_tail) with the mapcount underflow:

	VM_BUG_ON(page_mapcount(page) <= 0);

I'm not even sure if any driver is doing anything like that, but the current code would allow it. Pravin should know more about exactly when, and under which conditions, the last put_page is done on slab tail pages.

It should be possible to remove the _mapcount refcounting anyway, as it is only read by split_huge_page and so it doesn't actually matter if it underflows, but I prefer to keep the VM_BUG_ON. In fact I added one more VM_BUG_ON(!PageHead()) in this patch.

I also hadn't noticed earlier that we missed a PageHead check before calling __put_single_page(page_head), so I corrected that. It sounds very unlikely that it could have ever triggered, but it's still better to fix it.

I just booted it... it's not very well tested yet. Help with the testing is appreciated :).
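As a rough illustration of the underflow scenario described above, here is a minimal userspace sketch in C. It is not kernel code and not part of the patch: struct model_page, model_get_tail(), model_put_tail(), model_pin_then_unpin_after_slab_cleared() and the skip_tail_refcount flag are hypothetical names invented for this sketch. It only models a head refcount, a tail _mapcount (stored biased by -1, since page_mapcount() reports _mapcount + 1) and a slab flag on the head.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model of the fields involved; not kernel code. */
struct model_page {
	int count;	/* models the head page refcount (_count) */
	int mapcount;	/* models _mapcount, -1 means "no tail pins" */
	bool slab;	/* models PageSlab() on the head page */
};

/* Mirrors page_mapcount(): the stored value is biased by -1. */
static int model_page_mapcount(const struct model_page *p)
{
	return p->mapcount + 1;
}

/*
 * Take a pin on a tail page. With skip_tail_refcount set (the removal
 * considered in the text, an assumption of this sketch), a slab head
 * only gets its refcount bumped and the tail _mapcount is untouched.
 */
static void model_get_tail(struct model_page *head, struct model_page *tail,
			   bool skip_tail_refcount)
{
	head->count++;
	if (!(skip_tail_refcount && head->slab))
		tail->mapcount++;
}

/*
 * Drop a pin on a tail page. Returns false where the kernel would hit
 * VM_BUG_ON(page_mapcount(page) <= 0).
 */
static bool model_put_tail(struct model_page *head, struct model_page *tail,
			   bool skip_tail_refcount)
{
	if (skip_tail_refcount && head->slab) {
		head->count--;
		return true;
	}
	if (model_page_mapcount(tail) <= 0)
		return false;	/* mapcount would underflow */
	tail->mapcount--;
	head->count--;
	return true;
}

/*
 * Scenario from the text: the pin is taken while the head is a slab
 * page, the page stops being a slab page (but stays compound because
 * of the pin), and only then the last put on the tail happens.
 */
static bool model_pin_then_unpin_after_slab_cleared(bool skip_tail_refcount)
{
	struct model_page head = { .count = 1, .mapcount = -1, .slab = true };
	struct model_page tail = { .count = 0, .mapcount = -1, .slab = false };

	model_get_tail(&head, &tail, skip_tail_refcount);
	head.slab = false;
	return model_put_tail(&head, &tail, skip_tail_refcount);
}
```

In this model, calling model_pin_then_unpin_after_slab_cleared(false) keeps the tail refcounting and the final put passes the check, while passing true skips it at pin time and the final put trips the underflow check. That is the reason given above for keeping the _mapcount refcounting together with the VM_BUG_ON.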
=====
From e19e40b4105ed5a851cdb3a748430bd715b1626a Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli <aarcange@xxxxxxxxxx>
Date: Tue, 5 Nov 2013 19:53:55 +0100
Subject: [PATCH] mm: hugetlbfs: fix hugetlbfs optimization

The patch from commit 7cb2ef56e6a8b7b368b2e883a0a47d02fed66911 can cause dereference of a dangling pointer if split_huge_page runs during PageHuge() if there are updates to the tail_page->private field.

Also it is repeating compound_head twice for hugetlbfs and it is running compound_head+compound_trans_head for THP when a single one is needed in both cases.

The new code within the PageSlab() check doesn't need to verify that the THP page size is never bigger than the smallest hugetlbfs page size, to avoid memory corruption.

A longstanding theoretical race condition was found while fixing the above (see the change right after the skip_unlock label, that is relevant for the compound_lock path too).

Khalid has run performance tests to ensure these changes do not impact I/O performance.

Reported-by: Khalid Aziz <khalid.aziz@xxxxxxxxxx>
Signed-off-by: Andrea Arcangeli <aarcange@xxxxxxxxxx>
---
 include/linux/hugetlb.h |  6 ++++
 mm/hugetlb.c            | 17 ++++++++++
 mm/swap.c               | 85 +++++++++++++++++++++++++++++++++++++++----------
 3 files changed, 92 insertions(+), 16 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 0393270..6125579 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -31,6 +31,7 @@ struct hugepage_subpool *hugepage_new_subpool(long nr_blocks);
 void hugepage_put_subpool(struct hugepage_subpool *spool);
 
 int PageHuge(struct page *page);
+int PageHeadHuge(struct page *page_head);
 
 void reset_vma_resv_huge_pages(struct vm_area_struct *vma);
 int hugetlb_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *);
@@ -104,6 +105,11 @@ static inline int PageHuge(struct page *page)
 	return 0;
 }
 
+static inline int PageHeadHuge(struct page *page_head)
+{
+	return 0;
+}
+
 static inline void reset_vma_resv_huge_pages(struct vm_area_struct *vma)
 {
 }
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 0b7656e..f0a4ca4 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -736,6 +736,23 @@ int PageHuge(struct page *page)
 }
 EXPORT_SYMBOL_GPL(PageHuge);
 
+/*
+ * PageHeadHuge() only returns true for hugetlbfs head page, but not for
+ * normal or transparent huge pages.
+ */
+int PageHeadHuge(struct page *page_head)
+{
+	compound_page_dtor *dtor;
+
+	if (!PageHead(page_head))
+		return 0;
+
+	dtor = get_compound_page_dtor(page_head);
+
+	return dtor == free_huge_page;
+}
+EXPORT_SYMBOL_GPL(PageHeadHuge);
+
 pgoff_t __basepage_index(struct page *page)
 {
 	struct page *page_head = compound_head(page);
diff --git a/mm/swap.c b/mm/swap.c
index f3f54df..519aff5 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -31,6 +31,7 @@
 #include <linux/memcontrol.h>
 #include <linux/gfp.h>
 #include <linux/uio.h>
+#include <linux/hugetlb.h>
 
 #include "internal.h"
 
@@ -90,21 +91,41 @@ static void put_compound_page(struct page *page)
 			unsigned long flags;
 
 			/*
-			 * THP can not break up slab pages so avoid taking
-			 * compound_lock(). Slab performs non-atomic bit ops
-			 * on page->flags for better performance. In particular
-			 * slab_unlock() in slub used to be a hot path. It is
-			 * still hot on arches that do not support
-			 * this_cpu_cmpxchg_double().
+			 * THP can not break up slab pages or
+			 * hugetlbfs pages so avoid taking
+			 * compound_lock() and skip the tail page
+			 * refcounting (in _mapcount) too. Slab
+			 * performs non-atomic bit ops on page->flags
+			 * for better performance. In particular
+			 * slab_unlock() in slub used to be a hot
+			 * path. It is still hot on arches that do
+			 * not support this_cpu_cmpxchg_double().
 			 */
-			if (PageSlab(page_head)) {
-				if (PageTail(page)) {
+			if (PageSlab(page_head) || PageHeadHuge(page_head)) {
+				if (likely(PageTail(page))) {
+					/*
+					 * __split_huge_page_refcount
+					 * cannot race here.
+					 */
+					VM_BUG_ON(!PageHead(page_head));
+					atomic_dec(&page->_mapcount);
 					if (put_page_testzero(page_head))
 						VM_BUG_ON(1);
-
-					atomic_dec(&page->_mapcount);
-					goto skip_lock_tail;
+					if (put_page_testzero(page_head))
+						__put_compound_page(page_head);
+					return;
 				} else
+					/*
+					 * __split_huge_page_refcount
+					 * run before us, "page" was a
+					 * THP tail. The split
+					 * page_head has been freed
+					 * and reallocated as slab or
+					 * hugetlbfs page of smaller
+					 * order (only possible if
+					 * reallocated as slab on
+					 * x86).
+					 */
 					goto skip_lock;
 			}
 			/*
@@ -118,8 +139,27 @@ static void put_compound_page(struct page *page)
 				/* __split_huge_page_refcount run before us */
 				compound_unlock_irqrestore(page_head, flags);
 skip_lock:
-				if (put_page_testzero(page_head))
-					__put_single_page(page_head);
+				if (put_page_testzero(page_head)) {
+					/*
+					 * The head page may have been
+					 * freed and reallocated as a
+					 * compound page of smaller
+					 * order and then freed again.
+					 * All we know is that it
+					 * cannot have become: a THP
+					 * page, a compound page of
+					 * higher order, a tail page.
+					 * That is because we still
+					 * hold the refcount of the
+					 * split THP tail and
+					 * page_head was the THP head
+					 * before the split.
+					 */
+					if (PageHead(page_head))
+						__put_compound_page(page_head);
+					else
+						__put_single_page(page_head);
+				}
 out_put_single:
 				if (put_page_testzero(page))
 					__put_single_page(page);
@@ -141,7 +181,6 @@ out_put_single:
 			VM_BUG_ON(atomic_read(&page->_count) != 0);
 			compound_unlock_irqrestore(page_head, flags);
 
-skip_lock_tail:
 			if (put_page_testzero(page_head)) {
 				if (PageHead(page_head))
 					__put_compound_page(page_head);
@@ -189,13 +228,27 @@ bool __get_page_tail(struct page *page)
 	struct page *page_head = compound_trans_head(page);
 
 	if (likely(page != page_head && get_page_unless_zero(page_head))) {
-		/* Ref to put_compound_page() comment. */
-		if (PageSlab(page_head)) {
+		if (PageSlab(page_head) || PageHeadHuge(page_head)) {
 			if (likely(PageTail(page))) {
+				/*
+				 * This is a hugetlbfs page or a slab
+				 * page. __split_huge_page_refcount
+				 * cannot race here.
+				 */
+				VM_BUG_ON(!PageHead(page_head));
 				__get_page_tail_foll(page, false);
 				return true;
 			} else {
+				/*
+				 * __split_huge_page_refcount run
+				 * before us, "page" was a THP
+				 * tail. The split page_head has been
+				 * freed and reallocated as slab or
+				 * hugetlbfs page of smaller order
+				 * (only possible if reallocated as
+				 * slab on x86).
+				 */
 				put_page(page_head);
 				return false;
 			}