Re: [PATCH 1/2] mm: thp: introduce thp_mmu_gather to pin tail pages during MMU gather

Andrea Arcangeli <aarcange@xxxxxxxxxx> · Mon, 23 Nov 2015 17:03:02 +0100

On Thu, Nov 19, 2015 at 04:22:55PM -0800, Andrew Morton wrote:
> On Thu, 19 Nov 2015 14:00:51 +0100 Andrea Arcangeli <aarcange@xxxxxxxxxx> wrote:
> 
> > This theoretical SMP race condition was found with source review. No
> > real life app could be affected as the result of freeing memory while
> > accessing it is either undefined or it's a workload the produces no
> > information.
> > 
> > For something to go wrong because the SMP race condition triggered,
> > it'd require a further tiny window within the SMP race condition
> > window. So nothing bad is happening in practice even if the SMP race
> > condition triggers. It's still better to apply the fix to have the
> > math guarantee.
> > 
> > The fix just adds a thp_mmu_gather atomic_t counter to the THP pages,
> > so split_huge_page can elevate the tail page count accordingly and
> > leave the tail page freeing task to whoever elevated thp_mmu_gather.
> > 
> 
> This is a pretty nasty patch :( We now have random page*'s with bit 0
> set floating around in mmu_gather.__pages[].  It assumes/requires that

Yes, and bit 0 is only relevant for the mmu_gather structure and its
users.

> nobody uses those pages until they hit release_pages().  And the tlb
> flushing code is pretty twisty, with various Kconfig and arch dependent
> handlers.

I already reviewed all callers and the mmu_gather freeing path for all
archs to be sure they all take the two paths that I updated in order
to free the pages. They call free_page_and_swap_cache or
free_pages_and_swap_cache.

The freeing is using common code VM functions, and it shouldn't
improvise calling put_page() manually, the freeing has to take care of
collecting the swapcache if needed etc... it has to deal with VM
details the arch is not aware about.

> Is there no nicer way?

We can grow the size of the mmu_gather to keep track that the page was
THP before the pmd_lock was dropped, in a separate bit from the struct
page pointer, but it'll take more memory.

This bit 0 looks a walk in the park if compared to the bit 0 in
page->compound_head that was just introduced. The compound_head bit 0
isn't only visible to the mmu_gather users (which should never try to
mess with the page pointer themself) and it collides with lru/next,
rcu_head.next users.

If you prefer me to enlarge the mmu_gather structure I can do that.

1 bit of extra information needs to be extracted before dropping the
pmd_lock in zap_huge_pmd() and it has to be available in
release_pages(), to know if the tail pages needs an explicit put_page
or not. It's basically a bit in the local stack, except it's not in
the local stack because it is an array of pages, so it needs many
bits and it's stored in the mmu_gather along the page.

Aside from the implementation of bit 0, I can't think of a simpler
design that provides for the same performance and low locking overhead
(this patch actually looks like an optimization when it's not helping
to prevent the race, because to fix the race I had to reduce the
number of times the lru_lock is taken in release_pages).

> > +/*
> > + * free_trans_huge_page_list() is used to free the pages returned by
> > + * trans_huge_page_release() (if still PageTransHuge()) in
> > + * release_pages().
> > + */
> 
> There is no function trans_huge_page_release().

Oops I updated the function name but not the comment... thanks!
Andrea

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index f7ae08f..2810322 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -225,9 +225,8 @@ static inline int trans_huge_mmu_gather_count(struct page *page)
 }
 
 /*
- * free_trans_huge_page_list() is used to free the pages returned by
- * trans_huge_page_release() (if still PageTransHuge()) in
- * release_pages().
+ * free_trans_huge_page_list() is used to free THP pages (if still
+ * PageTransHuge()) in release_pages().
  */
 extern void free_trans_huge_page_list(struct list_head *list);
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>