Re: [PATCH 1/2] mm: thp: introduce thp_mmu_gather to pin tail pages during MMU gather

"Aneesh Kumar K.V" <aneesh.kumar@xxxxxxxxxxxxxxxxxx> · Sat, 05 Dec 2015 13:54:51 +0530

Andrea Arcangeli <aarcange@xxxxxxxxxx> writes:

> On Thu, Nov 19, 2015 at 04:22:55PM -0800, Andrew Morton wrote:
>> On Thu, 19 Nov 2015 14:00:51 +0100 Andrea Arcangeli <aarcange@xxxxxxxxxx> wrote:
>> 
>> > This theoretical SMP race condition was found with source review. No
>> > real life app could be affected as the result of freeing memory while
>> > accessing it is either undefined or it's a workload the produces no
>> > information.
>> > 
>> > For something to go wrong because the SMP race condition triggered,
>> > it'd require a further tiny window within the SMP race condition
>> > window. So nothing bad is happening in practice even if the SMP race
>> > condition triggers. It's still better to apply the fix to have the
>> > math guarantee.
>> > 
>> > The fix just adds a thp_mmu_gather atomic_t counter to the THP pages,
>> > so split_huge_page can elevate the tail page count accordingly and
>> > leave the tail page freeing task to whoever elevated thp_mmu_gather.
>> > 
>> 
>> This is a pretty nasty patch :( We now have random page*'s with bit 0
>> set floating around in mmu_gather.__pages[].  It assumes/requires that
>
> Yes, and bit 0 is only relevant for the mmu_gather structure and its
> users.
>
>> nobody uses those pages until they hit release_pages().  And the tlb
>> flushing code is pretty twisty, with various Kconfig and arch dependent
>> handlers.
>
> I already reviewed all callers and the mmu_gather freeing path for all
> archs to be sure they all take the two paths that I updated in order
> to free the pages. They call free_page_and_swap_cache or
> free_pages_and_swap_cache.
>
> The freeing is using common code VM functions, and it shouldn't
> improvise calling put_page() manually, the freeing has to take care of
> collecting the swapcache if needed etc... it has to deal with VM
> details the arch is not aware about.

But it is still lot of really complicated code. 

>
>> Is there no nicer way?
>
> We can grow the size of the mmu_gather to keep track that the page was
> THP before the pmd_lock was dropped, in a separate bit from the struct
> page pointer, but it'll take more memory.
>
> This bit 0 looks a walk in the park if compared to the bit 0 in
> page->compound_head that was just introduced. The compound_head bit 0
> isn't only visible to the mmu_gather users (which should never try to
> mess with the page pointer themself) and it collides with lru/next,
> rcu_head.next users.
>
> If you prefer me to enlarge the mmu_gather structure I can do that.
>

If we can update mmu_gather to track the page size of the pages, that
will also help some archs to better implement tlb_flush(struct
mmu_gather *). Right now arch/powerpc/mm/tlb_nohash.c does flush the tlb
mapping for the entire mm_struct. 

we can also make sure that we do a force flush when we are trying to
gather pages of different size. So one instance of mmu_gather will end
up gathering pages of specific size only ?

> 1 bit of extra information needs to be extracted before dropping the
> pmd_lock in zap_huge_pmd() and it has to be available in
> release_pages(), to know if the tail pages needs an explicit put_page
> or not. It's basically a bit in the local stack, except it's not in
> the local stack because it is an array of pages, so it needs many
> bits and it's stored in the mmu_gather along the page.
>
> Aside from the implementation of bit 0, I can't think of a simpler
> design that provides for the same performance and low locking overhead
> (this patch actually looks like an optimization when it's not helping
> to prevent the race, because to fix the race I had to reduce the
> number of times the lru_lock is taken in release_pages).
>
>> > +/*
>> > + * free_trans_huge_page_list() is used to free the pages returned by
>> > + * trans_huge_page_release() (if still PageTransHuge()) in
>> > + * release_pages().
>> > + */
>> 
>> There is no function trans_huge_page_release().
>
> Oops I updated the function name but not the comment... thanks!
> Andrea
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index f7ae08f..2810322 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -225,9 +225,8 @@ static inline int trans_huge_mmu_gather_count(struct page *page)
>  }
>  
>  /*
> - * free_trans_huge_page_list() is used to free the pages returned by
> - * trans_huge_page_release() (if still PageTransHuge()) in
> - * release_pages().
> + * free_trans_huge_page_list() is used to free THP pages (if still
> + * PageTransHuge()) in release_pages().
>   */
>  extern void free_trans_huge_page_list(struct list_head *list);
>  
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>