Re: [rfc 5/5] mincore: transparent huge page support

Andrea Arcangeli <aarcange@xxxxxxxxxx> · Thu, 25 Mar 2010 01:42:54 +0100

On Thu, Mar 25, 2010 at 01:07:49AM +0100, Johannes Weiner wrote:
> Stupid me.  I knew that, I just hadn't internalized it enough to do it
> right :)

Never mind ;)

> Btw, unless I miss something else, this is the same in follow_page()?
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 22ee158..6c26042 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1301,18 +1301,14 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
>  	}
>  	if (pmd_trans_huge(*pmd)) {
>  		spin_lock(&mm->page_table_lock);

follow_page already checked pmd_trans_huge first outside the lock.

> -		if (likely(pmd_trans_huge(*pmd))) {

And then again inside. We have to re-check it inside to be safe,
otherwise we've to /* fall through */

> -			if (unlikely(pmd_trans_splitting(*pmd))) {
> -				spin_unlock(&mm->page_table_lock);
> -				wait_split_huge_page(vma->anon_vma, pmd);
> -			} else {
> -				page = follow_trans_huge_pmd(mm, address,
> -							     pmd, flags);

What I want to do in mincore is to call something like
mincore_trans_huge_pmd() to remove the #ifdef to make everyone happy.

> -				spin_unlock(&mm->page_table_lock);
> -				goto out;
> -			}
> -		} else
> +		if (unlikely(pmd_trans_splitting(*pmd))) {
>  			spin_unlock(&mm->page_table_lock);
> +			wait_split_huge_page(vma->anon_vma, pmd);
> +		} else {
> +			page = follow_trans_huge_pmd(mm, address, pmd, flags);
> +			spin_unlock(&mm->page_table_lock);
> +			goto out;
> +		}

So it would miss one needed check inside the lock and it could lead to
call follow_trans_huge_pmd with a not huge pmd which is invalid.

Also note, touching with C language stuff that can change under you
like we do in the first check outside the lock (and isn't marked
volatile) may have unpredictable results in theory. But in practice we
do stuff like this all the time. Specifically in this case it's just
one bitflag check so it should be safe enough considering that
test_bit is implemented in C on x86 and does the same thing,
spin_is_locked likely is also implemented in C etc... In other cases
like "switch" parameters, gcc can implement it with a jump table, and
if the value changes under gcc it may jump beyond the end of the
table. It's a tradeoff between writing perfect C and having to deal
with asm all the time (or worse slowdown marking pmd volatile). Some
ages ago I guess I proposed to fix this including not assuming that
writes to the ptes are atomic with one instruction if done in C, but
nobody cared and in effect it works reliably also this way and it's a
bit simpler... ;).

> Knock yourself out :-)

As you wish! ;)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxxx  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>