Re: [QUESTION] Plain dereference and READ_ONCE() in fault handler

Dev Jain <dev.jain@xxxxxxx> · Wed, 5 Mar 2025 20:32:49 +0530

On 05/03/25 4:16 pm, David Hildenbrand wrote:
On 05.03.25 11:21, Dev Jain wrote:
In __handle_mm_fault(),

1. Why is there a barrier() for the PUD logic?

Good question. It was added in

commit a00cc7d9dd93d66a3fb83fc52aa57a4bec51c517
Author: Matthew Wilcox <willy@xxxxxxxxxxxxx>
Date:   Fri Feb 24 14:57:02 2017 -0800

     mm, x86: add support for PUD-sized transparent hugepages

Maybe it was an alternative to performing a READ_ONCE(*vmf.pud).

Maybe it was done that way, because pudp_get_lockless() does not exist. 
And it would likely not be required, because on architectures where 
ptep_get_lockless() does some magic (see below, mostly 32bit), PUD THP 
are not applicable.

Thanks for your reply David.

2. For the PMD logic, in the if block, we use *vmf.pmd, and in the 
else block
    we use pmdp_get_lockless(); what if someone changes the pmd just 
when we
    have begun processing the conditions in the if block, fail in the 
if block
    and then the else block operates on a different pmd value. 
Shouldn't we cache
    the value of the pmd and operate on a single consistent value 
until we take the
    lock and then finally check using pxd_same() and friends?

The pmd_none(*vmf.pmd) is fine. create_huge_pmd() must be able to deal 
with races, because we are not holding any locks.

I had a mental hiccup, yes we don't need the cached value even before 
the if block, as the relevant path will eventually check after taking 
the lock. I was thinking of all sorts of weird races.

We only have to use pmdp_get_lockless() when we want to effectively 
perform a READ_ONCE(), and make sure that we read something "reasonable" 
that we can operate on, even with concurrent changes. (e.g., not read a 
garbage PFN just because of some concurrent changes)

Oh I just looked at the arm64 definition and assumed ptep_get_lockless() 
== READ_ONCE() :) Now this makes sense. So I am guessing that we can 
still get away with a *vmf.pmd on 64-bit arches?

A separate question: When taking the create_huge_pmd() path with a read 
fault and after taking the pmd lock, why shouldn't we check with 
pmd_none(pmdp_get_lockless(vmf.pmd)) instead of plain *vmf.pmd...surely 
now we must load the actual correct value from memory since we are 
committing into mapping the huge zero folio?
And after looking somewhat more, why even is a pmd_none(*pmd) there in 
set_huge_zero_folio()...it should be the responsibility of the caller to 
verify this? Any caller will just assume that it got the huge zero folio 
mapped so this check should be redundant.

We'll store the value in vmf.orig_pmd, on which we'll operate and try to 
detect later changes using pmd_same(). So we really want something 
consistent in there.

See the description above ptep_get_lockless(), why we cannot simply do a 
READ_ONCE on architectures where a PTE cannot be read atomically (e.g., 
8 byte PTEs on 32bit architecture).