On Wednesday 14 June 2017 10:25 PM, Will Deacon wrote:
Hi Aneesh,
On Wed, Jun 14, 2017 at 08:55:26PM +0530, Aneesh Kumar K.V wrote:
On Wednesday 14 June 2017 07:21 PM, Kirill A. Shutemov wrote:
Vlastimil noted that pmdp_invalidate() is not atomic and we can loose
dirty and access bits if CPU sets them after pmdp dereference, but
before set_pmd_at().
The bug doesn't lead to user-visible misbehaviour in current kernel, but
fixing this would be critical for future work on THP: both huge-ext4 and THP
swap out rely on proper dirty tracking.
Unfortunately, there's no way to address the issue in a generic way. We need to
fix all architectures that support THP one-by-one.
All architectures that have THP supported have to provide atomic
pmdp_invalidate(). If generic implementation of pmdp_invalidate() is used,
architecture needs to provide atomic pmdp_mknonpresent().
I've fixed the issue for x86, but I need help with the rest.
So far THP is supported on 8 architectures. Power and S390 already provides
atomic pmdp_invalidate(). x86 is fixed by this patches, so 5 architectures
left:
- arc;
- arm;
- arm64;
- mips;
- sparc -- it has custom pmdp_invalidate(), but it's racy too;
Please, help me with them.
Kirill A. Shutemov (3):
x86/mm: Provide pmdp_mknotpresent() helper
mm: Do not loose dirty and access bits in pmdp_invalidate()
mm, thp: Do not loose dirty bit in __split_huge_pmd_locked()
But in __split_huge_pmd_locked() we collected the dirty bit early. So even
if we made pmdp_invalidate() atomic, if we had marked the pmd pte entry
dirty after we collected the dirty bit, we still loose it right ?
May be we should relook at pmd PTE udpate interface. We really need an
interface that can update pmd entries such that we don't clear it in
between. IMHO, we can avoid the pmdp_invalidate() completely, if we can
switch from a pmd PTE entry to a pointer to PTE page (pgtable_t). We also
need this interface to avoid the madvise race fixed by
There's a good chance I'm not following your suggestion here, but it's
probably worth me pointing out that swizzling a page table entry from a
block mapping (e.g. a huge page mapped at the PMD level) to a table entry
(e.g. a pointer to a page of PTEs) can lead to all sorts of horrible
problems on ARM, including amalgamation of TLB entries and fatal aborts.
So we really need to go via an invalid entry, with appropriate TLB
invalidation before installing the new entry.
I am not suggesting we don't do the invalidate (the need for that is
documented in __split_huge_pmd_locked(). I am suggesting we need a new
interface, something like Andrea suggested.
old_pmd = pmdp_establish(pmd_mknotpresent());
instead of pmdp_invalidate(). We can then use this in scenarios where we
want to update pmd PTE entries, where right now we go through a
pmdp_clear and set_pmd path. We should really not do that for THP entries.
W.r.t pmdp_invalidate() usage, I was wondering whether we can do that
early in __split_huge_pmd_locked().
-aneesh