Re: [PATCH 4/6] mm: proc: Invalidate TLB after clearing soft-dirty page state

Yu Zhao <yuzhao@xxxxxxxxxx> · Fri, 20 Nov 2020 19:49:22 -0700

On Fri, Nov 20, 2020 at 01:22:53PM -0700, Yu Zhao wrote:
> On Fri, Nov 20, 2020 at 02:35:55PM +0000, Will Deacon wrote:
> > Since commit 0758cd830494 ("asm-generic/tlb: avoid potential double flush"),
> > TLB invalidation is elided in tlb_finish_mmu() if no entries were batched
> > via the tlb_remove_*() functions. Consequently, the page-table modifications
> > performed by clear_refs_write() in response to a write to
> > /proc/<pid>/clear_refs do not perform TLB invalidation. Although this is
> > fine when simply aging the ptes, in the case of clearing the "soft-dirty"
> > state we can end up with entries where pte_write() is false, yet a
> > writable mapping remains in the TLB.
> 
> I don't think we need a TLB flush in this context, same reason as we
> don't have one in copy_present_pte() which uses ptep_set_wrprotect()
> to write-protect a src PTE.
> 
> ptep_modify_prot_start/commit() and ptep_set_wrprotect() guarantee
> either the dirty bit is set (when a PTE is still writable) or a PF
> happens (when a PTE has become r/o) when h/w page table walker races
> with kernel that modifies a PTE using the two APIs.

After we remove the writable bit, if we end up with a clean PTE, any
subsequent write will trigger a page fault. We can't have a stale
writable tlb entry. The architecture-specific APIs guarantee this.

If we end up with a dirty PTE, then yes, there will be a stale
writable tlb entry. But this won't be a problem because when we
write-protect a page (not PTE), we always check both pte_dirty()
and pte_write(), i.e., write_protect_page() and page_mkclean_one().
When they see this dirty PTE, they will flush. And generally, only
callers of pte_mkclean() should flush tlb; otherwise we end up one
extra if callers of pte_mkclean() and pte_wrprotect() both flush.

Now let's take a step back and see why we got
tlb_gather/finish_mmu() here in the first place. Commit b3a81d0841a95
("mm: fix KSM data corruption") explains the problem clearly. But
to fix a problem created by two threads clearing pte_write() and
pte_dirty() independently, we only need one of them to set
mm_tlb_flush_pending(). Given only removing the writable bit requires
tlb flush, that thread should be the one, as I just explained. Adding
tlb_gather/finish_mmu() is unnecessary in that fix. And there is no
point in having the original flush_tlb_mm() either, given data
integrity is already guaranteed. Of course, with it we have more
accurate access tracking.

Does a similar problem exist for page_mkclean_one()? Possibly. It
checks pte_dirty() and pte_write() but not mm_tlb_flush_pending().
At the moment, madvise_free_pte_range() only supports anonymous
memory, which doesn't do writeback. But the missing
mm_tlb_flush_pending() just seems to be an accident waiting to happen.
E.g., clean_record_pte() calls pte_mkclean() and does batched flush.
I don't know what it's for, but if it's called on file VMAs, a similar
race involving 4 CPUs can happen. This time CPU 1 runs
clean_record_pte() and CPU 3 runs page_mkclean_one().