On Fri, Nov 20, 2020 at 07:49:22PM -0700, Yu Zhao wrote:
> On Fri, Nov 20, 2020 at 01:22:53PM -0700, Yu Zhao wrote:
> > On Fri, Nov 20, 2020 at 02:35:55PM +0000, Will Deacon wrote:
> > > Since commit 0758cd830494 ("asm-generic/tlb: avoid potential double flush"),
> > > TLB invalidation is elided in tlb_finish_mmu() if no entries were batched
> > > via the tlb_remove_*() functions. Consequently, the page-table modifications
> > > performed by clear_refs_write() in response to a write to
> > > /proc/<pid>/clear_refs do not perform TLB invalidation. Although this is
> > > fine when simply aging the ptes, in the case of clearing the "soft-dirty"
> > > state we can end up with entries where pte_write() is false, yet a
> > > writable mapping remains in the TLB.

I double checked my conclusion and I think it holds. But let me correct
some typos and add a summary.

> > I don't think we need a TLB flush in this context, same reason as we
                                ^^^^^ gather
> > don't have one in copy_present_pte(), which uses ptep_set_wrprotect()
> > to write-protect a src PTE.
> >
> > ptep_modify_prot_start/commit() and ptep_set_wrprotect() guarantee
> > either the dirty bit is set (when a PTE is still writable) or a PF
> > happens (when a PTE has become r/o) when the h/w page table walker
> > races with kernel code that modifies a PTE using the two APIs.
>
> After we remove the writable bit, if we end up with a clean PTE, any
> subsequent write will trigger a page fault. We can't have a stale
> writable tlb entry. The architecture-specific APIs guarantee this.
>
> If we end up with a dirty PTE, then yes, there will be a stale
> writable tlb entry. But this won't be a problem because when we
> write-protect a page (not PTE), we always check both pte_dirty()
> and pte_write(), i.e., write_protect_page() and page_mkclean_one().
> When they see this dirty PTE, they will flush. And generally, only
> callers of pte_mkclean() should flush tlb; otherwise we end up with
> one extra flush if callers of pte_mkclean() and pte_wrprotect() both
> flush.
>
> Now let's take a step back and see why we got
> tlb_gather/finish_mmu() here in the first place. Commit b3a81d0841a95
> ("mm: fix KSM data corruption") explains the problem clearly. But
> to fix a problem created by two threads clearing pte_write() and
> pte_dirty() independently, we only need one of them to set
> mm_tlb_flush_pending(). Given only removing the writable bit requires
                                              ^^^^^^^^ dirty
> tlb flush, that thread should be the one, as I just explained. Adding
> tlb_gather/finish_mmu() is unnecessary in that fix. And there is no
> point in having the original flush_tlb_mm() either, given data
> integrity is already guaranteed.

(i.e., writable tlb entries are flushed when removing the dirty bit.)

> Of course, with it we have more accurate access tracking.
>
> Does a similar problem exist for page_mkclean_one()? Possibly. It
> checks pte_dirty() and pte_write() but not mm_tlb_flush_pending().
> At the moment, madvise_free_pte_range() only supports anonymous
> memory, which doesn't do writeback. But the missing
> mm_tlb_flush_pending() just seems to be an accident waiting to happen.
> E.g., clean_record_pte() calls pte_mkclean() and does batched flush.
> I don't know what it's for, but if it's called on file VMAs, a similar
> race involving 4 CPUs can happen. This time CPU 1 runs
> clean_record_pte() and CPU 3 runs page_mkclean_one().

To summarize, IMO, we should
1) remove tlb_gather/finish_mmu() here;
2) check mm_tlb_flush_pending() in page_mkclean_one() and
   dax_entry_mkclean().
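
For illustration, a minimal sketch of what 1) could look like in the
CLEAR_REFS_SOFT_DIRTY path of clear_refs_write(). This is untested; the
surrounding names (mm, cp, type, clear_refs_walk_ops) follow
fs/proc/task_mmu.c circa v5.10, and the assumption, per the argument
above, is that advertising the wrprotect via inc_tlb_flush_pending()
alone, with no flush_tlb_mm() on exit, is sufficient:

        /*
         * Sketch only: replace tlb_gather_mmu()/tlb_finish_mmu() with a
         * bare pending-flush window. No flush_tlb_mm() on exit: the only
         * stale writable TLB entries left behind belong to still-dirty
         * PTEs, and callers of pte_mkclean() flush those themselves.
         */
        if (type == CLEAR_REFS_SOFT_DIRTY)
                inc_tlb_flush_pending(mm);      /* make racing cleaners flush */

        walk_page_range(mm, 0, mm->highest_vm_end, &clear_refs_walk_ops, &cp);

        if (type == CLEAR_REFS_SOFT_DIRTY)
                dec_tlb_flush_pending(mm);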
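
Similarly, a sketch of 2) for the PTE case of page_mkclean_one() in
mm/rmap.c: the existing dirty/writable test would additionally honour a
pending flush, so a clean, read-only PTE is no longer skipped while a
concurrent wrprotect is in flight. Only the mm_tlb_flush_pending()
clause is new; the rest is the current code:

        /*
         * Sketch only: don't skip a clean, read-only PTE if another
         * thread has advertised a pending TLB flush on this mm - it
         * may have left a stale writable entry in the TLB.
         */
        if (!pte_dirty(*pte) && !pte_write(*pte) &&
            !mm_tlb_flush_pending(vma->vm_mm))
                continue;

        flush_cache_page(vma, address, pte_pfn(*pte));
        entry = ptep_clear_flush(vma, address, pte);
        entry = pte_wrprotect(entry);
        entry = pte_mkclean(entry);
        set_pte_at(vma->vm_mm, address, pte, entry);

Presumably dax_entry_mkclean() would grow the same check.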