On Fri, Nov 20, 2020 at 07:49:22PM -0700, Yu Zhao wrote:
> On Fri, Nov 20, 2020 at 01:22:53PM -0700, Yu Zhao wrote:
> > On Fri, Nov 20, 2020 at 02:35:55PM +0000, Will Deacon wrote:
> > > Since commit 0758cd830494 ("asm-generic/tlb: avoid potential double flush"),
> > > TLB invalidation is elided in tlb_finish_mmu() if no entries were batched
> > > via the tlb_remove_*() functions. Consequently, the page-table modifications
> > > performed by clear_refs_write() in response to a write to
> > > /proc/<pid>/clear_refs do not perform TLB invalidation. Although this is
> > > fine when simply aging the ptes, in the case of clearing the "soft-dirty"
> > > state we can end up with entries where pte_write() is false, yet a
> > > writable mapping remains in the TLB.

I double checked my conclusion and I think it holds. But let me correct
some typos and add a summary.

> > I don't think we need a TLB flush in this context, same reason as we
                                ^^^^^ gather
> > don't have one in copy_present_pte(), which uses ptep_set_wrprotect()
> > to write-protect a src PTE.
> >
> > ptep_modify_prot_start/commit() and ptep_set_wrprotect() guarantee
> > either the dirty bit is set (when a PTE is still writable) or a PF
> > happens (when a PTE has become r/o) when the h/w page table walker
> > races with kernel code that modifies a PTE using the two APIs.
>
> After we remove the writable bit, if we end up with a clean PTE, any
> subsequent write will trigger a page fault. We can't have a stale
> writable tlb entry. The architecture-specific APIs guarantee this.
>
> If we end up with a dirty PTE, then yes, there will be a stale
> writable tlb entry. But this won't be a problem because when we
> write-protect a page (not PTE), we always check both pte_dirty()
> and pte_write(), i.e., write_protect_page() and page_mkclean_one().
> When they see this dirty PTE, they will flush. And generally, only
> callers of pte_mkclean() should flush tlb; otherwise we end up with
> one extra flush if callers of pte_mkclean() and pte_wrprotect() both
> flush.
>
> Now let's take a step back and see why we got
> tlb_gather/finish_mmu() here in the first place. Commit b3a81d0841a95
> ("mm: fix KSM data corruption") explains the problem clearly. But
> to fix a problem created by two threads clearing pte_write() and
> pte_dirty() independently, we only need one of them to set
> mm_tlb_flush_pending(). Given only removing the writable bit requires
                                              ^^^^^^^^ dirty
> tlb flush, that thread should be the one, as I just explained. Adding
> tlb_gather/finish_mmu() is unnecessary in that fix. And there is no
> point in having the original flush_tlb_mm() either, given data
> integrity is already guaranteed.

(i.e., writable tlb entries are flushed when removing the dirty bit.)

> Of course, with it we have more accurate access tracking.
>
> Does a similar problem exist for page_mkclean_one()? Possibly. It
> checks pte_dirty() and pte_write() but not mm_tlb_flush_pending().
> At the moment, madvise_free_pte_range() only supports anonymous
> memory, which doesn't do writeback. But the missing
> mm_tlb_flush_pending() just seems to be an accident waiting to happen.
> E.g., clean_record_pte() calls pte_mkclean() and does batched flush.
> I don't know what it's for, but if it's called on file VMAs, a similar
> race involving 4 CPUs can happen. This time CPU 1 runs
> clean_record_pte() and CPU 3 runs page_mkclean_one().

To summarize, IMO, we should
1) remove tlb_gather/finish_mmu() here;
2) check mm_tlb_flush_pending() in page_mkclean_one() and
   dax_entry_mkclean().
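
For illustration, a minimal sketch of what 1) could look like in the
CLEAR_REFS_SOFT_DIRTY path of clear_refs_write(). This is untested; the
surrounding names (mm, cp, type, clear_refs_walk_ops) follow
fs/proc/task_mmu.c circa v5.10, and the assumption, per the argument
above, is that advertising the wrprotect via inc_tlb_flush_pending()
alone, with no flush_tlb_mm() on exit, is sufficient:

        /*
         * Sketch only: replace tlb_gather_mmu()/tlb_finish_mmu() with a
         * bare pending-flush window. No flush_tlb_mm() on exit: the only
         * stale writable TLB entries left behind belong to still-dirty
         * PTEs, and callers of pte_mkclean() flush those themselves.
         */
        if (type == CLEAR_REFS_SOFT_DIRTY)
                inc_tlb_flush_pending(mm);      /* make racing cleaners flush */

        walk_page_range(mm, 0, mm->highest_vm_end, &clear_refs_walk_ops, &cp);

        if (type == CLEAR_REFS_SOFT_DIRTY)
                dec_tlb_flush_pending(mm);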
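
Similarly, a sketch of 2) for the PTE case of page_mkclean_one() in
mm/rmap.c: the existing dirty/writable test would additionally honour a
pending flush, so a clean, read-only PTE is no longer skipped while a
concurrent wrprotect is in flight. Only the mm_tlb_flush_pending()
clause is new; the rest is the current code:

        /*
         * Sketch only: don't skip a clean, read-only PTE if another
         * thread has advertised a pending TLB flush on this mm - it
         * may have left a stale writable entry in the TLB.
         */
        if (!pte_dirty(*pte) && !pte_write(*pte) &&
            !mm_tlb_flush_pending(vma->vm_mm))
                continue;

        flush_cache_page(vma, address, pte_pfn(*pte));
        entry = ptep_clear_flush(vma, address, pte);
        entry = pte_wrprotect(entry);
        entry = pte_mkclean(entry);
        set_pte_at(vma->vm_mm, address, pte, entry);

Presumably dax_entry_mkclean() would grow the same check.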