On Fri, Nov 20, 2020 at 01:22:53PM -0700, Yu Zhao wrote: > On Fri, Nov 20, 2020 at 02:35:55PM +0000, Will Deacon wrote: > > Since commit 0758cd830494 ("asm-generic/tlb: avoid potential double flush"), > > TLB invalidation is elided in tlb_finish_mmu() if no entries were batched > > via the tlb_remove_*() functions. Consequently, the page-table modifications > > performed by clear_refs_write() in response to a write to > > /proc/<pid>/clear_refs do not perform TLB invalidation. Although this is > > fine when simply aging the ptes, in the case of clearing the "soft-dirty" > > state we can end up with entries where pte_write() is false, yet a > > writable mapping remains in the TLB. > > I don't think we need a TLB flush in this context, same reason as we > don't have one in copy_present_pte() which uses ptep_set_wrprotect() > to write-protect a src PTE. > > ptep_modify_prot_start/commit() and ptep_set_wrprotect() guarantee > either the dirty bit is set (when a PTE is still writable) or a PF > happens (when a PTE has become r/o) when h/w page table walker races > with kernel that modifies a PTE using the two APIs. After we remove the writable bit, if we end up with a clean PTE, any subsequent write will trigger a page fault. We can't have a stale writable tlb entry. The architecture-specific APIs guarantee this. If we end up with a dirty PTE, then yes, there will be a stale writable tlb entry. But this won't be a problem because when we write-protect a page (not PTE), we always check both pte_dirty() and pte_write(), i.e., write_protect_page() and page_mkclean_one(). When they see this dirty PTE, they will flush. And generally, only callers of pte_mkclean() should flush tlb; otherwise we end up one extra if callers of pte_mkclean() and pte_wrprotect() both flush. Now let's take a step back and see why we got tlb_gather/finish_mmu() here in the first place. Commit b3a81d0841a95 ("mm: fix KSM data corruption") explains the problem clearly. But to fix a problem created by two threads clearing pte_write() and pte_dirty() independently, we only need one of them to set mm_tlb_flush_pending(). Given only removing the writable bit requires tlb flush, that thread should be the one, as I just explained. Adding tlb_gather/finish_mmu() is unnecessary in that fix. And there is no point in having the original flush_tlb_mm() either, given data integrity is already guaranteed. Of course, with it we have more accurate access tracking. Does a similar problem exist for page_mkclean_one()? Possibly. It checks pte_dirty() and pte_write() but not mm_tlb_flush_pending(). At the moment, madvise_free_pte_range() only supports anonymous memory, which doesn't do writeback. But the missing mm_tlb_flush_pending() just seems to be an accident waiting to happen. E.g., clean_record_pte() calls pte_mkclean() and does batched flush. I don't know what it's for, but if it's called on file VMAs, a similar race involving 4 CPUs can happen. This time CPU 1 runs clean_record_pte() and CPU 3 runs page_mkclean_one().