Re: Potential race in TLB flush batching?

Minchan Kim <minchan@xxxxxxxxxx> · Thu, 27 Jul 2017 08:44:54 +0900

Hi Mel,

On Wed, Jul 26, 2017 at 10:22:28AM +0100, Mel Gorman wrote:
> On Wed, Jul 26, 2017 at 02:43:06PM +0900, Minchan Kim wrote:
> > > I'm relying on the fact you are the madv_free author to determine if
> > > it's really necessary. The race in question is CPU 0 running madv_free
> > > and updating some PTEs while CPU 1 is also running madv_free and looking
> > > at the same PTEs. CPU 1 may have writable TLB entries for a page but fail
> > > the pte_dirty check (because CPU 0 has updated it already) and potentially
> > > fail to flush. Hence, when madv_free on CPU 1 returns, there are still
> > > potentially writable TLB entries and the underlying PTE is still present
> > > so that a subsequent write does not necessarily propagate the dirty bit
> > > to the underlying PTE any more. Reclaim at some unknown time in the future
> > > may then see that the PTE is still clean and discard the page even though
> > > a write has happened in the meantime. I think this is possible but I could
> > > have missed some protection in madv_free that prevents it happening.
> > 
> > Thanks for the detail. You didn't miss anything. It can happen, and then
> > it's really a bug. IOW, if an application writes something after madv_free,
> > it must see the written value, not zero.
> > 
> > How about adding [set|clear]_tlb_flush_pending to the TLB batching interface?
> > With it, when tlb_finish_mmu is called, we can tell that we skipped the
> > flush while another flush is still pending, and flush forcefully to cover
> > the madv_dontneed as well as the madv_free scenario.
> > 
> 
> I *think* this is ok, as it's simply more expensive on the KSM side in
> the event of a race, but no other harmful change is made, assuming that
> KSM is the only race-prone user. The check for mm_tlb_flush_pending also
> happens under the PTL, so there should be sufficient protection for the
> mm struct update to be visible at the right time.
> 
> Check, using the test program from "mm: Always flush VMA ranges affected
> by zap_page_range v2", whether it handles the madvise case as well, as that
> would give some degree of safety. Make sure it's tested against 4.13-rc2
> instead of mmotm, which already includes the madv_dontneed fix. If yours
> works for both, then it supersedes the mmotm patch.

Okay, I will test it on 4.13-rc2 + Nadav's atomic tlb_flush_pending
+ my patch fixing the partial flush problem pointed out by Nadav.
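A minimal, single-threaded user-space model of the madv_free race and of the proposed fix may help. It assumes the fix takes the shape discussed in this thread: every batch bumps a per-mm pending counter (standing in for the atomic tlb_flush_pending), and the finishing step forces a flush if it sees another batch still in flight. This is not kernel code. All names (struct model, batch_begin, lazyfree_pte, batch_finish, user_write, run_race) are hypothetical. The model tracks one PTE and one cached TLB entry on CPU 1, and it treats a TLB hit on a dirty, writable entry as not touching the PTE.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical toy model: one PTE plus one possibly stale TLB entry. */
struct model {
    bool pte_dirty;            /* dirty bit in the page table entry */
    bool tlb_writable;         /* a dirty, writable translation is cached */
    atomic_int flush_pending;  /* batches in flight (stand-in for tlb_flush_pending) */
};

/* Per-CPU batch state: did this batch change a PTE it must flush? */
struct batch {
    bool needs_flush;
};

static void batch_begin(struct model *m, struct batch *b)
{
    atomic_fetch_add(&m->flush_pending, 1);
    b->needs_flush = false;
}

/* madv_free-like step: clear the dirty bit and defer the TLB flush. */
static void lazyfree_pte(struct model *m, struct batch *b)
{
    if (m->pte_dirty) {
        m->pte_dirty = false;
        b->needs_flush = true;  /* flush is deferred to batch_finish() */
    }
}

static void batch_finish(struct model *m, struct batch *b, bool use_fix)
{
    /* With the fix: another batch is pending, so a clean PTE proves
     * nothing about stale TLB entries; flush anyway. */
    bool force = use_fix && atomic_load(&m->flush_pending) > 1;

    if (b->needs_flush || force)
        m->tlb_writable = false;
    atomic_fetch_sub(&m->flush_pending, 1);
}

static void user_write(struct model *m)
{
    if (!m->tlb_writable) {
        /* TLB miss: the page walk sets the dirty bit and refills the TLB. */
        m->pte_dirty = true;
        m->tlb_writable = true;
    }
    /* TLB hit on a dirty, writable entry: the PTE is left untouched. */
}

/* Replays the interleaving; returns true if the write survives reclaim. */
static bool run_race(bool use_fix)
{
    struct model m = { .pte_dirty = true, .tlb_writable = true };
    struct batch cpu0, cpu1;

    atomic_init(&m.flush_pending, 0);

    batch_begin(&m, &cpu0);
    lazyfree_pte(&m, &cpu0);          /* CPU 0 clears dirty, defers flush */
    batch_begin(&m, &cpu1);
    lazyfree_pte(&m, &cpu1);          /* CPU 1 sees a clean PTE */
    batch_finish(&m, &cpu1, use_fix); /* madv_free on CPU 1 returns */
    user_write(&m);                   /* application writes after madv_free */
    batch_finish(&m, &cpu0, use_fix); /* CPU 0 finally flushes */

    assert(atomic_load(&m.flush_pending) == 0);
    return m.pte_dirty;               /* reclaim discards the page if clean */
}
```

In this model run_race(false) returns false (the write is lost at reclaim) and run_race(true) returns true. It deliberately ignores the PTL, real concurrency and flush ranges; it only shows why a batch that found nothing to flush cannot assume there is no stale writable TLB entry while another batch is pending.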

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .