On Tue, Jul 25, 2017 at 04:37:48PM +0900, Minchan Kim wrote: > > Ok, as you say you have reproduced this with corruption, I would suggest > > one path for dealing with it although you'll need to pass it by the > > original authors. > > > > When unmapping ranges, there is a check for dirty PTEs in > > zap_pte_range() that forces a flush for dirty PTEs which aims to avoid > > writable stale PTEs from CPU0 in a scenario like you laid out above. > > > > madvise_free misses a similar class of check so I'm adding Minchan Kim > > to the cc as the original author of much of that code. Minchan Kim will > > need to confirm but it appears that two modifications would be required. > > The first should pass in the mmu_gather structure to > > madvise_free_pte_range (at minimum) and force flush the TLB under the > > PTL if a dirty PTE is encountered. The second is that it should consider > > OTL: I couldn't read this lengthy discussion so I miss miss something. > > About MADV_FREE, I do not understand why it should flush TLB in MADV_FREE > context. MADV_FREE's semantic allows "write(ie, dirty)" so if other thread > in parallel which has stale pte does "store" to make the pte dirty, > it's okay since try_to_unmap_one in shrink_page_list catches the dirty. > In try_to_unmap_one it's fine. It's not necessarily fine in KSM. Given that the key is that data corruption is avoided, you could argue with a comment that madv_free doesn't necesssarily have to flush it as long as KSM does even if it's clean due to batching. > In above example, I think KSM should flush the TLB, not MADV_FREE and > soft dirty page hander. > That would also be acceptable. > > flushing the full affected range as madvise_free holds mmap_sem for > > read-only to avoid problems with two parallel madv_free operations. The > > second is optional because there are other ways it could also be handled > > that may have lower overhead. > > Ditto. I cannot understand. Why does two parallel MADV_FREE have a problem? > Like madvise(), madv_free can potentially return with a stale PTE visible to the caller that observed a pte_none at the time of madv_free and uses a stale PTE that potentially allows a lost write. It's debatable whether this matters considering that madv_free to a region means that parallel writers can lose their update anyway. It's less of a concern than the KSM angle outlined in Nadav's example which he was able to artifically reproduce by slowing operations to increase the race window. -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>