On Tue, Jul 25, 2017 at 09:51:32AM +0100, Mel Gorman wrote: > On Tue, Jul 25, 2017 at 04:37:48PM +0900, Minchan Kim wrote: > > > Ok, as you say you have reproduced this with corruption, I would suggest > > > one path for dealing with it although you'll need to pass it by the > > > original authors. > > > > > > When unmapping ranges, there is a check for dirty PTEs in > > > zap_pte_range() that forces a flush for dirty PTEs which aims to avoid > > > writable stale PTEs from CPU0 in a scenario like you laid out above. > > > > > > madvise_free misses a similar class of check so I'm adding Minchan Kim > > > to the cc as the original author of much of that code. Minchan Kim will > > > need to confirm but it appears that two modifications would be required. > > > The first should pass in the mmu_gather structure to > > > madvise_free_pte_range (at minimum) and force flush the TLB under the > > > PTL if a dirty PTE is encountered. The second is that it should consider > > > > OTL: I couldn't read this lengthy discussion so I miss miss something. > > > > About MADV_FREE, I do not understand why it should flush TLB in MADV_FREE > > context. MADV_FREE's semantic allows "write(ie, dirty)" so if other thread > > in parallel which has stale pte does "store" to make the pte dirty, > > it's okay since try_to_unmap_one in shrink_page_list catches the dirty. > > > > In try_to_unmap_one it's fine. It's not necessarily fine in KSM. Given > that the key is that data corruption is avoided, you could argue with a > comment that madv_free doesn't necesssarily have to flush it as long as > KSM does even if it's clean due to batching. Yes, I think it should be done in side where have a concern. Maybe, mm_struct can carry a flag which indicates someone is doing the TLB bacthing and then KSM side can flush it by the flag. It would reduce unncessary flushing. > > > In above example, I think KSM should flush the TLB, not MADV_FREE and > > soft dirty page hander. > > > > That would also be acceptable. > > > > flushing the full affected range as madvise_free holds mmap_sem for > > > read-only to avoid problems with two parallel madv_free operations. The > > > second is optional because there are other ways it could also be handled > > > that may have lower overhead. > > > > Ditto. I cannot understand. Why does two parallel MADV_FREE have a problem? > > > > Like madvise(), madv_free can potentially return with a stale PTE visible > to the caller that observed a pte_none at the time of madv_free and uses > a stale PTE that potentially allows a lost write. It's debatable whether That is the part I cannot understand. How does it lost "the write"? MADV_FREE doesn't discard the memory so finally, the write should be done sometime. Could you tell me more? Thanks. > this matters considering that madv_free to a region means that parallel > writers can lose their update anyway. It's less of a concern than the > KSM angle outlined in Nadav's example which he was able to artifically > reproduce by slowing operations to increase the race window. > > -- > Mel Gorman > SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>