On Tue, May 13, 2014 at 03:44:55PM -0700, Dave Hansen wrote: > On 05/14/2014 02:29 AM, Anthony Iliopoulos wrote: > > The invalidation is required in order to maintain proper semantics > > under CoW conditions. In scenarios where a process clones several > > threads, a thread operating on a core whose DTLB entry for a > > particular hugepage has not been invalidated, will be reading from > > the hugepage that belongs to the forked child process, even after > > hugetlb_cow(). > > > > The thread will not see the updated page as long as the stale DTLB > > entry remains cached, the thread attempts to write into the page, > > the child process exits, or the thread gets migrated to a different > > processor. > > No to be too nitpicky, but this applies to ITLB too, right? Quite true, this does apply to ITLB too. I suppose there might be cases like self-modifying code that touches the private text pages, or some JIT compiler writing on mmap-ed executable regions that would observe the same behavior under similar conditions. > I believe this bug came all the way back from: > > > commit 1e8f889b10d8d2223105719e36ce45688fedbd59 > > Author: David Gibson <david@xxxxxxxxxxxxxxxxxxxxx> > > Date: Fri Jan 6 00:10:44 2006 -0800 > > > > [PATCH] Hugetlb: Copy on Write support > > It was probably the first time that we ever changed an _existing_ > hugetlbfs pte, and that patch probably just missed the TLB flush because > none of the other pte-setting hugetlb.c code needed TLB flushes. This seems to be the case, I assume that our internal use case was also probably the first that needed to rely on proper semantics under a multithreaded CoW scenario with hugetlb pages, so this came into the surface. I have actually also wondered about another related thing: even when we actually do invalidate the page, there may still be a race between a thread on a core that reads the page in some tight loop (e.g. on a spinlock), and the page fault handler running on a different core, at the point where the pte is set. Since we invalidate the page via the TLB shootdowns *before* we update the pte (this is true for all do_wp_page(), do_huge_pmd_wp_page() as well as hugetlb_cow()), there may be some tiny window where the thread might re-read the page before the pte is set. I have dismissed this case, since I assume that there are many more cycles spent in servicing the TLB invalidation IPI, walking the pgtable plus other related overhead (e.g. sched) than in updating the pte/pmd so I am not sure how possible it would be to hit this condition. I am also hesitant to simply submit a patch that reverses the order of flushing and setting the pte, due to the following commit: commit 4ce072f1faf29d24df4600f53db8cdd62d400a8f Author: Siddha, Suresh B <suresh.b.siddha@xxxxxxxxx> Date: Fri Sep 29 01:58:42 2006 -0700 [PATCH] mm: fix a race condition under SMC + COW > The bogus x86 version of huge_ptep_clear_flush() came from the s390 > guys, so double-shame on IBM! :P > > > commit 8fe627ec5b7c47b1654dff50536d9709863295a3 > > Author: Gerald Schaefer <gerald.schaefer@xxxxxxxxxx> > > Date: Mon Apr 28 02:13:28 2008 -0700 > > > > hugetlbfs: add missing TLB flush to hugetlb_cow() > > This is probably an opportunity for all the other architecture > maintainers to make sure that they have proper copies of > huge_ptep_clear_flush(). > > I went through the hugetlb code on x86 and couldn't find another TLB > flush that fixes this issue, and I believe this is correct, so: > > Acked-by: Dave Hansen <dave.hansen@xxxxxxxxx> Many thanks for confirming, and for your prompt response. Regards, Anthony -- To unsubscribe from this list: send the line "unsubscribe linux-arch" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html