Hi Steve, [adding linux-mm, since this has turned into a discussion about THP splitting] On Fri, Oct 04, 2013 at 11:31:42AM +0100, Steve Capper wrote: > On Thu, Oct 03, 2013 at 11:07:44AM -0700, Zi Shen Lim wrote: > > On Thu, Oct 3, 2013 at 10:27 AM, Will Deacon <will.deacon@xxxxxxx> wrote: > > > On Thu, Oct 03, 2013 at 06:15:15PM +0100, Zi Shen Lim wrote: > > >> Futex uses GUP. Currently on ARM, the default __get_user_pages_fast > > >> being used always returns 0, leading to a forever loop in get_futex_key :( > > > > > > This looks pretty much like an exact copy of the x86 version, which will > > > likely also result in another exact copy for arm64. Can none of this code be > > > made common? Furthermore, the fact that you've lifted the code and not > > > provided much of an explanation in the cover letter hints that you might not > > > be aware of all the subtleties involved here... [...] > > >> +static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end, > > >> + int write, struct page **pages, int *nr) > > >> +{ > > >> + unsigned long next; > > >> + pmd_t *pmdp; > > >> + > > >> + pmdp = pmd_offset(&pud, addr); > > >> + do { > > >> + pmd_t pmd = *pmdp; > > >> + > > >> + next = pmd_addr_end(addr, end); > > >> + /* > > >> + * The pmd_trans_splitting() check below explains why > > >> + * pmdp_splitting_flush has to flush the tlb, to stop > > >> + * this gup-fast code from running while we set the > > >> + * splitting bit in the pmd. Returning zero will take > > >> + * the slow path that will call wait_split_huge_page() > > >> + * if the pmd is still in splitting state. gup-fast > > >> + * can't because it has irq disabled and > > >> + * wait_split_huge_page() would never return as the > > >> + * tlb flush IPI wouldn't run. > > >> + */ > > >> + if (pmd_none(pmd) || pmd_trans_splitting(pmd)) > > >> + return 0; > > >> + if (unlikely(pmd_huge(pmd))) { > > >> + if (!gup_huge_pmd(pmd, addr, next, write, pages, nr)) > > >> + return 0; > > >> + } else { > > >> + if (!gup_pte_range(pmd, addr, next, write, pages, nr)) > > >> + return 0; > > >> + } > > >> + } while (pmdp++, addr = next, addr != end); > > > > > > ...case in point: we don't (usually) require IPIs to shoot down TLB entries > > > in SMP systems, so this is racy under thp splitting. [...] > As Will pointed out, ARM does not usually require IPIs to shoot down TLB > entries. So the local_irq_disable will not necessarily block pagetables being > freed when fast_gup is running. > > Transparent huge pages when splitting will set the pmd splitting bit then > perform a tlb invalidate, then proceed with the split. Thus a splitting THP > will not always be blocked by local_irq_disable on ARM. This does not only > affect fast_gup, futexes are also affected. From my understanding of futex on > THP tail case in kernel/futex.c, it looks like an assumption is made there also > that splitting pmds can be blocked by disabling local irqs. > > PowerPC and SPARC, like ARM do not necessarily require IPIs for TLB shootdown > either so they make use of tlb_remove_table (CONFIG_HAVE_RCU_TABLE_FREE). This > identifies pages backing pagetables that have multiple users and batches them > up, and then performs a dummy IPI before freeing them en masse. This reduces > the performance impact from the IPIs (by doing considerably fewer of them), and > guarantees that pagetables cannot be freed from under the fast_gup. > Unfortunately this also means that the fast_gup has to be aware of ptes/pmds > changing from under it. [...] > There's also the possibility of blocking without an IPI, but it's not obvious > to me how to do that (that would probably necessitate a change to > kernel/futex.c). I've just picked this up recently and am still trying to > understand it fully. The IPI solution looks like a hack to me and essentially moves the synchronisation down into the csd_lock on the splitting side as part of the cross-call to invalidate the TLB. Furthermore, the TLB doesn't even need to be invalidated afaict, since we're just updating software bits. Instead, I wonder whether this can be solved with a simple atomic_t: - The fast GUP code (read side) does something like: atomic_inc(readers); smp_mb__after_atomic_inc(); __get_user_pages_fast(...); smp_mb__before_atomic_dec(); atomic_dec(readers); - The splitting code (write side) then polls the counter to reach zero: pmd_t pmd = pmd_mksplitting(*pmdp); set_pmd_at(vma->vm_mm, address, pmdp, pmd); smp_mb(); while (atomic_read(readers) != 0) cpu_relax(); that way, we don't need to worry about IPIs, we don't need to disable interrupts on the read side and we still get away without heavyweight locking. Of course, I could well be missing something here, but what we currently have in mainline doesn't work for ARM anyway (since TLB invalidation is broadcast in hardware). Will -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>