Vlastimil Babka <vbabka@xxxxxxx> writes:

> On 12/16/19 2:47 PM, Peter Zijlstra wrote:
>> On Mon, Dec 16, 2019 at 02:30:44PM +0100, Vitaly Kuznetsov wrote:
>>> Peter Zijlstra <peterz@xxxxxxxxxxxxx> writes:
>>>
>>>> On Tue, Dec 17, 2019 at 02:15:48AM +0530, Ajay Kaher wrote:
>>>>> From: Vlastimil Babka <vbabka@xxxxxxx>
>>>>>
>>>>> The x86 version of get_user_pages_fast() relies on disabled
>>>>> interrupts to synchronize gup_pte_range() between gup_get_pte(ptep)
>>>>> and get_page() against a parallel munmap. The munmap side nulls the
>>>>> pte, then flushes TLBs, then releases the page. As the TLB flush is
>>>>> done synchronously via IPI, disabling interrupts blocks the page
>>>>> release, and get_page(), which assumes an existing reference on the
>>>>> page, is thus safe.
>>>>>
>>>>> However, when the TLB flush is done by a hypercall, e.g. in a Xen
>>>>> PV guest, disabled interrupts provide no such blocking, and
>>>>> get_page() can succeed on a page that was already freed or even
>>>>> reused.
>>>>>
>>>>> We have recently seen this happen with our 4.4- and 4.12-based
>>>>> kernels, with userspace (java) that exits a thread, where
>>>>> mm_release() performs a futex_wake() on tsk->clear_child_tid, and
>>>>> another thread in parallel unmaps the page that
>>>>> tsk->clear_child_tid points to. The spurious get_page() succeeds,
>>>>> but the futex code immediately releases the page again, while it's
>>>>> already on a freelist. Symptoms include a bad page state warning,
>>>>> general protection faults accessing a poisoned list prev/next
>>>>> pointer in the freelist, or free page pcplists of two cpus joined
>>>>> together in a single list. Oscar has also reproduced this scenario,
>>>>> with a patch inserting delays before the get_page() to make the
>>>>> race window larger.
>>>>>
>>>>> Fix this by removing the dependency on TLB flush interrupts the
>>>>> same way as the
>>>>
>>>> This is supposed to be fixed by:
>>>>
>>>>   arch/x86/Kconfig: select HAVE_RCU_TABLE_FREE if PARAVIRT
>>>>
>>>
>>> Yes,
>
> Well, that commit fixes the "page table can be freed under us" part. But
> this patch is about the "get_page() will succeed on a page that's being
> freed" part. Upstream fixed that unknowingly in 4.13 by a gup.c
> refactoring that would be too risky to backport fully.

(I also dislike receiving only this patch of the series; next time please
send the whole thing. It's only 8 patches, our mailfolders will survive
that.)

When I was adding Hyper-V PV TLB flush to RHEL7 - which is 3.10 based -
in addition to adding page_cache_get_speculative() to
gup_get_pte()/gup_huge_pmd()/gup_huge_pud() (see the sketch in the P.S.
below), I also had to synchronize huge PMD split against gup_fast with
the following hack:

+static void do_nothing(void *unused)
+{
+
+}
+
+static void serialize_against_pte_lookup(struct mm_struct *mm)
+{
+	smp_mb();
+	smp_call_function_many(mm_cpumask(mm), do_nothing, NULL, 1);
+}
+
 void pmdp_splitting_flush(struct vm_area_struct *vma,
 			  unsigned long address, pmd_t *pmdp)
 {
@@ -434,9 +473,10 @@ void pmdp_splitting_flush(struct vm_area_struct *vma,
 	set = !test_and_set_bit(_PAGE_BIT_SPLITTING,
 				(unsigned long *)pmdp);
 	if (set) {
 		/* need tlb flush only to serialize against gup-fast */
 		flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
+		if (pv_mmu_ops.flush_tlb_others != native_flush_tlb_others)
+			serialize_against_pte_lookup(vma->vm_mm);
 	}
 }

I'm not sure which stable kernel you're targeting (and if you addressed
this with other patches in the series, if this is needed, ...) so JFYI.

--
Vitaly
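
P.S. For completeness, the speculative-reference pattern I mean above is
roughly what upstream's generic gup.c does since the 4.13 refactoring. A
minimal sketch of the base-page case follows; it is an illustration only,
not a drop-in patch for any particular stable tree (special-page, THP and
compound-head handling omitted):

static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
			 int write, struct page **pages, int *nr)
{
	int ret = 0;
	pte_t *ptep, *ptem;

	ptem = ptep = pte_offset_map(&pmd, addr);
	do {
		pte_t pte = gup_get_pte(ptep);
		struct page *page;

		if (!pte_present(pte) || (write && !pte_write(pte)))
			goto pte_unmap;

		page = pte_page(pte);

		/* Speculative grab: fails if the refcount already hit zero. */
		if (!page_cache_get_speculative(page))
			goto pte_unmap;

		/*
		 * Re-check the pte: if a parallel munmap cleared it between
		 * gup_get_pte() and the speculative grab, drop the reference
		 * we just took and fall back to the slow path.
		 */
		if (unlikely(pte_val(pte) != pte_val(*ptep))) {
			put_page(page);
			goto pte_unmap;
		}

		pages[*nr] = page;
		(*nr)++;
	} while (ptep++, addr += PAGE_SIZE, addr != end);

	ret = 1;

pte_unmap:
	pte_unmap(ptem);
	return ret;
}

The same grab-then-recheck goes into gup_huge_pmd()/gup_huge_pud(), which
is why the PMD split side then has to serialize against the lookup: the
synchronous dummy IPI in serialize_against_pte_lookup() above completes
only once no CPU is still inside the interrupt-disabled gup_fast section.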