On Wed, Sep 23, 2020 at 6:39 AM Gerald Schaefer
<gerald.schaefer@xxxxxxxxxxxxx> wrote:
>
> OK, I can now reproduce this, and unfortunately also with the gup_fast
> fix, so it is something different. Bisecting is a bit hard, as it will
> not always show immediately, sometimes takes up to an hour.
>
> Still, I think I found the culprit, merge commit b25d1dc9474e "Merge
> branch 'simplify-do_wp_page'". Without those 4 patches, it works fine,
> running over night.

Odd, but I have a strong suspicion that the "do_wp_page()
simplification" only ends up removing serialization that then hides
some existing thing.

> Not sure why this only shows on s390, should not be architecture-specific,
> but we do often see subtle races earlier than others due to hypervisor
> impact.

Yeah, so if it needs very particular timing, maybe the s390 page table
handling together with the hypervisor interfaces ends up being more
likely to trigger this, and thus the different timings at do_wp_page()
then end up showing it.

> One thing that seems strange to me is that the page flags from the
> bad page state output are (uptodate|swapbacked), see below, or
> (referenced|uptodate|dirty|swapbacked) in the original report. But IIUC,
> that should not qualify for the "PAGE_FLAGS_CHECK_AT_FREE flag(s) set"
> reason. So it seems that the flags have changed between check_free_page()
> and __dump_page(), which would be very odd. Or maybe some issue with
> compound pages, because __dump_page() looks at head->flags.

The odd thing is that all of this _should_ be serialized by the page
table lock, as far as I can tell.

From your trace, it looks very much like it's do_madvise() ->
zap_pte_range() (your stack trace only has zap_p4d_range mentioned,
but all the lower levels are inlined) that races with presumably
fast-gup.

But zap_pte_range() has the pte lock, and fast-gup is - by design -
not allowed to change the page state other than taking a reference to
it, and should do that with a "try_get" operation, so even taking the
reference should never ever race with somebody doing the final free.

IOW, the fast-GUP code does this:

        page = pte_page(pte);
        head = try_grab_compound_head(page, 1, flags);
        if (!head)
                goto pte_unmap;

        if (unlikely(pte_val(pte) != pte_val(*ptep))) {
                put_compound_head(head, 1, flags);
                goto pte_unmap;
        }

where the important part is "try_grab_compound_head()", which does
the whole careful atomic "increase page count only if it wasn't zero"
thing. See page_cache_add_speculative().

So the rule is

 - if you hold the page table lock, you can just do
   "get_page(pte_page())" directly, because you know the pte cannot
   go away from under you

 - if you are fast-gup, the pte *can* go away from under you, so you
   need to do that very careful "get page unless it's gone" dance

but I don't see us violating that.

There's maybe some interesting memory ordering in the above case, but
it does atomic_add_unless() which is ordered, and s390 is strongly
ordered anyway, isn't it?

(Yes, and it doesn't do the atomic stuff at all if TINY_RCU is set,
but that's only set for non-preemptible UP kernels, so that doesn't
matter).

So if zap_page_range() races with fast-gup, then either
zap_page_range() wins the race and puts the page - but then fast-gup
won't touch it - or fast-gup wins and gets a reference to the page,
and then zap_page_range() will clear it and drop the ref to it, but
it won't be the final ref.
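To make that "careful dance" concrete, here's a minimal sketch of the
two cases - not the actual mm/gup.c code, and using the plain
get_page_unless_zero() where the real code uses the compound-head
variants:

        #include <linux/mm.h>

        /* Under the page table lock: the pte cannot go away, so the
         * page refcount is known to be non-zero and a plain
         * get_page() is fine. */
        static struct page *grab_locked(pte_t pte)
        {
                struct page *page = pte_page(pte);

                get_page(page);
                return page;
        }

        /* Lockless fast-gup: the page may already be on its way to
         * the final free, so only take a reference if the count
         * wasn't zero, and then re-check that the pte didn't change
         * under us. */
        static struct page *grab_speculative(pte_t pte, pte_t *ptep)
        {
                struct page *page = pte_page(pte);

                if (!get_page_unless_zero(page))
                        return NULL;    /* lost: final free in progress */

                if (unlikely(pte_val(pte) != pte_val(*ptep))) {
                        put_page(page); /* pte changed: back off */
                        return NULL;
                }
                return page;
        }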
Your dump seems to show that zap_page_range() *did* drop the final
ref, but something is racing with it to the point of actually
modifying the page flags. Odd.

And the do_wp_page() change itself shouldn't be directly involved,
because that's all done under the page table lock. But it obviously
does change the page locking a lot, and changes timing a lot.

And in fact, the zap_pte_range() code itself doesn't take the page
lock (and cannot, because it's all under the page table spinlock). So
it does smell like timing to me. But possibly with some s390-specific
twist to it.

Ooh.

One thing that is *very* different about s390 is that it frees the
page directly, and doesn't batch things up to happen after the TLB
flush.

Maybe THAT is the difference? Not that I can tell why it should
matter, for all the reasons outlined above.

But on x86-64, the __tlb_remove_page() function just adds the page to
the "free this later" TLB flush structure, and if it fills up it does
the TLB flush and then does the actual batched page freeing outside
the page table lock.

And that *has* been one of the things that the fast-gup code depended
on. We even have a big comment about it:

        /*
         * Disable interrupts. The nested form is used, in order to allow
         * full, general purpose use of this routine.
         *
         * With interrupts disabled, we block page table pages from being
         * freed from under us. See struct mmu_table_batch comments in
         * include/asm-generic/tlb.h for more details.
         *
         * We do not adopt an rcu_read_lock(.) here as we also want to
         * block IPIs that come from THPs splitting.
         */

and maybe that whole thing doesn't hold true for s390 at all.
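Schematically, that generic batching is something like this - my
hand-wavy simplification, not the real mm/mmu_gather.c code, though
the field names are close to the real ones:

        #include <asm/tlb.h>
        #include <linux/swap.h>

        /* Queue the page for freeing later; don't free it yet */
        static bool __tlb_remove_page_sketch(struct mmu_gather *tlb,
                                             struct page *page)
        {
                struct mmu_gather_batch *batch = tlb->active;

                batch->pages[batch->nr++] = page;

                /* tell the caller to flush once the batch is full */
                return batch->nr == batch->max;
        }

        static void tlb_flush_mmu_sketch(struct mmu_gather *tlb)
        {
                /* TLB flush first, while the pages still exist... */
                tlb_flush(tlb);

                /* ...then the batched frees, outside the pte lock */
                free_pages_and_swap_cache(tlb->active->pages,
                                          tlb->active->nr);
                tlb->active->nr = 0;
        }

The s390 side, by contrast, frees the page right away, with no batch
in between.

               Linus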