On Wed, 19 Jun 2024, Hugh Dickins wrote:
>
> and on second attempt, then a VM_BUG_ON_FOLIO(!folio_contains) from
> find_lock_entries().
>
> Or maybe that VM_BUG_ON_FOLIO() was unrelated, but a symptom of the bug
> I'm trying to chase even when this series is reverted:

Yes, I doubt now that the VM_BUG_ON_FOLIO(!folio_contains) was related
to Baolin's series: much more likely to be an instance of other problems.

> some kind of page
> double usage, manifesting as miscellaneous "Bad page"s and VM_BUG_ONs,
> mostly from page reclaim or from exit_mmap(). I'm still getting a feel
> for it, maybe it occurs soon enough for a reliable bisection, maybe not.
>
> (While writing, a run with mm-unstable cut off at 2a9964cc5d27,
> drop KSM_KMEM_CACHE(), instead of reverting just Baolin's latest,
> has not yet hit any problem: too early to tell but promising.)

Yes, that ran without trouble for many hours on two machines. I didn't
do a formal bisection, but did appear to narrow it down convincingly to
Barry's folio_add_new_anon_rmap() series: crashes soon on both machines
with Barry's in but Baolin's out, no crashes with both out.

Yet while I was studying Barry's patches trying to explain it, one
of the machines did at last crash: it's as if Barry's has opened a
window which makes these crashes more likely, but not itself to blame.

I'll go back to studying that crash now: two CPUs crashed about the
same time, perhaps they interacted and give a hint at root cause.

(I do have doubts about Barry's: the "_new" in folio_add_new_anon_rmap()
was all about optimizing a known-exclusive case, so it surprises me
to see it being extended to non-exclusive; and I worry over how its
atomic_set(&page->_mapcount, 0)s can be safe when non-exclusive (but
I've never caught up with David's exclusive changes, I'm out of date).

But even if those are wrong, I'd expect them to tend towards a mapped
page becoming unreclaimable, then "Bad page map" when munmapped,
not to any of the double-free symptoms I've actually seen.)

>
> And before 2024-06-18, I was working on mm-everything-2024-06-15 minus
> Chris Li's mTHP swap series: which worked fairly well, until it locked
> up with __try_to_reclaim_swap()'s filemap_get_folio() spinning around
> on a page with 0 refcount, while a page table lock is held which one
> by one the other CPUs come to want for reclaim. On two machines.

I've not seen that symptom at all since 2024-06-15: intriguing,
but none of us can afford the time to worry about vanished bugs.

Hugh