On 3/22/2023 8:03 PM, Ryan Roberts wrote:
> Hi Matthew,
>
> On 17/03/2023 10:57, Ryan Roberts wrote:
>> Hi All,
>>
>> [...]
>>
>> Bug(s)
>> ======
>>
>> When I run this code without the last (workaround) patch, with DEBUG_VM et al,
>> PROVE_LOCKING and KASAN enabled, I see occasional oopses. Mostly these relate
>> to invalid kernel addresses (which usually look like either NULL + a small
>> offset, or mostly zeros with a few mid-order bits set + a small offset), or
>> lockdep complaining about a bad unlock balance. Call stacks are often in
>> madvise_free_pte_range(), but I've seen them in filesystem code too. (I can
>> email example oopses out separately if anyone wants to review them.) My hunch
>> is that struct pages adjacent to the folio are being corrupted, but I don't
>> have hard evidence.
>>
>> When adding the workaround patch, which prevents madvise_free_pte_range() from
>> attempting to split a large folio, I never see any issues. Although I'm not
>> putting the system under memory pressure, so I guess I might see the same
>> types of problem crop up under swap, etc.
>>
>> I've reviewed most of the code within split_folio() and can't find any smoking
>> gun, but I wonder if there are implicit assumptions about the large folio
>> being PMD-sized that I'm now obviously breaking?
>>
>> The code in madvise_free_pte_range():
>>
>> if (folio_test_large(folio)) {
>>         if (folio_mapcount(folio) != 1)
>>                 goto out;
>>         folio_get(folio);
>>         if (!folio_trylock(folio)) {
>>                 folio_put(folio);
>>                 goto out;
>>         }
>>         pte_unmap_unlock(orig_pte, ptl);
>>         if (split_folio(folio)) {
>>                 folio_unlock(folio);
>>                 folio_put(folio);
>>                 orig_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
>>                 goto out;
>>         }
>>         ...
>> }
>
> I've noticed that it is split_folio() with a folio order of 1 that causes my
> problems. And I also see that the page cache code explicitly never allocates
> order-1 folios:
>
> void page_cache_ra_order(struct readahead_control *ractl,
>                 struct file_ra_state *ra, unsigned int new_order)
> {
>         ...
>
>         while (index <= limit) {
>                 unsigned int order = new_order;
>
>                 /* Align with smaller pages if needed */
>                 if (index & ((1UL << order) - 1)) {
>                         order = __ffs(index);
>                         if (order == 1)
>                                 order = 0;
>                 }
>                 /* Don't allocate pages past EOF */
>                 while (index + (1UL << order) - 1 > limit) {
>                         if (--order == 1)
>                                 order = 0;
>                 }
>                 err = ra_alloc_folio(ractl, index, mark, order, gfp);
>                 if (err)
>                         break;
>                 index += 1UL << order;
>         }
>
>         ...
> }
>
> Matthew, what is the reason for this? I suspect it's guarding against the same
> problem I'm seeing.
>
> If I explicitly prevent order-1 allocations for anon pages, I'm unable to cause
> any oops/panic/etc. I'd just like to understand the root cause.

I checked the struct folio definition. The _deferred_list lives in the third
struct page (i.e. the second tail page). My understanding is that, to support
folio split, the folio order must be >= 2. A rough sketch of why is at the
bottom of this mail. Thanks.

Regards
Yin, Fengwei

>
> Thanks,
> Ryan
>
>
>
>>
>> This will normally skip my large folios because they have a mapcount > 1, due
>> to the mapcount being incremented for each pte, unlike PMD-mapped pages. But
>> on occasion it will see a mapcount of 1 and proceed. So I guess this is racing
>> against reclaim or CoW in this case?
>>
>> I also see it's doing a dance to take the folio lock and drop the ptl. Perhaps
>> my large anon folio is not using the folio lock in the same way as a THP
>> would, and we are therefore not getting the expected serialization?
>>
>> I'd really appreciate any suggestions for how to progress here!
>>
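To make the _deferred_list point concrete, here is a rough standalone sketch.
It is a userspace toy, not the real mm_types.h definitions: the cut-down
struct page, the vmemmap[] array and the local page_deferred_list() helper
below are simplified stand-ins, modelled on the kernel helper that returns
&page[2].deferred_list (a field of the second tail page):

#include <stdio.h>

struct list_head {
        struct list_head *next, *prev;
};

/* Toy stand-in for struct page: just enough to show the overlay. */
struct page {
        unsigned long flags;
        union {
                unsigned long compound_head;            /* any tail page */
                struct list_head deferred_list;         /* second tail page only */
        };
};

/* Simplified model of page_deferred_list(): the list is in the third struct page. */
static struct list_head *page_deferred_list(struct page *page)
{
        return &page[2].deferred_list;
}

int main(void)
{
        /*
         * A folio of order N owns struct pages [0, 1 << N). The deferred
         * list sits in page[2], so a split/free path that touches it needs
         * 1 << N > 2, i.e. N >= 2. For order-0 and order-1, page[2] already
         * belongs to whatever comes next in the memmap.
         */
        struct page vmemmap[8] = { { 0 } };     /* pretend slice of the memmap */
        char *list = (char *)page_deferred_list(vmemmap);
        unsigned long idx = (unsigned long)(list - (char *)vmemmap) /
                            sizeof(struct page);

        for (unsigned int order = 0; order <= 3; order++) {
                unsigned long nr = 1UL << order;

                printf("order-%u: %lu struct pages, deferred_list in page[%lu]: %s\n",
                       order, nr, idx,
                       idx < nr ? "inside this folio" : "outside this folio");
        }
        return 0;
}

For order-0 and order-1 the "outside this folio" case is exactly the kind of
adjacent struct page corruption Ryan suspects, and it would also explain why
page_cache_ra_order() rounds order 1 down to 0.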