On 22/03/2023 13:36, Yin, Fengwei wrote: > > > On 3/22/2023 8:03 PM, Ryan Roberts wrote: >> Hi Matthew, >> >> On 17/03/2023 10:57, Ryan Roberts wrote: >>> Hi All, >>> >>> [...] >>> >>> Bug(s) >>> ====== >>> >>> When I run this code without the last (workaround) patch, with DEBUG_VM et al, >>> PROVE_LOCKING and KASAN enabled, I see occasional oopses. Mostly these are >>> relating to invalid kernel addresses (which usually look like either NULL + >>> small offset or mostly zeros with a few mid-order bits set + a small offset) or >>> lockdep complaining about a bad unlock balance. Call stacks are often in >>> madvise_free_pte_range(), but I've seen them in filesystem code too. (I can >>> email example oopses out separately if anyone wants to review them). My hunch is >>> that struct pages adjacent to the folio are being corrupted, but don't have hard >>> evidence. >>> >>> When adding the workaround patch, which prevents madvise_free_pte_range() from >>> attempting to split a large folio, I never see any issues. Although I'm not >>> putting the system under memory pressure so guess I might see the same types of >>> problem crop up under swap, etc. >>> >>> I've reviewed most of the code within split_folio() and can't find any smoking >>> gun, but I wonder if there are implicit assumptions about the large folio being >>> PMD sized that I'm obviously breaking now? >>> >>> The code in madvise_free_pte_range(): >>> >>> if (folio_test_large(folio)) { >>> if (folio_mapcount(folio) != 1) >>> goto out; >>> folio_get(folio); >>> if (!folio_trylock(folio)) { >>> folio_put(folio); >>> goto out; >>> } >>> pte_unmap_unlock(orig_pte, ptl); >>> if (split_folio(folio)) { >>> folio_unlock(folio); >>> folio_put(folio); >>> orig_pte = pte_offset_map_lock(mm, pmd, addr, &ptl); >>> goto out; >>> } >>> ... >>> } >> >> I've noticed that its folio_split() with a folio order of 1 that causes my >> problems. And I also see that the page cache code always explicitly never >> allocates order-1 folios: >> >> void page_cache_ra_order(struct readahead_control *ractl, >> struct file_ra_state *ra, unsigned int new_order) >> { >> ... >> >> while (index <= limit) { >> unsigned int order = new_order; >> >> /* Align with smaller pages if needed */ >> if (index & ((1UL << order) - 1)) { >> order = __ffs(index); >> if (order == 1) >> order = 0; >> } >> /* Don't allocate pages past EOF */ >> while (index + (1UL << order) - 1 > limit) { >> if (--order == 1) >> order = 0; >> } >> err = ra_alloc_folio(ractl, index, mark, order, gfp); >> if (err) >> break; >> index += 1UL << order; >> } >> >> ... >> } >> >> Matthew, what is the reason for this? I suspect its guarding against the same >> problem I'm seeing. >> >> If I explicitly prevent order-1 allocations for anon pages, I'm unable to cause >> any oops/panic/etc. I'd just like to understand the root cause. > Checked the struct folio definition. The _deferred_list is in third page struct. > My understanding is to support folio split, the folio order must >= 2. Thanks. Yep, looks like we have found the root cause - thanks for your help! I've updated calc_anonymous_folio_order() to only use non-0 order if THP is available and in that case, never allocate order-1. I think that both fixes the problem and manages the dependency we have on THP: static void calc_anonymous_folio_order(struct vm_fault *vmf, int *order_out, unsigned long *addr_out) { /* * The aim here is to determine what size of folio we should allocate * for this fault. Factors include: * - Folio must be naturally aligned within VA space * - Folio must not breach boundaries of vma * - Folio must be fully contained inside one pmd entry * - Folio must not overlap any non-none ptes * - Order must not be higher than *order_out upon entry * * Additionally, we do not allow order-1 since this breaks assumptions * elsewhere in the mm; THP pages must be at least order-2 (since they * store state up to the 3rd struct page subpage), and these pages must * be THP in order to correctly use pre-existing THP infrastructure such * as folio_split(). * * As a consequence of relying on the THP infrastructure, if the system * does not support THP, we always fallback to order-0. * * Note that the caller may or may not choose to lock the pte. If * unlocked, the calculation should be considered an estimate that will * need to be validated under the lock. */ struct vm_area_struct *vma = vmf->vma; int nr; int order; unsigned long addr; pte_t *pte; pte_t *first_set = NULL; int ret; if (has_transparent_hugepage()) { order = min(*order_out, PMD_SHIFT - PAGE_SHIFT); for (; order > 1; order--) { nr = 1 << order; addr = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE); pte = vmf->pte - ((vmf->address - addr) >> PAGE_SHIFT); /* Check vma bounds. */ if (addr < vma->vm_start || addr + nr * PAGE_SIZE > vma->vm_end) continue; /* Ptes covered by order already known to be none. */ if (pte + nr <= first_set) break; /* Already found set pte in range covered by order. */ if (pte <= first_set) continue; /* Need to check if all the ptes are none. */ ret = check_all_ptes_none(pte, nr); if (ret == nr) break; first_set = pte + ret; } if (order == 1) order = 0; } else { order = 0; } *order_out = order; *addr_out = order > 0 ? addr : vmf->address; } > > > Regards > Yin, Fengwei > >> >> Thanks, >> Ryan >> >> >> >>> >>> Will normally skip my large folios because they have a mapcount > 1, due to >>> incrementing mapcount for each pte, unlike PMD mapped pages. But on occasion it >>> will see a mapcount of 1 and proceed. So I guess this is racing against reclaim >>> or CoW in this case? >>> >>> I also see its doing a dance to take the folio lock and drop the ptl. Perhaps my >>> large anon folio is not using the folio lock in the same way as a THP would and >>> we are therefore not getting the expected serialization? >>> >>> I'd really appreciate any suggestions for how to pregress here! >>> >>