On Tue, Mar 5, 2024 at 10:15 AM David Hildenbrand <david@xxxxxxxxxx> wrote: > > > >>> Do we need a Fixes tag? > > > > I am not quite sure which commit should be here for a fixes tag. > > I think it's more of an optimization. > > Good, that helps! > > > > >>> > >> > >> What would be the description of the problem we are fixing? > >> > >> 1) failing to unmap? > >> > >> That can happen with small folios as well IIUC. > >> > >> 2) Putting the large folio on the deferred split queue? > >> > >> That sounds more reasonable. > > > > I don't feel it is reasonable. Avoiding this kind of accident splitting > > from the kernel's improper code is a more reasonable approach > > as there is always a price to pay for splitting and unfolding PTEs > > etc. > > > > While we can't avoid splitting coming from userspace's > > MADV_DONTNEED, munmap, mprotect, we have a way > > to ensure the kernel itself doesn't accidently break up a > > large folio. > > Note that on the next vmscan we would retry, find the remaining present > entries and swapout that thing completely :) This is true, but since we can finish the job the first time, it seems second retry is a cost :-) > > > > > In OPPO's phones, we ran into some weird bugs due to skipped PTEs > > in try_to_unmap_one. hardly could we fix it from the root cause. with > > various races, figuring out their timings was really a big pain :-) > > > > I can imagine. I assume, though, that it might be related to the way the > cont-pte bit was handled. Ryan's implementation should be able to cope > with that. I guess you are probably right. Ryan's implementation decouples CONT-PTE from mm core. nice to have it. > > > But we did "resolve" those bugs by entirely untouching all PTEs if we > > found some PTEs were skipped in try_to_unmap_one [1]. > > > > While we find we only get the PTL from 2nd, 3rd but not > > 1st PTE, we entirely give up on try_to_unmap_one, and leave > > all PTEs untouched. > > > > /* we are not starting from head */ > > if (!IS_ALIGNED((unsigned long)pvmw.pte, CONT_PTES * sizeof(*pvmw.pte))) { > > ret = false; > > atomic64_inc(&perf_stat.mapped_walk_start_from_non_head); > > set_pte_at(mm, address, pvmw.pte, pteval); > > page_vma_mapped_walk_done(&pvmw); > > break; > > } > > This will ensure all PTEs still have a unified state such as CONT-PTE > > after try_to_unmap fails. > > I feel this could have some false postive because when racing > > with unmap, 1st PTE might really become pte_none. So explicitly > > holding PTL from 1st PTE seems a better way. > > Can we estimate the "cost" of holding the PTL? > This is just moving PTL acquisition one or two PTE earlier in those corner cases. In normal cases, it doesn't affect when PTL is held. In normal cases, page_vma_mapped_walk will find PTE0 is present, thus hold PTL immediately. in corner cases, page_vma_mapped_walk races with break- before-make, after skipping one or two PTEs whose states are transferring, it will find a present pte then acquire lock. > -- > Cheers, > > David / dhildenb Thanks Barry