On Tue, Mar 5, 2024 at 3:27 AM David Hildenbrand <david@xxxxxxxxxx> wrote: > > On 04.03.24 14:03, Ryan Roberts wrote: > > On 04/03/2024 12:41, David Hildenbrand wrote: > >> On 04.03.24 13:20, Ryan Roberts wrote: > >>> Hi Barry, > >>> > >>> On 04/03/2024 10:37, Barry Song wrote: > >>>> From: Barry Song <v-songbaohua@xxxxxxxx> > >>>> > >>>> page_vma_mapped_walk() within try_to_unmap_one() races with other > >>>> PTEs modification such as break-before-make, while iterating PTEs > >>>> of a large folio, it will only begin to acquire PTL after it gets > >>>> a valid(present) PTE. break-before-make intermediately sets PTEs > >>>> to pte_none. Thus, a large folio's PTEs might be partially skipped > >>>> in try_to_unmap_one(). > >>> > >>> I just want to check my understanding here - I think the problem occurs for > >>> PTE-mapped, PMD-sized folios as well as smaller-than-PMD-size large folios? Now > >>> that I've had a look at the code and have a better understanding, I think that > >>> must be the case? And therefore this problem exists independently of my work to > >>> support swap-out of mTHP? (From your previous report I was under the impression > >>> that it only affected mTHP). > >>> > >>> Its just that the problem is becoming more pronounced because with mTHP, > >>> PTE-mapped large folios are much more common? > >> > >> That is my understanding. > >> > >>> > >>>> For example, for an anon folio, after try_to_unmap_one(), we may > >>>> have PTE0 present, while PTE1 ~ PTE(nr_pages - 1) are swap entries. > >>>> So folio will be still mapped, the folio fails to be reclaimed. > >>>> What’s even more worrying is, its PTEs are no longer in a unified > >>>> state. This might lead to accident folio_split() afterwards. And > >>>> since a part of PTEs are now swap entries, accessing them will > >>>> incur page fault - do_swap_page. > >>>> It creates both anxiety and more expense. While we can't avoid > >>>> userspace's unmap to break up unified PTEs such as CONT-PTE for > >>>> a large folio, we can indeed keep away from kernel's breaking up > >>>> them due to its code design. > >>>> This patch is holding PTL from PTE0, thus, the folio will either > >>>> be entirely reclaimed or entirely kept. On the other hand, this > >>>> approach doesn't increase PTL contention. Even w/o the patch, > >>>> page_vma_mapped_walk() will always get PTL after it sometimes > >>>> skips one or two PTEs because intermediate break-before-makes > >>>> are short, according to test. Of course, even w/o this patch, > >>>> the vast majority of try_to_unmap_one still can get PTL from > >>>> PTE0. This patch makes the number 100%. > >>>> The other option is that we can give up in try_to_unmap_one > >>>> once we find PTE0 is not the first entry we get PTL, we call > >>>> page_vma_mapped_walk_done() to end the iteration at this case. > >>>> This will keep the unified PTEs while the folio isn't reclaimed. > >>>> The result is quite similar with small folios with one PTE - > >>>> either entirely reclaimed or entirely kept. > >>>> Reclaiming large folios by holding PTL from PTE0 seems a better > >>>> option comparing to giving up after detecting PTL begins from > >>>> non-PTE0. > >>>> > >> > >> I'm sure that wall of text can be formatted in a better way :) . Also, I think > >> we can drop some of the details, > >> > >> If you need some inspiration, I can give it a shot. > >> > >>>> Cc: Hugh Dickins <hughd@xxxxxxxxxx> > >>>> Signed-off-by: Barry Song <v-songbaohua@xxxxxxxx> > >>> > >>> Do we need a Fixes tag? > >>> > >> > >> What would be the description of the problem we are fixing? > >> > >> 1) failing to unmap? > >> > >> That can happen with small folios as well IIUC. > >> > >> 2) Putting the large folio on the deferred split queue? > >> > >> That sounds more reasonable. > > > > Isn't the real problem today that we can end up writng a THP to the swap file > > (so 2M more IO and space used) but we can't remove it from memory, so no actual > > reclaim happens? Although I guess your (2) is really just another way of saying > > that. > > The same could happen with small folios I believe? We might end up > running into the > > folio_mapped() > > after the try_to_unmap(). > > Note that the actual I/O does not happen during add_to_swap(), but > during the pageout() call when we find the folio to be dirty. > > So there would not actually be more I/O. Only swap space would be > reserved, that would be used later when not running into the race. I am not worried about small folios at all as they have only one PTE. so the PTE is either completely unmapped or completely mapped. In terms of large folios, it is a different story. for example, a large folio with 16 PTEs with CONT-PTE, we will have 1. unfolded CONT-PTE, eg. PTE0 present, PTE1-PTE15 swap entries 2. page faults on PTE1-PTE15 after try_to_unmap if we access them. This is totally useless PF and can be avoided if we can try_to_unmap properly at the beginning. 3. potential need to split a large folio afterwards. for example, MADV_PAGEOUT, MADV_FREE might split it after finding it is not completely mapped. For small folios, we don't have any concern on the above issues. > > -- > Cheers, > > David / dhildenb > Thanks Barry