On Wed, Feb 28, 2024 at 2:37 AM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
>
> On 05/02/2024 09:51, Barry Song wrote:
> > +Chris, Suren and Chuanhua
> >
> > Hi Ryan,
> >
> >> +	/*
> >> +	 * __scan_swap_map_try_ssd_cluster() may drop si->lock during discard,
> >> +	 * so indicate that we are scanning to synchronise with swapoff.
> >> +	 */
> >> +	si->flags += SWP_SCANNING;
> >> +	ret = __scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order);
> >> +	si->flags -= SWP_SCANNING;
> >
> > nobody is using this scan_base afterwards. it seems a bit weird to
> > pass a pointer.
> >
> >> --- a/mm/vmscan.c
> >> +++ b/mm/vmscan.c
> >> @@ -1212,11 +1212,13 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >>  				if (!can_split_folio(folio, NULL))
> >>  					goto activate_locked;
> >>  				/*
> >> -				 * Split folios without a PMD map right
> >> -				 * away. Chances are some or all of the
> >> -				 * tail pages can be freed without IO.
> >> +				 * Split PMD-mappable folios without a
> >> +				 * PMD map right away. Chances are some
> >> +				 * or all of the tail pages can be freed
> >> +				 * without IO.
> >>  				 */
> >> -				if (!folio_entire_mapcount(folio) &&
> >> +				if (folio_test_pmd_mappable(folio) &&
> >> +				    !folio_entire_mapcount(folio) &&
> >>  				    split_folio_to_list(folio,
> >>  							folio_list))
> >>  					goto activate_locked;
> >> --
> >
> > Chuanhua and I ran this patchset for a couple of days and found a race
> > between reclamation and split_folio. This might cause applications to
> > get wrong data (zeroes) while swapping in.
> >
> > In case one thread (T1) is reclaiming a large folio by some means while
> > another thread (T2) is calling madvise MADV_PAGEOUT, and at the same
> > time we have two threads T3 and T4 swapping in in parallel. T1 doesn't
> > split and T2 does split, as below.

Hi Ryan,

> Hi Barry,
>
> Do you have a test case you can share that provokes this problem? And is
> this a separate problem to the race you solved with TTU_SYNC or is this
> solving the same problem?

They are the same.
After sending you the report about the races, I spent some time and finally
figured out what was happening and why corrupted data appeared while swapping
in. It is absolutely not your fault; TTU_SYNC does somehow resolve my problem,
though it is not the root cause. The corrupted data is only reproducible after
applying patch 4[1] of the swap-in series.

[1] [PATCH RFC 4/6] mm: support large folios swapin as a whole
https://lore.kernel.org/linux-mm/20240118111036.72641-5-21cnbao@xxxxxxxxx/

In case we have a large folio with 16 PTEs as below: after add_to_swap() it
gets swap offset 0x10000, and its PTEs are all present as it is still mapped.

PTE	pte_stat
PTE0	present
PTE1	present
PTE2	present
PTE3	present
...
PTE15	present

Then we get to try_to_unmap_one(). As try_to_unmap_one() doesn't hold the PTL
from PTE0 while it scans the PTEs, we might have

PTE	pte_stat
PTE0	none (someone is writing PTE0 for various reasons)
PTE1	present
PTE2	present
PTE3	present
...
PTE15	present

We hold the PTL from PTE1. After try_to_unmap_one(), the PTEs become

PTE	pte_stat
PTE0	present (someone finished the write of PTE0)
PTE1	swap 0x10001
PTE2	swap 0x10002
PTE3	swap 0x10003
...
PTE15	swap 0x1000F

Thus, after try_to_unmap_one(), the large folio is still mapped, so its
swapcache entry will still be there. Now a parallel thread runs MADV_PAGEOUT;
it finds this large folio is not completely mapped, so it splits the folio
into 16 small folios but keeps their swap offsets. Now in the swapcache we
have 16 folios with contiguous swap offsets. MADV_PAGEOUT will reclaim these
16 folios; after 16 new try_to_unmap_one() calls,

PTE	pte_stat
PTE0	swap 0x10000	SWAP_HAS_CACHE
PTE1	swap 0x10001	SWAP_HAS_CACHE
PTE2	swap 0x10002	SWAP_HAS_CACHE
PTE3	swap 0x10003	SWAP_HAS_CACHE
...
PTE15	swap 0x1000F	SWAP_HAS_CACHE

From this point, we can have various different cases for these 16 PTEs.
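The first step of the race, try_to_unmap_one() skipping a transiently-none
PTE and therefore leaving the folio partially mapped, can be modeled in plain
userspace C. This is only a toy model for illustration, not kernel code; the
names folio_model and toy_try_to_unmap_one, and the state enum, are all
invented here:

```c
#include <assert.h>
#include <stdbool.h>

#define NR_PTES 16

enum pte_state { PTE_PRESENT, PTE_NONE, PTE_SWAP };

/* Toy model (not kernel code): each "PTE" maps one subpage of the folio. */
struct folio_model {
	enum pte_state pte[NR_PTES];
	int mapcount;	/* how many PTEs still map the folio */
};

/*
 * Toy try_to_unmap_one(): walk the PTEs, skip any that are not present
 * (e.g. transiently none because another thread is mid break-before-make),
 * and convert the rest to swap entries.  Returns the remaining mapcount.
 */
static int toy_try_to_unmap_one(struct folio_model *f)
{
	for (int i = 0; i < NR_PTES; i++) {
		if (f->pte[i] != PTE_PRESENT)
			continue;	/* skipped: this PTE stays mapped */
		f->pte[i] = PTE_SWAP;
		f->mapcount--;
	}
	return f->mapcount;
}
```

With PTE0 transiently none, the toy unmap leaves mapcount at 1 rather than 0,
which is the condition that lets MADV_PAGEOUT later see a partially mapped
large folio and split it.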
For example,

PTE	pte_stat
PTE0	swap 0x10000	SWAP_HAS_CACHE = 0 -> became false after pageout finished and remove_mapping
PTE1	swap 0x10001	SWAP_HAS_CACHE = 0 -> became false after pageout finished and remove_mapping
PTE2	swap 0x10002	SWAP_HAS_CACHE = 0 -> became false due to concurrent swap-in and swap-out
PTE3	swap 0x10003	SWAP_HAS_CACHE = 1
...
PTE13	swap 0x1000D	SWAP_HAS_CACHE = 1
PTE14	swap 0x1000E	SWAP_HAS_CACHE = 1
PTE15	swap 0x1000F	SWAP_HAS_CACHE = 1

All of them have swap_count = 1 but different SWAP_HAS_CACHE; some of these
small folios might be in the swapcache, others might not. Then we run
do_swap_page() at one PTE whose SWAP_HAS_CACHE = 0 and swap_count = 1 (the
folio is not in the swapcache, thus has been written to swap), and we do this
check:

static bool pte_range_swap(pte_t *pte, int nr_pages)
{
	int i;
	swp_entry_t entry;
	unsigned type;
	pgoff_t start_offset;

	entry = pte_to_swp_entry(ptep_get_lockless(pte));
	if (non_swap_entry(entry))
		return false;
	start_offset = swp_offset(entry);
	if (start_offset % nr_pages)
		return false;

	type = swp_type(entry);
	for (i = 1; i < nr_pages; i++) {
		entry = pte_to_swp_entry(ptep_get_lockless(pte + i));
		if (non_swap_entry(entry))
			return false;
		if (swp_offset(entry) != start_offset + i)
			return false;
		if (swp_type(entry) != type)
			return false;
	}

	return true;
}

As those swap entries are contiguous, we will call swap_read_folio(). For
those folios which are still in the swapcache and haven't been written back,
we get zero-filled data from zRAM.

So the root cause is that pte_range_swap() should also check that all 16
swap_map entries have the same SWAP_HAS_CACHE, i.e. all false:

static bool is_pte_range_contig_swap(pte_t *pte, int nr_pages)
{
	...
	count = si->swap_map[start_offset];
	for (i = 1; i < nr_pages; i++) {
		entry = pte_to_swp_entry(ptep_get_lockless(pte + i));
		if (non_swap_entry(entry))
			return false;
		if (swp_offset(entry) != start_offset + i)
			return false;
		if (swp_type(entry) != type)
			return false;
		/* fall back to small folios if SWAP_HAS_CACHE isn't the same */
		if (si->swap_map[start_offset + i] != count)
			return false;
	}

	return true;
}

Somehow, though, TTU_SYNC "resolves" it by giving MADV_PAGEOUT no chance to
split this folio, as the large folio's PTEs are either entirely converted to
swap entries or entirely kept present.

Though the bug is within the swap-in series, I am still a big fan of TTU_SYNC
for large folio reclamation, for at least three reasons:

1. We remove some possibility of large folios failing to be reclaimed,
   improving reclamation efficiency.

2. We avoid many strange cases and potential folio splits during reclamation.
   Without TTU_SYNC, folios can be split later, or have some PTEs converted
   to swap entries while others are still present.

3. We don't increase PTL contention. My test shows try_to_unmap_one() will
   always get the PTL after it sometimes skips one or two PTEs, because the
   intermediate break-before-makes are short. Of course, most of the time
   try_to_unmap_one() will get the PTL from PTE0.

> Thanks,
> Ryan

Thanks
Barry
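To make the difference between the two checks concrete, here is a userspace
toy model contrasting an offsets-only contiguity check with the fixed check
that also requires identical swap_map bytes. Again, this is only a sketch:
the struct and function names are invented, and only SWAP_HAS_CACHE's value
(0x40) mirrors include/linux/swap.h:

```c
#include <assert.h>
#include <stdbool.h>

#define NR_PTES 16
#define SWAP_HAS_CACHE 0x40	/* mirrors the kernel's swap_map flag */

/* Toy swap state: the offset in each PTE's entry, plus swap_map bytes. */
struct toy_swap {
	unsigned long offset[NR_PTES];	/* swap offset in each PTE */
	unsigned char swap_map[NR_PTES];	/* count | SWAP_HAS_CACHE */
};

/* Buggy check: only verifies the swap offsets are contiguous. */
static bool contig_only(const struct toy_swap *s)
{
	for (int i = 1; i < NR_PTES; i++)
		if (s->offset[i] != s->offset[0] + i)
			return false;
	return true;
}

/* Fixed check: additionally requires every swap_map byte to match slot 0. */
static bool contig_and_same_swap_map(const struct toy_swap *s)
{
	if (!contig_only(s))
		return false;
	for (int i = 1; i < NR_PTES; i++)
		if (s->swap_map[i] != s->swap_map[0])
			return false;
	return true;
}
```

In the racy state above, the offsets are contiguous but some slots still have
SWAP_HAS_CACHE set, so contig_only() wrongly accepts the range while
contig_and_same_swap_map() rejects it and forces the small-folio fallback.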