On Thu, Feb 22, 2024 at 11:09 PM David Hildenbrand <david@xxxxxxxxxx> wrote:
>
> On 22.02.24 08:05, Barry Song wrote:
> > Hi Ryan,
> >
> >> diff --git a/mm/vmscan.c b/mm/vmscan.c
> >> index 2cc0cb41fb32..ea19710aa4cd 100644
> >> --- a/mm/vmscan.c
> >> +++ b/mm/vmscan.c
> >> @@ -1212,11 +1212,13 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >>  					if (!can_split_folio(folio, NULL))
> >>  						goto activate_locked;
> >>  					/*
> >> -					 * Split folios without a PMD map right
> >> -					 * away. Chances are some or all of the
> >> -					 * tail pages can be freed without IO.
> >> +					 * Split PMD-mappable folios without a
> >> +					 * PMD map right away. Chances are some
> >> +					 * or all of the tail pages can be freed
> >> +					 * without IO.
> >>  					 */
> >> -					if (!folio_entire_mapcount(folio) &&
> >> +					if (folio_test_pmd_mappable(folio) &&
> >> +					    !folio_entire_mapcount(folio) &&
> >>  					    split_folio_to_list(folio,
> >>  								folio_list))
> >>  						goto activate_locked;
> >
> > I ran a test to investigate what would happen while reclaiming a partially
> > unmapped large folio. For example, for a 64KiB large folio, MADV_DONTNEED
> > 4KB~64KB, and keep the first subpage 0~4KiB mapped.
>
> IOW, something that already happens with ordinary THP already IIRC.
>
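(For reference, this scenario can be driven from userspace roughly as in the
sketch below. It is a simplified illustration only, not the exact harness used
for the numbers further down; it assumes 64KiB mTHP has been enabled via
/sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled and that swap is
backed by zRAM.)

/*
 * Sketch: back a 64KiB range with one anonymous mTHP folio, zap subpages
 * 1..15 with MADV_DONTNEED so only the first 4KiB stays mapped, then let
 * reclaim pick up the partially unmapped large folio.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

#define FOLIO_SIZE	(64UL * 1024)
#define SUBPAGE_SIZE	4096UL

int main(void)
{
	/* Over-allocate so a 64KiB-aligned start address can be carved out. */
	char *raw = mmap(NULL, 2 * FOLIO_SIZE, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (raw == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	char *buf = (char *)(((unsigned long)raw + FOLIO_SIZE - 1) &
			     ~(FOLIO_SIZE - 1));

	/*
	 * Fault in the whole 64KiB range; with 64KiB mTHP enabled this is
	 * expected to be backed by a single large folio.
	 */
	memset(buf, 0x5a, FOLIO_SIZE);

	/* Zap subpages 1..15, keeping only subpage 0 mapped. */
	if (madvise(buf + SUBPAGE_SIZE, FOLIO_SIZE - SUBPAGE_SIZE,
		    MADV_DONTNEED))
		perror("madvise(MADV_DONTNEED)");

	/*
	 * Now force reclaim externally, e.g.
	 *   echo 1M > /sys/fs/cgroup/<group>/memory.reclaim
	 * for the cgroup this task runs in, and watch the instrumented
	 * swap-out path handle the partially unmapped folio.
	 */
	printf("pid %d: folio set up, trigger reclaim now\n", getpid());
	pause();
	return 0;
}

Forcing reclaim through memory pressure or cgroup v2 memory.reclaim sends the
folio through shrink_folio_list(), which is the path discussed below.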
> > My test wants to address my three concerns,
> > a. whether we will have a leak of swap slots
> > b. whether we will have redundant I/O
> > c. whether we will cause races on the swapcache
> >
> > What I have done is print folio->_nr_pages_mapped and dump the 16 swap_map[]
> > entries at some specific stages:
> > 1. just after add_to_swap (swap slots are allocated)
> > 2. before and after try_to_unmap (PTEs are set to swap entries)
> > 3. before and after pageout (also added a printk in the zram driver to dump all I/O writes)
> > 4. before and after remove_mapping
> >
> > The below is the dumped info for a particular large folio.
> >
> > 1. after add_to_swap
> > [   27.267357] vmscan: After add_to_swap shrink_folio_list 1947 mapnr:1
> > [   27.267650] vmscan: offset:101b0 swp_map 40-40-40-40-40-40-40-40-40-40-40-40-40-40-40-40
> >
> > As you can see, _nr_pages_mapped is 1 and all 16 swap_map entries are
> > SWAP_HAS_CACHE (0x40).
> >
> > 2. before and after try_to_unmap
> > [   27.268067] vmscan: before try to unmap shrink_folio_list 1991 mapnr:1
> > [   27.268372] try_to_unmap_one address:ffff731f0000 pte:e8000103cd0b43 pte_p:ffff0000c36a8f80
> > [   27.268854] vmscan: after try to unmap shrink_folio_list 1997 mapnr:0
> > [   27.269180] vmscan: offset:101b0 swp_map 41-40-40-40-40-40-40-40-40-40-40-40-40-40-40-40
> >
> > As you can see, one PTE is set to a swap entry and _nr_pages_mapped goes from
> > 1 to 0. The 1st swp_map becomes 0x41, i.e. SWAP_HAS_CACHE + 1.
> >
> > 3. before and after pageout
> > [   27.269602] vmscan: before pageout shrink_folio_list 2065 mapnr:0
> > [   27.269880] vmscan: offset:101b0 swp_map 41-40-40-40-40-40-40-40-40-40-40-40-40-40-40-40
> > [   27.270691] zram: zram_write_page page:fffffc00030f3400 index:101b0
> > [   27.271061] zram: zram_write_page page:fffffc00030f3440 index:101b1
> > [   27.271416] zram: zram_write_page page:fffffc00030f3480 index:101b2
> > [   27.271751] zram: zram_write_page page:fffffc00030f34c0 index:101b3
> > [   27.272046] zram: zram_write_page page:fffffc00030f3500 index:101b4
> > [   27.272384] zram: zram_write_page page:fffffc00030f3540 index:101b5
> > [   27.272746] zram: zram_write_page page:fffffc00030f3580 index:101b6
> > [   27.273042] zram: zram_write_page page:fffffc00030f35c0 index:101b7
> > [   27.273339] zram: zram_write_page page:fffffc00030f3600 index:101b8
> > [   27.273676] zram: zram_write_page page:fffffc00030f3640 index:101b9
> > [   27.274044] zram: zram_write_page page:fffffc00030f3680 index:101ba
> > [   27.274554] zram: zram_write_page page:fffffc00030f36c0 index:101bb
> > [   27.274870] zram: zram_write_page page:fffffc00030f3700 index:101bc
> > [   27.275166] zram: zram_write_page page:fffffc00030f3740 index:101bd
> > [   27.275463] zram: zram_write_page page:fffffc00030f3780 index:101be
> > [   27.275760] zram: zram_write_page page:fffffc00030f37c0 index:101bf
> > [   27.276102] vmscan: after pageout and before needs_release shrink_folio_list 2124 mapnr:0
> >
> > As you can see, we have obviously done redundant I/O: 16 zram_write_page calls.
> > Although 4~64KiB had already been zapped by zap_pte_range before, we still
> > write those subpages out to zRAM.
> >
> > 4. before and after remove_mapping
> > [   27.276428] vmscan: offset:101b0 swp_map 41-40-40-40-40-40-40-40-40-40-40-40-40-40-40-40
> > [   27.277485] vmscan: after remove_mapping shrink_folio_list 2169 mapnr:0 offset:0
> > [   27.277802] vmscan: offset:101b0 01-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
> >
> > As you can see, swp_map entries 1-15 become 0 and only the first swp_map is 1;
> > all SWAP_HAS_CACHE bits have been removed. This is perfect and there is no
> > swap slot leak at all!
> >
> > Thus, only two concerns are left for me,
> > 1. as we don't split anyway, we have done 15 unnecessary I/Os if a large folio
> >    is partially unmapped.
> > 2. the large folio is added to the swapcache as a whole, covering a range part
> >    of which has been zapped. I am not quite sure whether this will cause
> >    problems if some concurrent do_anon_page, swapin and swapout occurs between
> >    3 and 4 on the zapped subpage1~subpage15. still struggling.. my brain is
> >    exploding...
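(For completeness: the swp_map dumps above come from ad-hoc debug code along
the lines of the sketch below. The helper name is made up, it assumes the
folio is already in the swapcache so folio->swap is valid, and the unlocked
swap_map read is debug-only.)

/*
 * Debug-only sketch: dump _nr_pages_mapped and the swap_map[] bytes
 * covering a large folio's swap slots, roughly matching the traces above.
 * No si lock / cluster lock is taken, so values may be stale.
 */
static void dump_folio_swap_state(struct folio *folio, const char *tag)
{
	struct swap_info_struct *si = swp_swap_info(folio->swap);
	unsigned long offset = swp_offset(folio->swap);
	long i, nr = folio_nr_pages(folio);

	pr_info("%s mapnr:%d offset:%lx swp_map", tag,
		atomic_read(&folio->_nr_pages_mapped), offset);
	for (i = 0; i < nr; i++) {
		/*
		 * 0x40 is SWAP_HAS_CACHE (the slot still has a swapcache
		 * page), the remaining bits are the swap count; so 0x41
		 * means "in swapcache and referenced by one swap PTE".
		 */
		pr_cont("-%02x", READ_ONCE(si->swap_map[offset + i]));
	}
	pr_cont("\n");
}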
>
> Just noting: I was running into something different in the past with
> THP. And it's effectively the same scenario, just swapout and
> MADV_DONTNEED reversed.
>
> Imagine you swapped out a THP and the THP is still in the swapcache.
>
> Then you unmap/zap some PTEs, freeing up the swap slots.
>
> In zap_pte_range(), we'll call free_swap_and_cache(). There, we run into
> the "!swap_page_trans_huge_swapped(p, entry)", and we won't be calling
> __try_to_reclaim_swap().

I guess you mean swap_page_trans_huge_swapped(p, entry), not
!swap_page_trans_huge_swapped(p, entry)? At that time,
swap_page_trans_huge_swapped() should be true, as there are still some entries
whose swap_map is 0x41 or above (SWAP_HAS_CACHE plus swap_count >= 1):

static bool swap_page_trans_huge_swapped(struct swap_info_struct *si,
					 swp_entry_t entry,
					 unsigned int nr_pages)
{
	...
	for (i = 0; i < nr_pages; i++) {
		if (swap_count(map[offset + i])) {
			ret = true;
			break;
		}
	}
unlock_out:
	unlock_cluster_or_swap_info(si, ci);
	return ret;
}

So this will prevent freeing the swap slots even for those PTEs which have
already been zapped?

Another case I have reported[1] is that while reclaiming a large folio,
try_to_unmap_one() calls page_vma_mapped_walk(). As it only begins to hold the
PTL after it hits a valid PTE, a parallel break-before-make can zero the 0th,
1st and following PTEs, so try_to_unmap_one() can read intermediate PTE values
and we can run into the cases below. After try_to_unmap_one:

pte 0 - untouched, present pte
pte 1 - untouched, present pte
pte 2 - swap entry
pte 3 - swap entry
...
pte n - swap entry

or

pte 0 - untouched, present pte
pte 1 - swap entry
pte 2 - swap entry
pte 3 - swap entry
...
pte n - swap entry

etc.

Thus, after try_to_unmap, the folio is still partially mapped: some PTEs have
become swap entries while others are still present. It might stay in the
swapcache for a long time with a broken CONT-PTE.

I also hate that, and hope for a synchronous way to let a large folio hold the
PTL starting from the 0th PTE, so that it won't observe intermediate PTEs from
another break-before-make. This wouldn't increase PTL contention either, as my
test shows we always get the PTL for a large folio after skipping zero, one or
two PTEs in try_to_unmap_one. But skipping one or two is really bad and sad,
as it breaks a large folio from one whole unit into nr_pages separate segments.

[1] https://lore.kernel.org/linux-mm/CAGsJ_4wo7BiJWSKb1K_WyAai30KmfckMQ3-mCJPXZ892CtXpyQ@xxxxxxxxxxxxxx/

>
> So we won't split the large folio that is in the swapcache and it will
> continue consuming "more memory" than intended until fully evicted.
>
> >
> > To me, it seems safer to split or do some other similar optimization if we find a
> > large folio has partial map and unmap.
>
> I'm hoping that we can avoid any new direct users of _nr_pages_mapped if
> possible.
>

Is _nr_pages_mapped < nr_pages a reasonable condition for splitting, as we then
know the folio has at least some subpages zapped?

> If we find that the folio is on the deferred split list, we might as
> well just split it right away, before swapping it out. That might be a
> reasonable optimization for the case you describe.

I tried to change Ryan's code as below:

@@ -1905,11 +1922,12 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 					 * PMD map right away. Chances are some
 					 * or all of the tail pages can be freed
 					 * without IO.
+					 * Similarly, split PTE-mapped folios if
+					 * they have already been deferred_split.
 					 */
-					if (folio_test_pmd_mappable(folio) &&
-					    !folio_entire_mapcount(folio) &&
-					    split_folio_to_list(folio,
-								folio_list))
+					if (((folio_test_pmd_mappable(folio) && !folio_entire_mapcount(folio)) ||
+					     (!folio_test_pmd_mappable(folio) && !list_empty(&folio->_deferred_list)))
+					    && split_folio_to_list(folio, folio_list))
 						goto activate_locked;
 				}
 				if (!add_to_swap(folio)) {

It seems to work as expected: only one I/O is left for a large folio with 16
PTEs, 15 of which had been zapped before.

> --
> Cheers,
>
> David / dhildenb
>

Thanks
Barry