On Mon, Jan 01, 2024 at 07:33:16PM +0800, Hillf Danton wrote:
> On Mon, 1 Jan 2024 09:07:52 +0000 Matthew Wilcox
> > On Mon, Jan 01, 2024 at 09:55:04AM +0800, Hillf Danton wrote:
> > > On Sun, 31 Dec 2023 13:07:03 +0000 Matthew Wilcox <willy@xxxxxxxxxxxxx>
> > > > I don't think this can happen.  Look at the call trace;
> > > > block_dirty_folio() is called from unmap_page_range().  That means the
> > > > page is in the page tables.  We unmap the pages in a folio from the
> > > > page tables before we set folio->mapping to NULL.  Look at
> > > > invalidate_inode_pages2_range() for example:
> > > >
> > > >                 unmap_mapping_pages(mapping, indices[i],
> > > >                                 (1 + end - indices[i]), false);
> > > >         folio_lock(folio);
> > > >         folio_wait_writeback(folio);
> > > >         if (folio_mapped(folio))
> > > >                 unmap_mapping_folio(folio);
> > > >         BUG_ON(folio_mapped(folio));
> > > >         if (!invalidate_complete_folio2(mapping, folio))
> > >
> > > What is missed here is the same check [1] in invalidate_inode_pages2_range(),
> > > so I built no wheel.
> > >
> > >         folio_lock(folio);
> > >         if (unlikely(folio->mapping != mapping)) {
> > >                 folio_unlock(folio);
> > >                 continue;
> > >         }
> > >
> > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/truncate.c#n658
> >
> > That's entirely different.  That's checking in the truncate path whether
> > somebody else already truncated this page.  What I was showing was why
> > a page found through a page table walk cannot have been truncated (which
> > is actually quite interesting, because it's the page table lock that
> > prevents the race).
>
> Feel free to shed light on how ptl protects folio->mapping.

The documentation for __folio_mark_dirty() hints at it:

 * The caller must hold folio_memcg_lock().  Most callers have the folio
 * locked.  A few have the folio blocked from truncation through other
 * means (eg zap_vma_pages() has it mapped and is holding the page table
 * lock).  This can also be called from mark_buffer_dirty(), which I
 * cannot prove is always protected against truncate.

Re-reading that now, I _think_ mark_buffer_dirty() always has to be
called with a reference on the bufferhead, which means that a racing
truncate will fail due to

invalidate_inode_pages2_range -> invalidate_complete_folio2 ->
filemap_release_folio -> try_to_free_buffers -> drop_buffers ->
buffer_busy

From an mm point of view, what is implicit is that truncate calls

unmap_mapping_folio -> unmap_mapping_range_tree ->
unmap_mapping_range_vma -> zap_page_range_single -> unmap_single_vma ->
unmap_page_range -> zap_p4d_range -> zap_pud_range -> zap_pmd_range ->
zap_pte_range -> pte_offset_map_lock()

So a truncate will take the page lock, then spin on the pte lock until
the racing munmap() has finished (ok, this was an exit(), not a
munmap(), but exit() does an implicit munmap()).
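
To make the lock ordering concrete, here is a minimal userspace sketch
(plain C with pthreads; folio_lock, pte_lock, mapping and mapped are
stand-in names I made up, not the kernel objects).  It only models the
invariant argued above: the zap path checks ->mapping while holding the
pte lock, and the truncate path must take that same pte lock to do its
unmap before it may clear ->mapping, so a page seen as mapped under the
pte lock cannot have been truncated:

/* Purely illustrative userspace model -- not kernel code.  Only the
 * lock ordering is meant to be faithful.  Build: cc -pthread model.c */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t folio_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t pte_lock = PTHREAD_MUTEX_INITIALIZER;

static void *mapping = (void *)1;   /* models folio->mapping */
static int mapped = 1;              /* models "present in page tables" */

/* zap_pte_range() analogue: runs entirely under the pte lock */
static void *zap_path(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&pte_lock);
        /* block_dirty_folio() analogue: if the page is still mapped,
         * truncate has not yet passed its unmap step (which needs
         * pte_lock), so ->mapping cannot have been cleared */
        if (mapped && mapping)
                printf("zap: mapping still valid, safe to dirty\n");
        mapped = 0;             /* the zap clears the pte */
        pthread_mutex_unlock(&pte_lock);
        return NULL;
}

/* invalidate_inode_pages2_range() analogue: folio lock, then pte lock */
static void *truncate_path(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&folio_lock);
        /* unmap_mapping_folio() analogue: the unmap itself runs under
         * pte_lock, so it waits here for a racing zap to finish */
        pthread_mutex_lock(&pte_lock);
        mapped = 0;
        pthread_mutex_unlock(&pte_lock);
        mapping = NULL;         /* only now is the folio truncated */
        pthread_mutex_unlock(&folio_lock);
        return NULL;
}

int main(void)
{
        pthread_t zap, trunc;

        pthread_create(&zap, NULL, zap_path, NULL);
        pthread_create(&trunc, NULL, truncate_path, NULL);
        pthread_join(zap, NULL);
        pthread_join(trunc, NULL);
        return 0;
}

Whichever thread wins the race for pte_lock, the zap path can never
observe mapped == 1 with mapping already NULL, which is the property the
page table lock buys us in the real code.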