On Fri, 15 Sep 2023, Matthew Wilcox wrote: > On Wed, Aug 16, 2023 at 01:27:17PM -0700, Hugh Dickins wrote: > > > This problem predates the folio work; it could for example have been > > > triggered by mmaping a THP in tmpfs and using that as the target of an > > > O_DIRECT read. > > > > > > Fixes: 800d8c63b2e98 ("shmem: add huge pages support") > > > > No. It's a good catch, but bug looks specific to the folio work to me. > > > > Almost all shmem pages are dirty from birth, even as soon as they are > > brought back from swap; so it is not necessary to re-mark them dirty. > > > > The exceptions are pages allocated to holes when faulted: so you did > > get me worried as to whether khugepaged could collapse a pmd-ful of > > those into a THP without marking the result as dirty. > > > > But no, in v6.5-rc6 the collapse_file() success path has > > if (is_shmem) > > folio_mark_dirty(folio); > > and in v5.10 the same appears as > > if (is_shmem) > > set_page_dirty(new_page); > > > > (IIRC, that or marking pmd dirty was missed from early shmem THP > > support, but fairly soon corrected, and backported to stable then. > > I have a faint memory of versions which assembled pmd_dirty from > > collected pte_dirtys.) > > > > And the !is_shmem case is for CONFIG_READ_ONLY_THP_FOR_FS: writing > > into those pages, by direct IO or whatever, is already prohibited. > > > > It's dem dirty (or not dirty) folios dat's the trouble! > > Thanks for the correction! Could it happen with anon THP? > They're not kept dirty from birth ... are they? Anon pages, THP or other, are not marked dirty from birth, right. But nor are they considered for freeing without writeout: shrink_folio_list() does add_to_swap() on them without considering dirtiness, and add_to_swap() does an unconditional folio_mark_dirty(). Well, not quite unconditional: it is conditional on allocating swap, but shrink_folio_list() just reactivates when swap is not allocated. So, I see no opportunity for data loss there. When it's read back from swap later on, the folio will be left clean while it matches swap: I haven't bothered to recheck the latest details of what happens when it's CoWed, or the swap is deleted, those details won't matter given the behavior above. But might there be a direct IO problem while that anon folio (large or small) remains clean in swapcache, when reclaim's pageout() might be liable to free it without rewriting? There ought not to be: get_user_pages()/follow_page_pte() have taken care of that for many years with the FOLL_TOUCH+FOLL_WRITE if (flags & FOLL_TOUCH) { if ((flags & FOLL_WRITE) && !pte_dirty(pte) && !PageDirty(page)) set_page_dirty(page); and follow_trans_huge_pmd() dirties the pmd when FOLL_TOUCH+FOLL_WRITE. I forget why follow_page_pte() prefers to dirty page rather than pte, but either approach should be good enough to avoid the data loss. However, I don't see equivalent FOLL_TOUCH+FOLL_WRITE code where get_user_pages_fast() goes its fast route; nor pin_user_pages_fast() used by lib/iov_iter.c; and it looks odd that pin_user_pages_remote() and pin_user_pages_unlocked() use FOLL_TOUCH but pin_user_pages() not. Problem? Not particularly for anon or for large, but for any? Or not a problem because of final set_page_dirty() or folio_mark_dirty() on release - only a problem while that PageCompound check remains? (Of course filesystems hate behind-the-back dirtying for other reasons, that they may lose the data without proper notice: separate discussion we'd better not get back into here.) I've spent much longer trying to answer this than I could afford, expect no more from me, back to you and GUP+PIN experts. Hugh