On Oct 28, 2022, at 5:42 PM, Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:

> I think the proper fix (or at least _a_ proper fix) would be to
> actually carry the dirty bit along to the __tlb_remove_page() point,
> and actually treat it exactly the same way as the page pointer itself
> - set the page dirty after the TLB flush, the same way we can free the
> page after the TLB flush.
>
> We could easily hide said dirty bit in the low bits of the
> "batch->pages[]" array or something like that. We'd just have to add
> the 'dirty' argument to __tlb_remove_page_size() and friends.

Thank you for your quick response. I was slow to respond due to jet lag.

Anyhow, I am not sure whether the solution that you propose would work.
Please let me know if my understanding makes sense.

Let’s assume that we do not call set_page_dirty() before we remove the rmap
but only after we invalidate the page’s TLB entry [*]. Let’s assume that
shrink_page_list() is called after the page’s rmap is removed and the page
is no longer mapped, but before set_page_dirty() is actually called.

In such a case, shrink_page_list() would consider the page clean, and would
indeed keep the page (since __remove_mapping() would find an elevated page
refcount), which appears to give us a chance to mark the page as dirty
later.

However, IIUC, in this case shrink_page_list() might still call
filemap_release_folio() and release the buffers, so calling
set_page_dirty() afterwards - after the actual TLB invalidation took
place - would fail.

> Your idea of "do the page_remove_rmap() late instead" would also work,
> but the reason I think just squirrelling away the dirty bit is the
> "proper" fix is that it would get rid of the whole need for
> 'force_flush' in this area entirely. So we'd not only fix that race
> you noticed, we'd actually do so and reduce the number of TLB flushes
> too.

I’m all for reducing the number of TLB flushes, and your solution does sound
better in general. I proposed something that I considered to be the path of
least resistance (i.e., least chance of breaking something). I can do what
you proposed, but I am not sure how to deal with the buffers being removed.

One more note: This issue, I think, also affects
migrate_vma_collect_pmd(). Alistair recently addressed an issue there, but
in my prior feedback to him I missed this issue.

[*] Note that our scenario would be pretty much the same if we also called
set_page_dirty() before removing the rmap, but the page was cleaned while
the TLB invalidation had still not been performed.
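
For reference, the low-bit encoding suggested in the quoted text could be sketched roughly as follows. This is only a user-space illustration under stated assumptions, not kernel code: struct page here is a stand-in, and encoded_page_t, encode_page(), batch_add() and batch_flush() are hypothetical names made up for the example. It relies on page pointers being at least 2-byte aligned so that bit 0 is free, and it models "set the page dirty after the TLB flush" only. It does not address the concern above about the buffers being released in the meantime.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the kernel's struct page (assumption, not the real layout). */
struct page {
	unsigned long flags;
	bool dirty;
};

/* Hypothetical: a page pointer with the dirty flag packed into bit 0. */
typedef uintptr_t encoded_page_t;
#define ENC_DIRTY_BIT ((uintptr_t)1)

static inline encoded_page_t encode_page(struct page *page, bool dirty)
{
	uintptr_t v = (uintptr_t)page;

	/* Bit 0 must be free, i.e. the pointer is at least 2-byte aligned. */
	assert((v & ENC_DIRTY_BIT) == 0);
	return v | (dirty ? ENC_DIRTY_BIT : 0);
}

static inline struct page *encoded_page_ptr(encoded_page_t e)
{
	return (struct page *)(e & ~ENC_DIRTY_BIT);
}

static inline bool encoded_page_dirty(encoded_page_t e)
{
	return (e & ENC_DIRTY_BIT) != 0;
}

#define BATCH_MAX 8

/* Hypothetical, simplified analogue of the "batch->pages[]" array. */
struct batch {
	size_t nr;
	encoded_page_t pages[BATCH_MAX];
};

/* Queue a page together with its dirty state; false if the batch is full. */
static bool batch_add(struct batch *b, struct page *page, bool dirty)
{
	if (b->nr == BATCH_MAX)
		return false;
	b->pages[b->nr++] = encode_page(page, dirty);
	return true;
}

/*
 * Meant to run only after the (simulated) TLB flush: apply the deferred
 * dirty marks, then empty the batch. Returns how many pages were marked.
 */
static size_t batch_flush(struct batch *b)
{
	size_t marked = 0;

	for (size_t i = 0; i < b->nr; i++) {
		struct page *page = encoded_page_ptr(b->pages[i]);

		if (encoded_page_dirty(b->pages[i])) {
			page->dirty = true; /* stand-in for set_page_dirty() */
			marked++;
		}
	}
	b->nr = 0;
	return marked;
}
```

In this sketch a caller would use batch_add(&b, page, dirty) where it would otherwise mark the page dirty immediately, and batch_flush(&b) once the flush is done. The window between those two calls is exactly where the page looks clean to anyone else who inspects it.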