On Mon, May 15, 2023 at 02:03:15PM +0300, Kirill A . Shutemov wrote: > On Thu, May 04, 2023 at 10:27:50PM +0100, Lorenzo Stoakes wrote: > > Writing to file-backed mappings which require folio dirty tracking using > > GUP is a fundamentally broken operation, as kernel write access to GUP > > mappings do not adhere to the semantics expected by a file system. > > > > A GUP caller uses the direct mapping to access the folio, which does not > > cause write notify to trigger, nor does it enforce that the caller marks > > the folio dirty. > > Okay, problem is clear and the patchset look good to me. But I'm worried > breaking existing users. > > Do we expect the change to be visible to real world users? If yes, are we > okay to break them? The general consensus at the moment is that there is no entirely reasonable usage of this case and you're already running the riks of a kernel oops if you do this, so it's already broken. > > One thing that came to mind is KVM with "qemu -object memory-backend-file,share=on..." > It is mostly used for pmem emulation. > > Do we have plan B? Yes, we can make it opt-in or opt-out via a FOLL_FLAG. This would be easy to implement in the event of any issues arising. > > Just a random/crazy/broken idea: > > - Allow folio_mkclean() (and folio_clear_dirty_for_io()) to fail, > indicating that the page cannot be cleared because it is pinned; > > - Introduce a new vm_operations_struct::mkclean() that would be called by > page_vma_mkclean_one() before clearing the range and can fail; > > - On GUP, create an in-kernel fake VMA that represents the file, but with > custom vm_ops. The VMA registered in rmap to get notified on > folio_mkclean() and fail it because of GUP. > > - folio_clear_dirty_for_io() callers will handle the new failure as > indication that the page can be written back but will stay dirty and > fs-specific data that is associated with the page writeback cannot be > freed. > > I'm sure the idea is broken on many levels (I have never looked closely at > the writeback path). But maybe it is good enough as conversation started? > Yeah there are definitely a few ideas down this road that might be possible, I am not sure how a filesystem can be expected to cope or this to be reasonably used without dirty/writeback though because you'll just not track anything or I guess you mean the mapping would be read-only but somehow stay dirty? I also had ideas along these lines of e.g. having a special vmalloc mode which mimics the correct wrprotect settings + does the right thing, but of course that does nothing to help DMA writing to a GUP-pinned page. Though if the issue is at the point of the kernel marking the page dirty unexpectedly, perhaps we can just invoke the mkwrite() _there_ before marking dirty? There are probably some sycnhronisation issues there too. Jason will have some thoughts on this I'm sure. I guess the key question here is - is it actually feasible for this to work at all? Once we establish that, the rest are details :) > -- > Kiryl Shutsemau / Kirill A. Shutemov