On Fri, May 18, 2018 at 08:51:38PM -0700, Dan Williams wrote:
> >> +1, and I am now super-interested in this conversation, because
> >> after tracking down a kernel BUG to this classic mistaken pattern:
> >>
> >> get_user_pages (on file-backed memory from ext4)
> >> ...do some DMA
> >> set_pages_dirty
> >> put_page(s)
> >
> > Ummm, RDMA has done essentially that since 2005, since when did it
> > become wrong? Do you have some references? Is there some alternative?
> >
> > See __ib_umem_release
> >
> >> ...there is (rarely!) a backtrace from ext4, that disavows ownership of
> >> any such pages.
> >
> > Yes, I've seen that oops with RDMA, apparently isn't actually that
> > rare if you tweak things just right.
> >
> > I thought it was an obscure ext4 bug :(
> >
> >> Because the obvious "fix" in device driver land is to use a dedicated
> >> buffer for DMA, and copy to the filesystem buffer, and of course I will
> >> get *killed* if I propose such a performance-killing approach. But a
> >> core kernel fix really is starting to sound attractive.
> >
> > Yeah, killed is right. That idea totally cripples RDMA.
> >
> > What is the point of get_user_pages FOLL_WRITE if you can't write to
> > and dirty the pages!?!
>
> You're oversimplifying the problem, here are the details:
>
> https://www.spinics.net/lists/linux-mm/msg142700.html

Suggestion 1:

in get_user_pages_fast(), mark the page as dirty, but don't tag the radix
tree entry as dirty. Then vmscan() won't find it when it's looking to
write out dirty pages. Only mark it as dirty in the radix tree once
we call set_page_dirty_lock().

Suggestion 2:

in get_user_pages_fast(), replace the page in the radix tree with a special
entry that means "page under io". In set_page_dirty_lock(), replace the
"page under io" entry with the struct page pointer.

Both of these suggestions have trouble with simultaneous sub-page IOs to the
same page. Do we care?
I suspect we might as pages get larger (see also: supporting THP pages in the page cache).
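
To make Suggestion 1 and the sub-page IO concern a bit more concrete, here
is a rough userspace toy model. It is not kernel code: every toy_* name is
made up for illustration, one slot stands in for a radix tree entry, and the
pin counter is only there so the model can tell when an IO is still in
flight. The race it shows is my reading of the "simultaneous sub-page IOs"
problem, not something taken from a real implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy userspace model, NOT kernel code. All names are hypothetical. */
struct toy_page {
    bool dirty;     /* stands in for the page's own dirty flag */
    int  pins;      /* number of in-flight IOs that pinned the page */
};

struct toy_slot {
    struct toy_page *page;
    bool dirty_tag; /* stands in for the radix tree dirty tag */
};

/* Pin side: dirty the page itself, but leave the tree entry untagged. */
static void toy_gup_fast(struct toy_slot *slot)
{
    slot->page->pins++;
    slot->page->dirty = true;
}

/* Release side: only now does the tree entry get tagged dirty. */
static void toy_set_page_dirty_lock(struct toy_slot *slot)
{
    slot->dirty_tag = true;
}

static void toy_put_page(struct toy_slot *slot)
{
    slot->page->pins--;
}

/* Writeback only looks at tagged slots; returns the page it would write. */
static struct toy_page *toy_writeback_scan(struct toy_slot *slot)
{
    return slot->dirty_tag ? slot->page : NULL;
}

/* True if writeback would pick up a page that still has an IO in flight. */
static bool toy_writeback_races_with_io(struct toy_slot *slot)
{
    struct toy_page *p = toy_writeback_scan(slot);
    return p != NULL && p->pins > 0;
}

/* Walks through one IO, then two overlapping IOs to the same page. */
static int toy_demo(void)
{
    struct toy_page pg = { false, 0 };
    struct toy_slot slot = { &pg, false };

    /* Single IO: page is dirty, but writeback cannot find it yet. */
    toy_gup_fast(&slot);
    assert(pg.dirty && toy_writeback_scan(&slot) == NULL);

    /* A second sub-page IO starts before the first one completes. */
    toy_gup_fast(&slot);

    /* First IO completes and tags the tree while the second is in flight. */
    toy_set_page_dirty_lock(&slot);
    toy_put_page(&slot);
    assert(toy_writeback_races_with_io(&slot));

    /* Second IO completes; now writeback is safe again. */
    toy_set_page_dirty_lock(&slot);
    toy_put_page(&slot);
    assert(!toy_writeback_races_with_io(&slot));
    return 0;
}
```

Calling toy_demo() runs through both cases. In this model, Suggestion 2
would presumably hit the same window: the first completion would swap the
"page under io" entry back to the struct page pointer while the second IO
is still writing.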