On 05/18/2018 01:23 PM, Dan Williams wrote:
> On Fri, May 18, 2018 at 10:36 AM, Jason Gunthorpe <jgg@xxxxxxxx> wrote:
>> On Fri, May 18, 2018 at 04:47:48PM +0000, Christopher Lameter wrote:
>>> On Fri, 18 May 2018, Jason Gunthorpe wrote:
>>> ---8<---------------------------------
>>>
>>> The newcomer here is RDMA. The FS side is the mainstream use case and has
>>> been there since Unix learned to do paging.
>>
>> Well, it has been this way for 12 years, so it isn't that new.
>>
>> Honestly it sounds like get_user_pages is just a broken Linux
>> API??
>>
>> Nothing can use it to write to pages because the FS could explode -
>> RDMA makes it particularly easy to trigger this due to the longer time
>> windows, but presumably any get_user_pages could generate a race and
>> hit this? Is that right?

+1, and I am now super-interested in this conversation, because
after tracking down a kernel BUG to this classic mistaken pattern:

    get_user_pages (on file-backed memory from ext4)
    ...do some DMA
    set_pages_dirty
    put_page(s)

...there is (rarely!) a backtrace from ext4, that disavows ownership of
any such pages. It happens rarely enough that people have come to believe
that the pattern is OK, from what I can tell. But some new, cutting edge
systems with zillions of threads and lots of memory are able to expose the
problem.

Anyway, I've been dividing my time between trying to prove exactly
which FS action is disconnecting the page from ext4 in this particular
bug (even though it's lately becoming well-documented that the design
itself is not correct), and casting about for the most proper place to
fix this.

Because the obvious "fix" in device driver land is to use a dedicated
buffer for DMA, and copy to the filesystem buffer, and of course I will
get *killed* if I propose such a performance-killing approach. But a
core kernel fix really is starting to sound attractive.
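To make the ordering in that pattern concrete, here is a toy user-space
model of it. This is only a sketch, not kernel code: `struct toy_page` and
every `toy_*` function are hypothetical names made up for illustration, and
the filesystem step is an assumption about what might happen during the DMA
window, not a confirmed description of what ext4 does in this bug.

```c
#include <stdbool.h>

/* Toy stand-in for a file-backed page; NOT the kernel's struct page. */
struct toy_page {
    int  pin_count;   /* references held by the "driver" */
    bool dirty;       /* page holds data not yet written back */
    bool fs_tracked;  /* filesystem still has its private state attached */
};

/* Driver side of the pattern: pin, dirty after DMA, unpin. */
static void toy_pin(struct toy_page *p)       { p->pin_count++; }
static void toy_set_dirty(struct toy_page *p) { p->dirty = true; }
static void toy_unpin(struct toy_page *p)     { p->pin_count--; }

/*
 * Filesystem side (assumed): write the page back, then drop the
 * filesystem's private state for it. Note that the pin count is not
 * consulted at all, which is the heart of the problem being modeled.
 */
static void toy_fs_writeback_and_release(struct toy_page *p)
{
    p->dirty = false;
    p->fs_tracked = false;
}

/* A dirty page the filesystem no longer tracks is the unexpected state. */
static bool toy_page_is_orphaned_dirty(const struct toy_page *p)
{
    return p->dirty && !p->fs_tracked;
}

/* Replays the sequence in order; returns true if the bad state is reached. */
static bool toy_race_sequence(void)
{
    struct toy_page page = { .pin_count = 0, .dirty = false, .fs_tracked = true };

    toy_pin(&page);                       /* get_user_pages */
    toy_fs_writeback_and_release(&page);  /* FS acts during the DMA window */
    toy_set_dirty(&page);                 /* driver dirties the page after DMA */
    toy_unpin(&page);                     /* put_page */

    return toy_page_is_orphaned_dirty(&page);
}
```

Calling `toy_race_sequence()` from a `main()` returns true: the page ends up
dirty while the (toy) filesystem no longer tracks it. In the real kernel the
filesystem step only lands inside the window occasionally, which would fit
with the backtrace being rare.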
>>
>> I am left with the impression that solving it in the FS is too
>> performance costly so FS doesn't want that overhead? Was that also
>> the conclusion?
>>
>> Could we take another crack at this during Linux Plumbers? Will the MM
>> parties be there too? I'm sorry I wasn't able to attend LSFMM this
>> year!
>
> Yes, you and hch were missed, and I had to skip the last day due to a
> family emergency.
>
> Plumbers sounds good to resync on this topic, but we already have a
> plan, use "break_layouts()" to coordinate a filesystem's need to move
> dax blocks around relative to an active RDMA memory registration. If
> you never punch a hole in the middle of your RDMA registration then
> you never incur any performance penalty. Otherwise the layout break
> notification is just there to tell the application "hey man, talk to
> your friend that punched a hole in the middle of your mapping, but the
> filesystem wants this block back now. Sorry, I'm kicking you out. Ok,
> bye.".
>
> In other words, get_user_pages_longterm() is just a short term
> band-aid for RDMA until we can get that infrastructure built. We don't
> need to go down any mmu-notifier rabbit holes.
>

git grep claims that break_layouts is so far an XFS-only feature, though.
Were there plans to fix this for all filesystems?

thanks,
-- 
John Hubbard
NVIDIA