From: John Hubbard <jhubbard@xxxxxxxxxx>

This fixes a few problems that came up when using devices (NICs, GPUs,
for example) that want to have direct access to a chunk of system (CPU)
memory, so that they can DMA to/from that memory. Problems [1] come up
if that memory is backed by persistent storage; for example, an ext4
file system. I've been working on several customer bugs that are hitting
this, and this patchset fixes those bugs.

The bugs happen via:

-- get_user_pages() on some ext4-backed pages
-- device does DMA for a while to/from those pages

    -- Somewhere in here, some of the pages get disconnected from the
       file system, via try_to_unmap() and eventually drop_buffers()

-- device is all done, device driver calls set_page_dirty_lock(), then
   put_page()

And then at some point, we see this BUG():

    kernel BUG at /build/linux-fQ94TU/linux-4.4.0/fs/ext4/inode.c:1899!
    backtrace:
        ext4_writepage
        __writepage
        write_cache_pages
        ext4_writepages
        do_writepages
        __writeback_single_inode
        writeback_sb_inodes
        __writeback_inodes_wb
        wb_writeback
        wb_workfn
        process_one_work
        worker_thread
        kthread
        ret_from_fork

...which is due to the file system asserting that there are still buffer
heads attached:

        ({                                                      \
                BUG_ON(!PagePrivate(page));                     \
                ((struct buffer_head *)page_private(page));     \
        })

How to fix this:

If a page is pinned by any of the get_user_pages() ("gup", here)
variants, then there is no need for that page to be on an LRU. So, this
patchset removes such pages from their LRU, thus leaving the page->lru
fields *mostly* available for tracking gup pages. (The lowest bit of
page->lru.next is used as PageTail, and these flags have to be checked
when we don't know if it really is a tail page or not, so avoid that
bit.)

After that, the page is reference-counted via page->dma_pinned_count,
and flagged via page->dma_pinned_flags. The PageDmaPinned flag is
cleared when the reference count hits zero, and the reference count is
only used when the flag is set.
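As an illustration only (this is not code from the series), the
flag-plus-count rule described above can be modeled in user space. The
struct and function names below are hypothetical stand-ins, not kernel
APIs. Locking is left out entirely, although the series adds a
zone_gup_lock for that purpose. Putting the page back on an LRU at the
last unpin is an assumption of this sketch, not something stated above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for struct page; not the kernel layout. */
struct model_page {
	bool on_lru;           /* page currently sits on an LRU list */
	bool dma_pinned;       /* models the PageDmaPinned flag */
	int dma_pinned_count;  /* only meaningful while dma_pinned is set */
};

/* Pin: the first pin takes the page off its LRU and sets the flag. */
static void model_pin(struct model_page *p)
{
	if (!p->dma_pinned) {
		p->on_lru = false;
		p->dma_pinned = true;
		p->dma_pinned_count = 0;
	}
	p->dma_pinned_count++;
}

/* Unpin: dropping the last pin clears the flag. */
static void model_unpin(struct model_page *p)
{
	assert(p->dma_pinned && p->dma_pinned_count > 0);
	if (--p->dma_pinned_count == 0) {
		p->dma_pinned = false;
		p->on_lru = true; /* assumption: page returns to an LRU */
	}
}
```

A caller would pair every model_pin() with exactly one model_unpin().
The flag stays set across nested pins and is cleared only when the last
one is dropped.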

All of the above provides a reliable PageDmaPinned flag, which is then
used to decide when to abort or wait for operations such as:

    try_to_unmap()
    page_mkclean()

In order to handle page_mkclean(), new information had to be plumbed
down from the filesystems, so that page_mkclean() can decide whether to
skip dma-pinned pages, or to wait for them.

Thanks to Matthew Wilcox for suggesting re-using page->lru fields for a
new refcount and flag, and to Jan Kara for explaining the rest of the
design details (how to deal with page_mkclean() and try_to_unmap(),
especially). Also thanks to Dan Williams for design advice, and for DAX,
long-term pinning, and page flag thoughts.

References:

[1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"

Changes since v1:

-- Use page->lru and full reference counting, instead of a single page
   flag.
-- Proper handling of page_mkclean().

John Hubbard (6):
  mm: get_user_pages: consolidate error handling
  mm: introduce page->dma_pinned_flags, _count
  mm: introduce zone_gup_lock, for dma-pinned pages
  mm/fs: add a sync_mode param for clear_page_dirty_for_io()
  mm: track gup pages with page->dma_pinned_* fields
  mm: page_mkclean, ttu: handle pinned pages

 drivers/video/fbdev/core/fb_defio.c |  3 +-
 fs/9p/vfs_addr.c                    |  2 +-
 fs/afs/write.c                      |  6 +-
 fs/btrfs/extent_io.c                | 14 ++---
 fs/btrfs/file.c                     |  2 +-
 fs/btrfs/free-space-cache.c         |  2 +-
 fs/btrfs/ioctl.c                    |  2 +-
 fs/ceph/addr.c                      |  4 +-
 fs/cifs/cifssmb.c                   |  3 +-
 fs/cifs/file.c                      |  5 +-
 fs/ext4/inode.c                     |  5 +-
 fs/f2fs/checkpoint.c                |  4 +-
 fs/f2fs/data.c                      |  2 +-
 fs/f2fs/dir.c                       |  2 +-
 fs/f2fs/gc.c                        |  4 +-
 fs/f2fs/inline.c                    |  2 +-
 fs/f2fs/node.c                      | 10 ++--
 fs/f2fs/segment.c                   |  3 +-
 fs/fuse/file.c                      |  2 +-
 fs/gfs2/aops.c                      |  2 +-
 fs/nfs/write.c                      |  2 +-
 fs/nilfs2/page.c                    |  2 +-
 fs/nilfs2/segment.c                 | 10 ++--
 fs/ubifs/file.c                     |  2 +-
 fs/xfs/xfs_aops.c                   |  2 +-
 include/linux/mm.h                  | 22 ++++++-
 include/linux/mm_types.h            | 22 +++++--
 include/linux/mmzone.h              |  7 +++
 include/linux/page-flags.h          | 50 ++++++++++++++++
 include/linux/rmap.h                |  4 +-
 mm/gup.c                            | 93 +++++++++++++++++++++++------
 mm/memcontrol.c                     |  7 +++
 mm/memory-failure.c                 |  3 +-
 mm/migrate.c                        |  2 +-
 mm/page-writeback.c                 | 14 +++--
 mm/page_alloc.c                     |  1 +
 mm/rmap.c                           | 71 ++++++++++++++++++++--
 mm/swap.c                           | 48 +++++++++++++++
 mm/truncate.c                       |  3 +-
 mm/vmscan.c                         |  2 +-
 40 files changed, 361 insertions(+), 85 deletions(-)

-- 
2.18.0