On Thu 21-06-18 18:30:36, Jan Kara wrote: > On Wed 20-06-18 15:55:41, John Hubbard wrote: > > On 06/20/2018 05:08 AM, Jan Kara wrote: > > > On Tue 19-06-18 11:11:48, John Hubbard wrote: > > >> On 06/19/2018 03:41 AM, Jan Kara wrote: > > >>> On Tue 19-06-18 02:02:55, Matthew Wilcox wrote: > > >>>> On Tue, Jun 19, 2018 at 10:29:49AM +0200, Jan Kara wrote: > > [...] > > >>> I'm also still pondering the idea of inserting a "virtual" VMA into vma > > >>> interval tree in the inode - as the GUP references are IMHO closest to an > > >>> mlocked mapping - and that would achieve all the functionality we need as > > >>> well. I just didn't have time to experiment with it. > > >> > > >> How would this work? Would it have the same virtual address range? And how > > >> does it avoid the problems we've been discussing? Sorry to be a bit slow > > >> here. :) > > > > > > The range covered by the virtual mapping would be the one sent to > > > get_user_pages() to get page references. And then we would need to teach > > > page_mkclean() to check for these virtual VMAs and block / skip / report > > > (different situations would need different behavior) such page. But this > > > second part is the same regardless how we identify a page that is pinned by > > > get_user_pages(). > > > > > > OK. That neatly avoids the need a new page flag, I think. But of course it is > > somewhat more extensive to implement. Sounds like something to keep in mind, > > in case it has better tradeoffs than the direction I'm heading so far. > > Yes, the changes needed are somewhat more intrusive. I'm looking into this > approach now to see how the result will look like... I've spent some time on this. There are two obstacles with my approach of putting special entry into inode's VMA tree: 1) If I want to place this special entry in inode's VMA tree, I either need to allocate full VMA, somehow initiate it so that it's clear it's a special "pinned" range, not a VMA => uses unnecessarily too much memory, it is ugly. Another solution I was hoping for was that I would factor out some common bits of vm_area_struct (pgoff, rb_node, ..) into a structure common for VMA and the locked range => doable but causes a lot of churn as VMAs are accessed (and modified!) at hundreds of places in the kernel. Some accessor functions would help to reduce the churn a bit but then stuff like vma_set_pgoff(vma, pgoff) isn't exactly beautiful either. 2) Some users of GUP (e.g. direct IO) get a block of pages and then put references to these pages at different times and in random order - basically when IO for given page is completed, reference is dropped and one GUP call can acquire page references for pages which end up in multiple different bios (we don't know in advance). This makes is difficult to implement counterpart to GUP to 'unpin' a range of pages - we'd either have to support partial unpins (and splitting of pinned ranges and all such fun) or just have to track internally in how many pages are still pinned in the originally pinned range and release the pin once all individual pages are unpinned but then it's difficult to e.g. get to this internal structure from IO completion callback where we only have the bio. So I think the Matthew's idea of removing pinned pages from LRU is definitely worth trying to see how complex that would end up being. Did you get to looking into it? If not, I can probably find some time to try that out. Honza -- Jan Kara <jack@xxxxxxxx> SUSE Labs, CR