On Wed, Nov 27, 2013 at 07:33:43PM -0700, Matthew Wilcox wrote: > On Wed, Nov 27, 2013 at 11:19:32PM +0100, Jan Kara wrote: > > > The second is to put some kind of generation counter in the inode or > > > address_space that is incremented on every deallocation. The fault path > > > would read the generation counter before calling find_get_page() and then > > > check it hasn't changed after getting the page lock (goto retry_find). I > > > like that one a little more since we can fix it all in common code, and I > > > think it'll be lower overhead in the fault path. > > I'm not sure I understand how this should work. After pagecache is > > truncated, fault can come and happily instantiate the page again. After > > that fault is done, hole punching awakes and removes blocks from under that > > page. So checking the generation counter after you get the page lock seems > > useless to me. > > Yeah, I don't think the page lock is enough (maybe someone can convince > me otherwise). How does this look? Checking the generation number > after we get i_mutex ensures that the truncate has finished running. Ohh. We can't take i_mutex because that's an inversion with mmap_sem. So option 1 is to convince ourselves that page lock is enough (and that doesn't solve the problem for ext2-xip). Option 2 is to do something similar to seqcount_t. We can't use seqcount_t as-is because it spins in read_seqcount_begin, and it could be spinning for a very long time while we read a page in from disc! How about something like this: typedef struct seqtex { unsigned sequence; spinlock_t wait_lock; struct list_head wait_list; #ifdef CONFIG_DEBUG_LOCK_ALLOC struct lockdep_map dep_map; #endif } Take seqtex_read_lock() at the beginning of filemap_fault(), release it after we have the page locked (since truncate_inode_pages_range will block on the pagelock before tearing down the data structures). Take seqtex_write_lock() at the start of truncate & holepunch operations, drop it at the end of truncate & holepunch operations. seqtex_read_lock blocks if we're inside a write lock. seqtex_read_unlock tells us whether to try again. write_unlock wakes anybody waiting. write_lock doesn't block because we're assuming the i_mutex is held above us, so we can only have one writer at a time (similar to seqcount_t). I keep hoping for an option 3. -- Matthew Wilcox Intel Open Source Technology Centre "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html