Re: hole-punch vs fault

Matthew Wilcox <matthew@xxxxxx> · Wed, 27 Nov 2013 21:44:54 -0700

On Wed, Nov 27, 2013 at 07:33:43PM -0700, Matthew Wilcox wrote:
> On Wed, Nov 27, 2013 at 11:19:32PM +0100, Jan Kara wrote:
> > > The second is to put some kind of generation counter in the inode or
> > > address_space that is incremented on every deallocation.  The fault path
> > > would read the generation counter before calling find_get_page() and then
> > > check it hasn't changed after getting the page lock (goto retry_find).  I
> > > like that one a little more since we can fix it all in common code, and I
> > > think it'll be lower overhead in the fault path.
> >   I'm not sure I understand how this should work. After pagecache is
> > truncated, fault can come and happily instantiate the page again. After
> > that fault is done, hole punching awakes and removes blocks from under that
> > page. So checking the generation counter after you get the page lock seems
> > useless to me.
> 
> Yeah, I don't think the page lock is enough (maybe someone can convince
> me otherwise).  How does this look?  Checking the generation number
> after we get i_mutex ensures that the truncate has finished running.

Ohh.  We can't take i_mutex because that's an inversion with mmap_sem.
So option 1 is to convince ourselves that page lock is enough (and that
doesn't solve the problem for ext2-xip).

Option 2 is to do something similar to seqcount_t.  We can't use
seqcount_t as-is because it spins in read_seqcount_begin, and it could
be spinning for a very long time while we read a page in from disc!

How about something like this:

typedef struct seqtex {
	unsigned		sequence;
	spinlock_t		wait_lock;
	struct list_head	wait_list;
#ifdef CONFIG_DEBUG_LOCK_ALLOC
	struct lockdep_map	dep_map;
#endif
}

Take seqtex_read_lock() at the beginning of filemap_fault(), release
it after we have the page locked (since truncate_inode_pages_range will
block on the pagelock before tearing down the data structures).

Take seqtex_write_lock() at the start of truncate & holepunch operations,
drop it at the end of truncate & holepunch operations.

seqtex_read_lock blocks if we're inside a write lock.  seqtex_read_unlock
tells us whether to try again.  write_unlock wakes anybody waiting.
write_lock doesn't block because we're assuming the i_mutex is held above
us, so we can only have one writer at a time (similar to seqcount_t).

I keep hoping for an option 3.

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html