Re: hole-punch vs fault

Jan Kara <jack@xxxxxxx> · Wed, 27 Nov 2013 23:19:32 +0100



  Hello,

On Wed 27-11-13 06:48:31, Matthew Wilcox wrote:
> I'm trying to figure out what protects us in this scenario:
> 
> Thread A mmaps a file and reads from it.
> Thread B punches holes in that file.
  Yeah, nothing. We discussed this problem with Dave Chinner here:
http://www.spinics.net/lists/linux-ext4/msg32059.html

> It seems to me that thread A can end up with references to pages that are
> in the page cache that point to deallocated blocks (which the filesystem
> might subsequently allocate to a different file).
> 
> Thread B calls ext4_punch_hole()
> Thread B takes ->i_mutex
> Thread B calls truncate_pagecache_range()
> Thread B calls unmap_mapping_range() which removes the PTE from Thread A's
>   page tables.
> Thread B calls truncate_inode_pages_range() which removes the pages from the
>   page cache.
> Thread A calls filemap_fault().  It calls find_get_page(), but it's not there.
>   It calls do_sync_mmap_readahead() then find_get_page() again, still not
>   there.  So it calls page_cache_read() which calls ext4_readpage which calls
>   mpage_readpage(), which calls ext4_get_block() which finds the blocks
>   in question, nowhere taking the i_mutex.
> Now Thread B gets back from its summer holiday, takes i_data_sem and calls
>   ext4_es_remove_extent() and ext4_ext_remove_space() /
>   ext4_free_hole_blocks() before releasing i_data_sem and i_mutex.
> Thread A happily dirties the page it has.
> Thread C extends a file, which gets the victim block allocated.
> vmscan causes the page to be written back, since it's dirty and on the LRU lists
> Bang, thread C has a corrupted file.
> 
> This doesn't happen with the normal truncate path since i_size is written
> before calling unmap_mapping_range, and i_size is checked after filemap_fault
> gets the page lock.
  Yup.
 
> How should we solve this?  I see two realistic options.  One is an rwsem
> in the inode or address_space taken for read by the fault path and for
> write by the holepunch path.  It's kind of icky because each filesystem
> will need to add support for it.
  This is doable. We actually want to go for range locks to scratch some
other itches we have in this area (see the above mentioned thread). But lock
ordering with mmap_sem gets interesting and requires changes to how
mmap_sem is used - which is likely the hardest-to-deal-with lock we currently
have in the kernel. For example I'm now getting to know video4linux drivers
more than I ever wanted to untangle mmap_sem from that code...

> The second is to put some kind of generation counter in the inode or
> address_space that is incremented on every deallocation.  The fault path
> would read the generation counter before calling find_get_page() and then
> check it hasn't changed after getting the page lock (goto retry_find).  I
> like that one a little more since we can fix it all in common code, and I
> think it'll be lower overhead in the fault path.
  I'm not sure I understand how this should work. After pagecache is
truncated, fault can come and happily instantiate the page again. After
that fault is done, hole punching awakes and removes blocks from under that
page. So checking the generation counter after you get the page lock seems
useless to me.

							Honza

-- 
Jan Kara <jack@xxxxxxx>
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html