Re: mmap writes vs truncate causing data corruption

Jan Kara <jack@xxxxxxx> · Tue, 23 Sep 2014 14:27:54 +0200



  Hi,

On Wed 17-09-14 19:28:05, Dave Chinner wrote:
> Brian, Eric and I have been tracking down a set of data corruption
> problems on XFS over the past couple of days. The one that is
> important to the wider developer community is the truncate/mmap
> write issue that Eric isolated from a real-world application that
> was triggering it.
> 
> The corruption only affects block size smaller than page size
> configurations and is caused by mmapped writes to the EOF page
> which has been partially truncated. If we then extend the file
> again, the region of the page that was truncated and had blocks
> punched out of it can be written to via mapped writes without blocks
> being allocated for the hole. Hence while the page is in the page
> cache, the contents of the file look OK. Unmount/mount the
> filesystem, then re-read the page from disk and it will contain
> zeros because there is a hole rather than data blocks.
  Hum, this is what we already discussed in
http://lists.openwall.net/linux-ext4/2014/03/13/23, isn't it? I never
thought about using mremap() in the test cases. That even makes it a
POSIX-valid test case... Nasty.

> In the XFS case, the bug was that the filesystem truncate code is
> not cleaning the partial page fully during the truncate down or up,
> and hence the pte remains mapped dirty in the TLB. Hence when new
> data is written to the page, it doesn't trigger a write fault,
> ->page_mkwrite is not called and hence blocks are not allocated over
> the hole. I chose to fix it on the truncate up as it was the lesser
> of two evils - we can't actually fix the problem entirely because we
> can't serialise page faults against truncate.
  Actually, as I mentioned in the above email, exactly the same problem
happens when the file gets extended because of a write beyond EOF (just
replace the truncate up with a pwrite in your test cases). You didn't
handle that case in your XFS patch AFAICS.
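
  For reference, the sequence under discussion looks roughly like the
sketch below when driven from user space. This is only an illustration and
not Eric's actual test program: the function name, the enum and the offsets
(new EOF at a quarter of the page, mapped write at half of the page) are
made up. Actually seeing the corruption additionally needs a filesystem
with blocksize < pagesize and a re-read from disk (e.g. after
unmount/mount), which the sketch does not attempt. It only checks that the
data is visible through the page cache.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical names, for illustration only. */
enum extend_how { EXTEND_TRUNCATE, EXTEND_PWRITE };

/*
 * Dirty the EOF page through a shared mapping, truncate the file down
 * inside that page, extend it again and store into the former hole
 * through the still-mapped page. Returns 0 if every call succeeded and
 * the byte reads back from the page cache, -1 otherwise.
 */
int mwrite_after_extend(const char *path, enum extend_how how)
{
	long page = sysconf(_SC_PAGESIZE);
	off_t small = page / 4;		/* assumed new EOF inside the page */
	off_t hole_off = page / 2;	/* assumed offset of the mapped write */
	char *map, c = 0;
	int fd, ret = -1;

	assert(page > 0 && small < hole_off && hole_off < page - 1);

	fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, page) < 0)
		goto out_close;
	map = mmap(NULL, page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		goto out_close;

	/* Fill the page and flush it so the whole page is backed on disk. */
	memset(map, 'a', page);
	if (msync(map, page, MS_SYNC) < 0)
		goto out_unmap;
	/* Store again so the mapping is writable and dirty at truncate time. */
	map[0] = 'b';

	/* Truncate down inside the page: the tail of the page becomes a hole. */
	if (ftruncate(fd, small) < 0)
		goto out_unmap;

	/* Extend the file again, either by truncate up or by a write past EOF. */
	if (how == EXTEND_TRUNCATE) {
		if (ftruncate(fd, page) < 0)
			goto out_unmap;
	} else {
		if (pwrite(fd, "x", 1, page - 1) != 1)
			goto out_unmap;
	}

	/* Mapped store into the region that was truncated away. */
	map[hole_off] = 'c';
	if (msync(map, page, MS_SYNC) < 0)
		goto out_unmap;

	/* The page cache shows the data regardless of what is on disk. */
	if (pread(fd, &c, 1, hole_off) == 1 && c == 'c')
		ret = 0;
out_unmap:
	munmap(map, page);
out_close:
	close(fd);
	return ret;
}
```

Calling mwrite_after_extend(path, EXTEND_TRUNCATE) exercises the truncate
up case and mwrite_after_extend(path, EXTEND_PWRITE) the write beyond EOF
case. The byte at hole_off has to be compared again after a remount to
tell whether it really reached the disk.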

> Initially I couldn't reproduce the data corruptions on ext4, but
> Eric came to my rescue and provided me with an updated mremap test
> that triggered corruptions. I also added another variant to the
> plain truncate/mwrite test and so now that test also reliably
> produces data corruptions on ext4. I suspect the ext4 issue is
> similar to the XFS case (i.e. no page_mkwrite call), but I can't
> follow the ext4 code with any level of cluefulness....
  I'm surprised ext4 is vulnerable. When I was checking a few years back
(2009 or so) it was not, because if we found dirty buffers not marked
delalloc we just bit the bullet and tried allocating blocks. But probably
this got broken... checking the code... yeah, I broke that when rewriting
the ext4 writeback path :-|

> That is, if two filesystems that support block size smaller than
> page size have similar data corruptions when exercising the same
> generic code paths in similar ways, then it is likely that other
> filesystems have similar problems and need to be checked.
  Frankly, I'd like to handle the problem in the generic code rather than
having hacks in various filesystems. I have a patch from back in 2009 which
implements a helper function which gets called when creating a hole (either
from ->setattr or ->write_end) and which handles this. It also has various
optimizations built in - it doesn't do anything when blocksize == pagesize
or when no hole block is actually created. Also it doesn't do any IO as you
do in XFS - it only write-protects the page. I'll port the patch and try it
out with ext4.
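
  To make the two optimizations concrete, the check could look something
like the user-space sketch below. This is my reading of the conditions
described above, not the 2009 patch itself. The function name and
parameters are hypothetical, and both sizes are assumed to be powers of
two. A real helper would go on to look up the page containing the old EOF
and write-protect it, which is left out here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical predicate: does extending the file from size 'from' to
 * size 'to' create a hole block inside the page that contains the old
 * EOF, so that this page needs to be write-protected?
 */
bool extension_needs_writeprotect(uint64_t from, uint64_t to,
				  uint64_t bsize, uint64_t psize)
{
	uint64_t rounded_from;

	/* Assumption: power-of-two sizes with blocksize <= pagesize. */
	assert(bsize && (bsize & (bsize - 1)) == 0);
	assert(psize && (psize & (psize - 1)) == 0 && bsize <= psize);

	/* Not an extension, or one block per page: nothing to do. */
	if (from >= to || bsize == psize)
		return false;

	/* End of the block that contains the old EOF. */
	rounded_from = (from + bsize - 1) & ~(bsize - 1);

	/*
	 * No hole block in the old EOF page if the new size stays within
	 * the old EOF block, or if that block is the last one in its page.
	 */
	if (to <= rounded_from || (rounded_from & (psize - 1)) == 0)
		return false;

	return true;
}
```

With 1k blocks and 4k pages, extending from 1500 to 4096 returns true
(the blocks after offset 2048 in the page become holes), while extending
from 3500 to 8192 returns false because the old EOF already sits in the
last block of its page.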

								Honza

-- 
Jan Kara <jack@xxxxxxx>
SUSE Labs, CR