mmap writes vs truncate causing data corruption

Dave Chinner <david@xxxxxxxxxxxxx> · Wed, 17 Sep 2014 19:28:05 +1000

Hi folks,

Brian, Eric and I have been tracking down a set of data corruption
problems on XFS over the past couple of days. The one that is
important to the wider developer community is the truncate/mmap
write issue that Eric isolated from a real-world application that
was triggering it.

The corruption only affects configurations where the block size is
smaller than the page size, and is caused by mmapped writes to an EOF
page that has been partially truncated. If we then extend the file
again, the region of the page that was truncated and had blocks
punched out of it can be written to via mapped writes without blocks
being allocated for the hole. Hence while the page is in the page
cache, the contents of the file look OK. Unmount and remount the
filesystem, then re-read the page from disk and it will contain
zeros because there is a hole rather than data blocks.
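
For anyone who wants to see the sequence of operations spelled out,
here is a rough userspace sketch in Python. It is not one of the
xfstests reproducers; the function name and the choice of a quarter
page as a stand-in for the block size are assumptions made for
illustration. It only walks through the truncate down, truncate up and
mapped write steps, then reads the file back.

```python
import mmap
import os
import tempfile


def mmap_truncate_sequence(directory=None):
    """Run the truncate-down / truncate-up / mmap-write sequence on a
    scratch file and return (expected, actual) file contents."""
    page = mmap.PAGESIZE
    block = page // 4  # assumed stand-in for a block size < page size

    fd, path = tempfile.mkstemp(dir=directory)
    try:
        # Start with one full page of data so every block is allocated.
        os.write(fd, b"a" * page)
        os.fsync(fd)

        mm = mmap.mmap(fd, page, mmap.MAP_SHARED,
                       mmap.PROT_READ | mmap.PROT_WRITE)
        try:
            # Dirty the page through the mapping so it has a writable pte.
            mm[0:block] = b"a" * block

            # Truncate down into the middle of the page, then extend again.
            os.ftruncate(fd, block)
            os.ftruncate(fd, page)

            # Mapped write into the region that was truncated away.
            mm[2 * block:3 * block] = b"b" * block
            mm.flush()  # msync() the mapping
        finally:
            mm.close()

        os.fsync(fd)
        actual = os.pread(fd, page, 0)
    finally:
        os.close(fd)
        os.unlink(path)

    expected = (b"a" * block + b"\0" * block +
                b"b" * block + b"\0" * block)
    return expected, actual
```

Note that the read-back here is served from the page cache, so it will
look correct even on an affected kernel. To actually observe the
corruption the filesystem has to be unmounted and mounted again before
the file is re-read, which this sketch does not attempt.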

In the XFS case, the bug was that the filesystem truncate code was
not cleaning the partial page fully during the truncate down or up,
and hence the pte remained mapped dirty in the TLB. Hence when new
data is written to the page, it doesn't trigger a write fault,
->page_mkwrite is not called and hence blocks are not allocated over
the hole. I chose to fix it on the truncate up as it was the lesser
of two evils - we can't actually fix the problem entirely because we
can't serialise page faults against truncate.
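
The reasoning above can be illustrated with a toy model. This is
plain Python, not kernel code, and every name in it (ToyPage,
truncate_down_then_up, clean_pte and so on) is hypothetical. It only
mimics the chain "pte still writable, so no write fault, so no
page_mkwrite, so no block allocation" and compares it with the case
where the truncate cleans the pte.

```python
class ToyPage:
    """Toy model of one page-cache page backed by several blocks."""

    def __init__(self, nblocks=4):
        self.cache = ["data"] * nblocks   # what readers see while cached
        self.on_disk = [True] * nblocks   # True = block allocated
        self.pte_writable = True          # mapped dirty, writes do not fault
        self.mkwrite_calls = 0

    def page_mkwrite(self):
        # Stand-in for ->page_mkwrite: allocate blocks over any holes.
        self.mkwrite_calls += 1
        self.on_disk = [True] * len(self.on_disk)
        self.pte_writable = True

    def truncate_down_then_up(self, keep_blocks, clean_pte):
        # Punch out the blocks past the new EOF and zero that part of
        # the cached page. Extending the file again leaves the holes.
        for i in range(keep_blocks, len(self.cache)):
            self.cache[i] = "zero"
            self.on_disk[i] = False
        if clean_pte:
            # Fixed behaviour: the next mapped write must fault.
            self.pte_writable = False

    def mapped_write(self, index):
        if not self.pte_writable:
            self.page_mkwrite()           # write fault path
        self.cache[index] = "new"

    def read_after_remount(self, index):
        # A hole reads back as zeros once the page cache is gone.
        return self.cache[index] if self.on_disk[index] else "zero"


def run(clean_pte):
    page = ToyPage()
    page.truncate_down_then_up(keep_blocks=1, clean_pte=clean_pte)
    page.mapped_write(2)
    return page.cache[2], page.read_after_remount(2)
```

In this model run(False) gives ("new", "zero"), i.e. the data looks
fine while cached but is lost after a remount, and run(True) gives
("new", "new") because the write fault allocates the block first.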

Initially I couldn't reproduce the data corruptions on ext4, but
Eric came to my rescue and provided me with an updated mremap test
that triggered corruptions. I also added another variant to the
plain truncate/mwrite test and so now that test also reliably
produces data corruptions on ext4. I suspect the ext4 issue is
similar to the XFS case (i.e. no page_mkwrite call), but I can't
follow the ext4 code with any level of cluefulness....

And so: practise what I preach and post a heads-up to -fsdevel.

That is, if two filesystems that support block sizes smaller than
the page size have similar data corruptions when exercising the same
generic code paths in similar ways, then it is likely that other
filesystems have similar problems and need to be checked.
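
As a quick way to tell whether a given mount is in the affected
configuration, something like the following Python sketch may help.
The helper name is made up, and it assumes that f_bsize from statvfs
reflects the filesystem block size, which is true for common Linux
filesystems but is not guaranteed in general.

```python
import mmap
import os


def block_smaller_than_page(path="."):
    """Return True if the filesystem at `path` reports a block size
    smaller than the system page size."""
    # Assumption: f_bsize matches the filesystem block size. This holds
    # for common Linux filesystems but is not guaranteed everywhere.
    block_size = os.statvfs(path).f_bsize
    return block_size < mmap.PAGESIZE
```

Calling block_smaller_than_page("/mnt/scratch") (the mount point is
only an example) returns True when the block size is smaller than the
page size, which is the case the reproducers are aimed at.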

While the tests I packaged for xfstests are not yet reviewed, they
do work and expose the corruptions on both XFS and ext4. Hence I've
pushed them to a git tree branch so that everyone can test their
filesystems against the reproducers. The tests in question are
generic/029 and generic/030 and can be found here:

git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git mmap-truncate

FWIW, any filesystem that supports FALLOC_FL_COLLAPSE_RANGE should
also have generic/031 run against it. This is the test case that
Brian isolated from an fsx failure that exposed a different partial
page truncation data corruption issue in XFS with block size smaller
than page size. However, it's a similar situation with ext4: the
exact same underlying partial page writeback bug was found in ext4
back in May and fixed in 3.16....

Most importantly, all the credit must go to Eric and Brian for doing
the hard work of turning application failures into simple,
reproducible test cases.  Finding bugs is easy when you are provided
with a 100% reliable reproducer and a bunch of analysis about where
the bug most likely lies. :)

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx