On Tue, Sep 23, 2014 at 02:27:54PM +0200, Jan Kara wrote:
> Hi,
>
> On Wed 17-09-14 19:28:05, Dave Chinner wrote:
> > Brian, Eric and I have been tracking down a set of data corruption
> > problems on XFS over the past couple of days. The one that is
> > important to the wider developer community is the truncate/mmap
> > write issue that Eric isolated from a real-world application that
> > was triggering it.
> >
> > The corruption only affects block size smaller than page size
> > configurations and is caused by mmapped writes to the EOF page
> > which has been partially truncated. If we then extend the file
> > again, the region of the page that was truncated and had blocks
> > punched out of it can be written to via mapped writes without blocks
> > being allocated for the hole. Hence while the page is in the page
> > cache, the contents of the file look OK. Unmount/mount the
> > filesystem, then re-read the page from disk and it will contain
> > zeros because there is a hole rather than data blocks.
> Hum, this is what we already discussed in
> http://lists.openwall.net/linux-ext4/2014/03/13/23, isn't it? I never
> thought about using mremap() in the test cases. That makes it even a POSIX
> valid test case... Nasty.

Yup, and the test case came from an application rather than being
something that was thought up in a drunken rampage of random
syscalls....

> > In the XFS case, the bug was that the filesystem truncate code is
> > not cleaning the partial page fully during the truncate down or up,
> > and hence the pte remains mapped dirty in the TLB. Hence when new
> > data is written to the page, it doesn't trigger a write fault,
> > ->page_mkwrite is not called and hence blocks are not allocated over
> > the hole. I chose to fix it on the truncate up as it was the lesser
> > of two evils - we can't actually fix the problem entirely because we
> > can't serialise page faults against truncate.
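
For readers without the test case to hand, the sequence quoted above can be sketched roughly as below. This is an illustrative sketch only, not the reproducer from the thread: the helper name truncate_mmap_sequence(), the fill bytes and the truncate offsets are all made up, and it maps the whole page up front instead of growing the mapping with mremap(). It can only check what the page cache returns, so it is expected to pass even on an affected kernel; seeing the corruption needs a filesystem with block size smaller than page size and an unmount/mount before re-reading the file.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the thread): run the truncate/mmap write
 * sequence on `path` and check the result as seen through the page cache.
 * Returns 0 if every call succeeded and the data reads back, -1 otherwise.
 */
int truncate_mmap_sequence(const char *path)
{
    long psz = sysconf(_SC_PAGESIZE);
    size_t page, i;
    char *buf, *map;
    int fd, ret = -1;

    if (psz <= 0)
        return -1;
    page = (size_t)psz;

    buf = malloc(page);
    if (!buf)
        return -1;

    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) {
        free(buf);
        return -1;
    }

    /* 1. Write one full page so every block under it is allocated. */
    memset(buf, 0x5a, page);
    if (pwrite(fd, buf, page, 0) != (ssize_t)page)
        goto out;

    /* 2. Map the page shared and dirty it through the mapping. */
    map = mmap(NULL, page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
        goto out;
    memset(map, 0x58, page);

    /*
     * 3. Truncate down to a point inside the page, then back up to a
     *    full page. With block size < page size, the blocks past the
     *    new EOF are freed by the truncate down.
     */
    if (ftruncate(fd, (off_t)(page / 8)) != 0 ||
        ftruncate(fd, (off_t)page) != 0)
        goto out_unmap;

    /* 4. Mapped write into the region that was truncated away. */
    memset(map + page / 2, 0x59, page / 2);
    if (msync(map, page, MS_SYNC) != 0)
        goto out_unmap;

    /*
     * 5. Read back via the page cache. The data looks fine here either
     *    way; any loss shows up only after the filesystem is remounted.
     */
    if (pread(fd, buf, page, 0) != (ssize_t)page)
        goto out_unmap;
    ret = 0;
    for (i = page / 2; i < page; i++)
        if (buf[i] != 0x59)
            ret = -1;

out_unmap:
    munmap(map, page);
out:
    close(fd);
    free(buf);
    return ret;
}
```

To use it, call truncate_mmap_sequence() on a file on the filesystem under test from a small main(), then unmount, mount and compare the second half of the file against 0x59 bytes.
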
> Actually, as I mentioned in the above email, exactly the same problem
> happens when file gets extended because of a write beyond EOF (just change
> truncate up for pwrite in your test cases). You didn't handle that case in
> your XFS patch AFAICS.

That's because XFS already does tail block zeroing on truncate up
earlier in the truncate code (the xfs_zero_eof() call). The patch I
wrote simply stabilises the page so that a new page fault occurs and
remaps it correctly.

> > That is, if two filesystems that support block size smaller than
> > page size have similar data corruptions when exercising the same
> > generic code paths in similar ways, then it is likely that other
> > filesystems have similar problems and need to be checked.
> Frankly, I'd like to handle the problem in the generic code rather than
> having hacks in various filesystems. I have a patch back from 2009 which
> implements a helper function which gets called when creating a hole (either
> from ->setattr or ->write_end) and which handles this. It also has various
> optimizations built in - it doesn't do anything when blocksize == pagesize
> or when no hole block is actually created. Also it doesn't do any IO as you
> do in XFS - it only writeprotects the page. I'll port the patch and try it
> out with ext4.

I'm not sure exactly how that helps - I'll understand better when I
see the code ;)

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx