On Mon, Jan 05, 2015 at 10:34:57AM -0800, Sage Weil wrote:
> On Wed, 10 Dec 2014, Josef Bacik wrote:
> > On 12/10/2014 06:27 AM, Jan Kara wrote:
> The first instance was with the use of sync_file_range to try to
> control/limit the amount of dirty data in the page cache. This,
> possibly in combination with posix_fadvise(DONTNEED), managed to break
> the writeback sequence in XFS and led to data corruption after power
> loss.

Corruption after power loss is not brilliant behaviour from XFS here,
but I'll point out for the wider audience that sync_file_range()
provides absolutely no data integrity guarantees for power loss
situations. It's really, really badly named because it doesn't give the
same guarantees as the other "sync" functions filesystems provide.

IOWs, if you value your data, the only interface you can rely on for
data integrity is fsync/fdatasync...

> The other issue we saw was just a general raft of FIEMAP bugs over the
> last year or two. We saw cases where even after fsync a fiemap result
> would not include all extents, and (not unexpectedly) lots of corner
> cases in several file systems, e.g., around partial blocks at end of
> file. (As far as I know everything we saw is resolved in current
> kernels.)

Again, this is probably more a misunderstanding of FIEMAP than
anything. FIEMAP is *advisory* and gives no output accuracy guarantees,
as userspace cannot prevent the extent maps from changing at any time.
As an example, see the aborted attempt by the 'cp' utility to use
FIEMAP to detect holes when copying sparse files....

> I'm not so concerned with these specific bugs, but worried that we
> (perhaps naively) expected them to be pretty safe. Perhaps for FIEMAP
> this is a general case where a newish syscall/ioctl should be tested
> carefully with our workloads before being relied upon, and we could
> have worked to make sure e.g. xfstests has appropriate tests.

Oh, it does - that's why it mostly works now across all filesystems
that are regularly tested with xfstests.

> For power fail testing in particular, though, right now it isn't clear
> who is testing what under what workloads, so the only really "safe"
> approach is to stick to whatever syscall combinations we think the
> rest of the world is using, or make sure we test ourselves.

Write tests for the regression test suite that filesystem developers
run all the time. ;)

> As things stand now the other devs are loathe to touch any remotely
> exotic fs call, but that hardly seems ideal. Hopefully a common
> framework for powerfail testing can improve on this. Perhaps there are
> other ways we make it easier to tell what is (well) tested, and
> conversely ensure that those tests are well-aligned with what real
> users are doing...

We don't actually need power failure (or even device failure)
infrastructure to test data integrity on failure. Filesystems just need
a shutdown method that stops any IO from being issued once the shutdown
flag is set. XFS has this, and it's used by xfstests via the "godown"
utility to shut the filesystem down in various circumstances. We've
been using this for data integrity and log recovery testing in xfstests
for many years.

Hence we know that if the device behaves correctly w.r.t. cache flushes
and FUA, then the filesystem will behave correctly on power loss. We
don't need a device power fail simulator to tell us that violating
fundamental architectural assumptions will corrupt filesystems....
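To put the fsync/fdatasync point in concrete terms (an untested
illustration, not part of the original mail; the path names are made
up): sync_file_range() may kick off writeback, but only fsync/fdatasync
guarantee that the data and the metadata needed to reach it are on
stable storage, device cache flush included, by the time the call
returns.

/*
 * Untested sketch: durable write of a newly created file.  The path
 * names are placeholders and error handling is abbreviated.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static void die(const char *msg) { perror(msg); exit(1); }

int main(void)
{
        const char *buf = "important data\n";

        int fd = open("/mnt/test/file.tmp",
                      O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
                die("open");

        if (write(fd, buf, strlen(buf)) != (ssize_t)strlen(buf))
                die("write");

        /* The integrity barrier: force the data (and whatever metadata
         * is needed to find it again) to stable storage, device cache
         * flush included.  sync_file_range() does none of this. */
        if (fdatasync(fd) < 0)
                die("fdatasync");
        close(fd);

        /* A newly created or renamed file also needs its directory
         * entry made durable, so fsync the parent directory as well. */
        int dirfd = open("/mnt/test", O_RDONLY | O_DIRECTORY);
        if (dirfd < 0)
                die("open dir");
        if (fsync(dirfd) < 0)
                die("fsync dir");
        close(dirfd);

        return 0;
}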
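The same goes for the FIEMAP point above: the call itself is just an
ioctl, and the snapshot it returns can be stale the instant you look at
it. A rough illustration of querying it (again an untested sketch, not
from the original mail):

/*
 * Untested sketch: dump a file's extent map via FS_IOC_FIEMAP.  The
 * result is advisory only -- the filesystem is free to move, merge or
 * allocate extents at any time after the ioctl returns.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

int main(int argc, char **argv)
{
        if (argc != 2) {
                fprintf(stderr, "usage: %s <file>\n", argv[0]);
                return 1;
        }

        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        /* Room for 32 extents in a single call; real tools loop until
         * they see FIEMAP_EXTENT_LAST. */
        unsigned int count = 32;
        struct fiemap *fm = calloc(1, sizeof(*fm) +
                                   count * sizeof(struct fiemap_extent));
        if (!fm)
                return 1;

        fm->fm_start = 0;
        fm->fm_length = FIEMAP_MAX_OFFSET;      /* map the whole file */
        fm->fm_flags = FIEMAP_FLAG_SYNC;        /* write dirty data first */
        fm->fm_extent_count = count;

        if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
                perror("FS_IOC_FIEMAP");
                return 1;
        }

        for (unsigned int i = 0; i < fm->fm_mapped_extents; i++) {
                struct fiemap_extent *fe = &fm->fm_extents[i];
                printf("logical %llu physical %llu length %llu flags 0x%x\n",
                       (unsigned long long)fe->fe_logical,
                       (unsigned long long)fe->fe_physical,
                       (unsigned long long)fe->fe_length,
                       fe->fe_flags);
        }

        free(fm);
        close(fd);
        return 0;
}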
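And the shutdown trap mentioned above is nothing exotic from userspace
either; xfstests' godown boils down to roughly this (a simplified,
untested sketch, assuming the XFS_IOC_GOINGDOWN definitions from
xfsprogs' <xfs/xfs_fs.h> are installed):

/*
 * Untested sketch of a godown-style shutdown trigger.  XFS_IOC_GOINGDOWN
 * and the flag values come from xfsprogs' <xfs/xfs_fs.h>; the fd can be
 * any open file or directory on the filesystem being shut down.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <xfs/xfs_fs.h>

int main(int argc, char **argv)
{
        if (argc != 2) {
                fprintf(stderr, "usage: %s <mountpoint>\n", argv[0]);
                return 1;
        }

        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        /* LOGFLUSH forces the log to disk and then fails all further IO,
         * which looks like a power cut to everything above the
         * filesystem.  NOLOGFLUSH skips the log force for a nastier
         * recovery test; DEFAULT flushes dirty data as well before
         * shutting down. */
        uint32_t flags = XFS_FSOP_GOING_FLAGS_LOGFLUSH;
        if (ioctl(fd, XFS_IOC_GOINGDOWN, &flags) < 0) {
                perror("XFS_IOC_GOINGDOWN");
                return 1;
        }

        close(fd);
        return 0;
}

After that, unmounting and remounting runs log recovery, and the test
can check what actually made it to disk.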
Unfortunately, nobody else seems to want to implement shutdown traps,
even though they massively improve reliability by enabling extensive
error path testing that can't otherwise be easily exercised...

So, from an xfstests perspective, I'd much prefer to see
XFS_IOC_GOINGDOWN implemented by other filesystems and have the tests
that use it made generic first. Filesystems need to handle themselves
correctly in simple error conditions before we even start to consider
creating esoteric failure conditions...

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx