On Mon, Jan 05, 2015 at 11:13:28AM -0800, Sage Weil wrote:
> On Mon, 5 Jan 2015, Brian Foster wrote:
> > On Mon, Jan 05, 2015 at 10:34:57AM -0800, Sage Weil wrote:
> > > On Wed, 10 Dec 2014, Josef Bacik wrote:
> > > > On 12/10/2014 06:27 AM, Jan Kara wrote:
> > > > > On Mon 08-12-14 17:11:41, Josef Bacik wrote:
> > > > > > Hello,
> > > > > >
> > > > > > We have been doing pretty well at populating xfstests with loads
> > > > > > of tests to catch regressions and validate we're all working
> > > > > > properly. One thing that has been lacking is a good way to verify
> > > > > > file system integrity after a power fail. This is a core part of
> > > > > > what file systems are supposed to provide but it is probably the
> > > > > > least tested aspect. We have dm-flakey tests in xfstests to test
> > > > > > fsync correctness, but these tests do not catch the random
> > > > > > horrible things that can go wrong. We are still finding horrible
> > > > > > scary things that go wrong in Btrfs because it is simply hard to
> > > > > > reproduce and test for.
> > > > > >
> > > > > > I have been working on an idea to do this better, some may have
> > > > > > seen my dm-power-fail attempt, and I've got a new incarnation of
> > > > > > the idea thanks to discussions with Zach Brown. Obviously there
> > > > > > will be a lot changing in this area in the time between now and
> > > > > > March but it would be good to have everybody in the room talking
> > > > > > about what they would need to build a good and deterministic test
> > > > > > to make sure we're always giving a consistent file system and to
> > > > > > make sure our fsync() handling is working properly. Thanks,
> > > > >
> > > > > I agree we are lacking in testing this aspect. Just I don't see too
> > > > > much material for discussion there, unless we have something more
> > > > > tangible - when we have some implementation, we can talk about pros
> > > > > and cons of it, what still needs doing etc.
> > > >
> > > > Right, that's what I was getting at. I have a solution and have sent
> > > > it around, but there don't seem to be too many people interested in
> > > > commenting on it. I figure one of two things will happen:
> > > >
> > > > 1) My solution will go in before LSF, in which case YAY my job is
> > > > done and this is more of an [ATTEND] than a [TOPIC], or
> > > >
> > > > 2) My solution hasn't gone in yet and I'd like to discuss my
> > > > methodology and how we can integrate it into xfstests, future
> > > > features, other areas we could test, etc.
> > > >
> > > > Maybe not a full blown slot, but combined with an overall testing
> > > > slot, or hell, just a quick lightning talk. Thanks,
> > >
> > > I have a related topic that may make sense to fit into any discussion
> > > about this. Twice recently we've run into trouble using newish or less
> > > common (combinations of) syscalls.
> > >
> > > The first instance was with the use of sync_file_range to try to
> > > control/limit the amount of dirty data in the page cache. This,
> > > possibly in combination with posix_fadvise(DONTNEED), managed to break
> > > the writeback sequence in XFS and led to data corruption after power
> > > loss.
> >
> > Was there a report or any other details on this one? In particular, I'm
> > wondering if this is related to the problem exposed by xfstests test
> > xfs/053...
>
> This is the original thread:
>
> http://oss.sgi.com/archives/xfs/2013-06/msg00066.html

Thanks. It does look similar to xfs/053, the intent of which was to
indirectly create the kind of writeback pattern that exposes this.

> Looks like 053 is about ACLs though?

generic/053 does something with ACLs; xfs/053 is the test of interest.
Regardless, from the thread above it sounds like Dave had homed in on the
cause.
Brian

> sage

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html