Re: [Lsf-pc] [LSF/MM TOPIC] Working towards better power fail testing

Jan Kara <jack@xxxxxxx> · Mon, 5 Jan 2015 22:17:10 +0100

On Mon 05-01-15 10:34:57, Sage Weil wrote:
> On Wed, 10 Dec 2014, Josef Bacik wrote:
> > On 12/10/2014 06:27 AM, Jan Kara wrote:
> > > On Mon 08-12-14 17:11:41, Josef Bacik wrote:
> > > > Hello,
> > > > 
> > > > We have been doing pretty well at populating xfstests with loads of
> > > > tests to catch regressions and validate we're all working properly.
> > > > One thing that has been lacking is a good way to verify file system
> > > > integrity after a power fail.  This is a core part of what file
> > > > systems are supposed to provide but it is probably the least tested
> > > > aspect.  We have dm-flakey tests in xfstests to test fsync
> > > > correctness, but these tests do not catch the random horrible things
> > > > that can go wrong.  We are still finding horrible scary things that
> > > > go wrong in Btrfs because it is simply hard to reproduce and test
> > > > for.
> > > > 
> > > > I have been working on an idea to do this better, some may have seen
> > > > my dm-power-fail attempt, and I've got a new incarnation of the idea
> > > > thanks to discussions with Zach Brown.  Obviously there will be a
> > > > lot changing in this area in the time between now and March but it
> > > > would be good to have everybody in the room talking about what they
> > > > would need to build a good and deterministic test to make sure we're
> > > > always giving a consistent file system and to make sure our fsync()
> > > > handling is working properly.  Thanks,
> > >    I agree we are lacking in testing this aspect. I just don't see much
> > > material for discussion there unless we have something more tangible.
> > > Once we have some implementation, we can talk about its pros and cons,
> > > what still needs doing, etc.
> > > 
> > 
> > Right, that's what I was getting at.  I have a solution and have sent it
> > around, but there don't seem to be many people interested in commenting on
> > it.  I figure one of two things will happen:
> > 
> > 1) My solution will go in before LSF, in which case YAY my job is done and
> > this is more of an [ATTEND] than a [TOPIC], or
> > 
> > 2) My solution hasn't gone in yet and I'd like to discuss my methodology and
> > how we can integrate it into xfstests, future features, other areas we could
> > test etc.
> > 
> > Maybe not a full-blown slot but combined with an overall testing slot or hell
> > just a quick lightning talk.  Thanks,
> 
> I have a related topic that may make sense to fit into any discussion 
> about this. Twice recently we've run into trouble using newish or less 
> common (combinations of) syscalls.
> 
> The first instance was with the use of sync_file_range to try to 
> control/limit the amount of dirty data in the page cache.  This, possibly 
> in combination with posix_fadvise(DONTNEED), managed to break the 
> writeback sequence in XFS and led to data corruption after power loss.
> 
> The other issue we saw was just a general raft of FIEMAP bugs over the 
> last year or two. We saw cases where even after fsync a fiemap result 
> would not include all extents, and (not unexpectedly) lots of corner cases 
> in several file systems, e.g., around partial blocks at end of file.  (As 
> far as I know everything we saw is resolved in current kernels.)
> 
> I'm not so concerned with these specific bugs, but worried that we 
> (perhaps naively) expected them to be pretty safe.  Perhaps for FIEMAP 
> this is a general case where a newish syscall/ioctl should be tested 
> carefully with our workloads before being relied upon, and we could have 
> worked to make sure e.g. xfstests has appropriate tests.  For power fail 
> testing in particular, though, right now it isn't clear who is testing 
> what under what workloads, so the only really "safe" approach is to stick 
> to whatever syscall combinations we think the rest of the world is using, 
> or make sure we test ourselves.
  So I think we are getting better at providing testcases for new APIs than
we used to be.  I also think fs maintainers are aware of the need to create
xfstests tests whenever a new API is introduced. So I don't think we can
do much more than write more tests :)
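
For reference, the kind of syscall combination described above (using
sync_file_range to limit dirty page cache data, together with
posix_fadvise(DONTNEED)) can be sketched roughly as follows. This is only
an illustration under assumptions: the helper names write_bounded_dirty()
and file_size(), the 64 KiB chunk limit, and the one-chunk-behind policy
are all hypothetical and are not taken from any workload mentioned in this
thread. Note that sync_file_range does not write out file metadata or
flush the device cache, so the sketch still ends with fsync().

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write nchunks chunks of chunk_size bytes to path
 * while keeping roughly one chunk of dirty data in the page cache.
 * Returns 0 on success, -1 on error. */
int write_bounded_dirty(const char *path, size_t nchunks, size_t chunk_size)
{
    char buf[65536];
    int fd;

    assert(chunk_size > 0 && chunk_size <= sizeof(buf));
    memset(buf, 0xab, chunk_size);

    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    for (size_t i = 0; i < nchunks; i++) {
        off_t off = (off_t)(i * chunk_size);
        size_t done = 0;

        /* Write the whole chunk, handling short writes. */
        while (done < chunk_size) {
            ssize_t n = pwrite(fd, buf + done, chunk_size - done,
                               off + (off_t)done);
            if (n <= 0)
                goto fail;
            done += (size_t)n;
        }

        /* Start asynchronous writeback of the chunk just written. */
        if (sync_file_range(fd, off, (off_t)chunk_size,
                            SYNC_FILE_RANGE_WRITE) < 0)
            goto fail;

        if (i > 0) {
            off_t prev = off - (off_t)chunk_size;

            /* Wait for the previous chunk to reach the device queue... */
            if (sync_file_range(fd, prev, (off_t)chunk_size,
                                SYNC_FILE_RANGE_WAIT_BEFORE |
                                SYNC_FILE_RANGE_WRITE |
                                SYNC_FILE_RANGE_WAIT_AFTER) < 0)
                goto fail;
            /* ...then ask the kernel to drop its now-clean pages.
             * posix_fadvise returns an error number, not -1. */
            if (posix_fadvise(fd, prev, (off_t)chunk_size,
                              POSIX_FADV_DONTNEED) != 0)
                goto fail;
        }
    }

    /* sync_file_range gives no durability guarantee on its own. */
    if (fsync(fd) < 0)
        goto fail;
    return close(fd);

fail:
    close(fd);
    return -1;
}

/* Hypothetical helper: size of the file at path, or -1 on error. */
off_t file_size(const char *path)
{
    struct stat st;

    return stat(path, &st) == 0 ? st.st_size : (off_t)-1;
}
```

A power fail test for this pattern would run such a writer, cut power (or
drop writes) at an arbitrary point, and then check that whatever was
covered by a completed fsync() is intact after remount.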

As Josef and you correctly wrote, powerfail testing is one area where our
coverage is rather poor. Another area that comes to mind is testing under
memory pressure (which is doable using the error injection framework, I
just don't think anybody has put the necessary effort into actually
running that).

So we can probably talk about areas that need improving and what needs
doing there, but we also need people to actually do the work...

								Honza
-- 
Jan Kara <jack@xxxxxxx>
SUSE Labs, CR