Re: [PATCH v4 15/20] ext4: use ext4_zero_partial_blocks in punch_hole

Lukáš Czerner <lczerner@xxxxxxxxxx> · Mon, 17 Jun 2013 14:46:29 +0200 (CEST)

On Mon, 17 Jun 2013, Theodore Ts'o wrote:

> Date: Mon, 17 Jun 2013 08:25:18 -0400
> From: Theodore Ts'o <tytso@xxxxxxx>
> To: Lukáš Czerner <lczerner@xxxxxxxxxx>
> Cc: linux-ext4@xxxxxxxxxxxxxxx
> Subject: Re: [PATCH v4 15/20] ext4: use ext4_zero_partial_blocks in punch_hole
> 
> On Mon, Jun 17, 2013 at 11:08:32AM +0200, Lukáš Czerner wrote:
> > > Correction...  reverting patches #15 through #19 (which is what I did in
> > > the dev-with-revert branch found on ext4.git) causes the problem to go
> > > away in the nojournal case, but it causes a huge number of other
> > > problems.  Some of the reverts weren't clean, so it's possible I
> > > screwed up one of the reverts.  It's also possible that only applying
> > > part of this series leaves the tree in an unstable state.
> > > 
> > > I'd much rather figure out how to fix the problem on the dev branch,
> > > so thank you for looking into this!
> > 
> > Wow, this looks bad. Theoretically, reverting patches #15 through
> > #19 should not have any real impact. So far I do not see what is
> > causing that, but I am looking into this.
> 
> I've been looking into this more intensively over the weekend.  I'm
> now beginning to think we have had a pre-existing race, and the
> changes in question have simply changed the timing.  I tried a version
> of the dev branch (you can find it as the branch dev2 in my
> kernel.org's ext4.git tree) which only had patches 1 through 10 of the
> invalidate page range patches (dropping patches 11 through 20), and I
> found that generic/300 was failing in the ext3 configuration (a file
> system with nodelalloc, no flex_bg, and no extents).  I also found
> the same failure with a 3.10-rc2 configuration.
> 
> Your changes seem to make generic/300 fail consistently for me
> using the nojournal configuration, but looking at the patches in
> question, I don't think they could have directly caused the problem.
> Instead, I think they just changed the timing to unmask the problem.

Ok, I thought that there was something weird, because patches #1-#14
should not cause anything like that, and from my testing (see my
previous mail) it really seems they do not cause it, at least not
directly.

> 
> Given that I've seen generic/300 test failures in various different
> baselines going all the way back to 3.9-rc4, this isn't a recent
> regression.  And given that it does seem to be timing sensitive,
> bisecting it is going to be difficult.  On the other hand, given that
> using the dev (or master) branch, generic/300 is failing with a
> greater than 70% probability using kvm with 2 cpu's, 2 megs of RAM and
> 5400 rpm laptop drives in nojournal mode, the fact that it's
> reproducing relatively reliably hopefully will make it easier to find
> the problem.

As mentioned in my previous email, test generic/300 runs without any
problems (even in a loop) without a journal with patches #1 through
#14 applied on 3.10-rc2 (c7788792a5e7b0d5d7f96d0766b4cb6112d47d75).
This is on kvm with 24 CPUs, 8GB of RAM (I suppose you're not using
2MB of RAM in your setup, but rather 2GB :) and server drives with
linear LVM on top of them.

-Lukas

> 
> > I see that there are problems in other modes, not just nojournal. Are
> > those caused by this as well, or are you seeing those even without
> > the patchset?
> 
> I think the other problems in my dev-with-revert branch were caused by
> some screw-up on my part when I did the revert using git.  I found that
> dropping the patches from a copy of the guilt patch stack, and then
> applying all of the patches except for the last half of the invalidate
> page range patch series, resulted in a clean branch that didn't have
> any of these failures.  It's what I should have done late last week,
> instead of trying to use "git revert".
> 
> Cheers,
> 
> 					- Ted
>