Quoting Alexandre Oliva (2013-03-22 01:27:42)
> On Mar 21, 2013, Chris Mason <chris.mason@xxxxxxxxxxxx> wrote:
>
> > Quoting Chris Mason (2013-03-21 14:06:14)
> >> With mmap the kernel can pick any given time to start writing out
> >> dirty pages.  The idea is that if the application makes more changes,
> >> the page becomes dirty again and the kernel writes it again.
>
> That's the theory.  But what if there's some race between the time the
> page is frozen for compressing and the time it's marked as clean, or
> it's marked as clean after it's further modified, or a subsequent write
> to the same page ends up overridden by the background compression of
> the old contents of the page?  These are all possibilities that come to
> mind without knowing much about btrfs inner workings.

Definitely, there is a lot of room for racing.  Are you using
compression in btrfs, or just in leveldb?

> > So the question is, can you trigger this without snapshots being done
> > at all?
>
> I haven't tried, but I now have a program that hit the error condition
> while taking snapshots in the background, with small time perturbations
> to increase the likelihood of hitting the race at exactly the wrong
> moment.  It uses leveldb's infrastructure for the mmapping, but it
> shouldn't be too hard to adapt it so that it doesn't.
>
> > So my test program creates an 8GB file in chunks of 1MB each.
>
> That's probably too large a chunk to write at a time.  The bug is
> exercised with writes slightly smaller than a single page (although
> straddling two consecutive pages).
>
> This half-baked test program (hereby provided under the terms of the
> GNU GPLv3+) creates a btrfs subvolume and two files in it: one in which
> I/O will be performed with write()s, another that will get the same
> data appended with leveldb's mmap-based output interface.
> Random block sizes, as well as milli- and microsecond timing
> perturbations, are read from /dev/urandom, and the rest of the output
> buffer is filled with (char)1.
>
> The test that actually failed (on the first try!, after some other
> variations that didn't fail) didn't have any of the #ifdef options
> enabled (i.e., no -D* flags during compilation), but it triggered the
> exact failure observed with ceph: zeros at the end of a page where
> there should have been nonzero data, followed by nonzero data on the
> following page!  That was within snapshots, not in the main subvol,
> but hopefully it's the same problem, just a bit harder to trigger.

I'd like to take snapshots out of the picture for a minute.  We need
some way to synchronize leveldb with snapshotting, because from the
db's point of view a snapshot is basically the same thing as a crash.
Corrupting the main database file is a much different (and bigger)
problem.

-chris
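For anyone who wants to reproduce the write pattern under discussion without
pulling in leveldb: below is a minimal Python sketch (not leveldb's actual C++
PosixMmapFile code, and not the GPLv3+ test program Alexandre mentions) of the
mmap-based append path, writing a chunk slightly smaller than a page at an
offset that makes it straddle two consecutive pages.  The file name and chunk
contents are arbitrary; the read-back check at the end looks for the symptom
described above, zeros where nonzero data was written.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE

def mmap_append(path, chunks):
    # Grow the file with ftruncate, then copy each chunk in through a
    # writable mapping -- a simplified stand-in for leveldb's mmap-based
    # append interface.
    with open(path, "r+b") as f:
        offset = os.path.getsize(path)
        total = sum(len(c) for c in chunks)
        os.ftruncate(f.fileno(), offset + total)
        mm = mmap.mmap(f.fileno(), offset + total)
        for c in chunks:
            mm[offset:offset + len(c)] = c
            offset += len(c)
        mm.flush()  # msync; the kernel may also write pages back earlier
        mm.close()

fd, path = tempfile.mkstemp()
os.close(fd)

# First fill half a page, then write a chunk slightly smaller than a page
# so that it straddles the boundary between page 0 and page 1 -- the
# pattern that exercised the bug.
first = bytes([1]) * (PAGE // 2)
straddler = bytes([2]) * (PAGE - 100)
mmap_append(path, [first, straddler])

# Read back through the regular read path and check for the symptom:
# zeros at the end of a page where nonzero data should be.
data = open(path, "rb").read()
assert data[:len(first)] == first
assert data[len(first):len(first) + len(straddler)] == straddler
assert b"\x00" not in data[:len(first) + len(straddler)]
os.remove(path)
```

On a healthy filesystem the assertions pass trivially; the interesting case in
the thread is what the *snapshot* (or post-crash) copy of such a file contains,
which this sketch does not attempt to capture.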