Quoting Alexandre Oliva (2013-03-22 01:27:42)
> On Mar 21, 2013, Chris Mason <chris.mason@xxxxxxxxxxxx> wrote:
>
> > Quoting Chris Mason (2013-03-21 14:06:14)
> >> With mmap the kernel can pick any given time to start writing out
> >> dirty pages.  The idea is that if the application makes more changes,
> >> the page becomes dirty again and the kernel writes it again.
>
> That's the theory.  But what if there's some race between the time the
> page is frozen for compressing and the time it's marked as clean, or
> it's marked as clean after it's further modified, or a subsequent write
> to the same page ends up overridden by the background compression of
> the old contents of the page?  These are all possibilities that come to
> mind without knowing much about btrfs inner workings.

Definitely, there is a lot of room for racing.  Are you using
compression in btrfs, or just in leveldb?

> > So the question is, can you trigger this without snapshots being done
> > at all?
>
> I haven't tried, but I now have a program that hit the error condition
> while taking snapshots in the background, with small time perturbations
> to increase the likelihood of hitting the race at exactly the wrong
> moment.  It uses leveldb's infrastructure for the mmapping, but it
> shouldn't be too hard to adapt it so that it doesn't.
>
> > So my test program creates an 8GB file in chunks of 1MB each.
>
> That's probably too large a chunk to write at a time.  The bug is
> exercised with writes slightly smaller than a single page (although
> straddling two consecutive pages).
>
> This half-baked test program (hereby provided under the terms of the
> GNU GPLv3+) creates a btrfs subvolume and two files in it: one in which
> I/O will be performed with write()s, another that will get the same
> data appended with leveldb's mmap-based output interface.
> Random block sizes, as well as milli- and microsecond timing
> perturbations, are read from /dev/urandom, and the rest of the output
> buffer is filled with (char)1.
>
> The test that actually failed (on the first try!, after some other
> variations that didn't fail) didn't have any of the #ifdef options
> enabled (i.e., no -D* flags during compilation), but it triggered the
> exact failure observed with ceph: zeros at the end of a page where
> there should have been nonzero data, followed by nonzero data on the
> following page!  That was within snapshots, not in the main subvol,
> but hopefully it's the same problem, just a bit harder to trigger.

I'd like to take snapshots out of the picture for a minute.  We need
some way to synchronize leveldb with snapshotting, because from the
db's point of view a snapshot is basically the same thing as a crash.
Corrupting the main database file is a much different (and bigger)
problem.

-chris
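For anyone who wants to reproduce the write pattern under discussion without
pulling in leveldb: below is a minimal Python sketch (not leveldb's actual C++
PosixMmapFile code, and not the GPLv3+ test program Alexandre mentions) of the
mmap-based append path, writing a chunk slightly smaller than a page at an
offset that makes it straddle two consecutive pages.  The file name and chunk
contents are arbitrary; the read-back check at the end looks for the symptom
described above, zeros where nonzero data was written.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE

def mmap_append(path, chunks):
    # Grow the file with ftruncate, then copy each chunk in through a
    # writable mapping -- a simplified stand-in for leveldb's mmap-based
    # append interface.
    with open(path, "r+b") as f:
        offset = os.path.getsize(path)
        total = sum(len(c) for c in chunks)
        os.ftruncate(f.fileno(), offset + total)
        mm = mmap.mmap(f.fileno(), offset + total)
        for c in chunks:
            mm[offset:offset + len(c)] = c
            offset += len(c)
        mm.flush()  # msync; the kernel may also write pages back earlier
        mm.close()

fd, path = tempfile.mkstemp()
os.close(fd)

# First fill half a page, then write a chunk slightly smaller than a page
# so that it straddles the boundary between page 0 and page 1 -- the
# pattern that exercised the bug.
first = bytes([1]) * (PAGE // 2)
straddler = bytes([2]) * (PAGE - 100)
mmap_append(path, [first, straddler])

# Read back through the regular read path and check for the symptom:
# zeros at the end of a page where nonzero data should be.
data = open(path, "rb").read()
assert data[:len(first)] == first
assert data[len(first):len(first) + len(straddler)] == straddler
assert b"\x00" not in data[:len(first) + len(straddler)]
os.remove(path)
```

On a healthy filesystem the assertions pass trivially; the interesting case in
the thread is what the *snapshot* (or post-crash) copy of such a file contains,
which this sketch does not attempt to capture.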