On Wed, Feb 27, 2019 at 02:27:49PM -0500, Josef Bacik wrote:
> On Wed, Feb 27, 2019 at 12:13:37PM -0500, Brian Foster wrote:
> > On Wed, Feb 27, 2019 at 10:54:20AM -0500, Josef Bacik wrote:
> > > On Wed, Feb 27, 2019 at 09:17:32AM -0500, Brian Foster wrote:
> > > > On Wed, Feb 27, 2019 at 08:18:39AM -0500, Brian Foster wrote:
> > > > > On Tue, Feb 26, 2019 at 11:10:02PM +0200, Amir Goldstein wrote:
> > > > > > On Tue, Feb 26, 2019 at 8:14 PM Brian Foster <bfoster@xxxxxxxxxx> wrote:
> > > > > > >
> > > > > > > The dm-log-writes mechanism runs a workload against a filesystem,
> > > > > > > tracks underlying FUAs and restores the filesystem to various points
> > > > > > > in time based on FUA marks. This allows fstests to check fs
> > > > > > > consistency at various points and verify log recovery works as
> > > > > > > expected.
> > > > > > >
> > > > > >
> > > > > > Inaccurate. generic/482 restores to FUA points.
> > > > > > generic/45[57] restore to user-defined points in time (marks).
> > > > > > The dm-log-writes mechanism is capable of restoring either.
> > > > > >
> > > > >
> > > > > The above is poorly worded. I'm aware of the separate tests and I've
> > > > > used the mechanism to bounce around to various marks. Note that my
> > > > > understanding of the mechanism beyond that is rudimentary. I'll reword
> > > > > this if the patch survives, but it sounds like there may be an
> > > > > opportunity to fix the mechanism, which clearly would be ideal.
> > > > >
> > > > > > > This mechanism does not play well with LSN-based log recovery
> > > > > > > ordering behavior on XFS v5 superblocks, however. For example,
> > > > > > > generic/482 can reproduce false positive corruptions based on
> > > > > > > extent-to-btree conversion of an inode if the inode and associated
> > > > > > > btree block are written back after different checkpoints. Even
> > > > > > > though both items are logged correctly in the extent-to-btree
> > > > > > > transaction, the btree block can be relogged (multiple times) and
> > > > > > > only written back once when the filesystem unmounts. If the inode
> > > > > > > was written back after the initial conversion, recovery points
> > > > > > > between that mark and when the btree block is ultimately written
> > > > > > > back will show corruption, because log recovery sees that the
> > > > > > > destination buffer is newer than the recovered buffer and
> > > > > > > intentionally skips the buffer. This is a false positive because
> > > > > > > the destination buffer was legitimately written back after being
> > > > > > > physically relogged one or more times.
> > > > > > >
> > > > > >
> > > > > > This story doesn't add up.
> > > > > > Either dm-log-writes emulates power failure correctly or it doesn't.
> > > > >
> > > > > It doesn't. It leaves the log and broader filesystem in a state that
> > > > > makes no sense with respect to a power failure.
> > > > >
> > > > > > My understanding is that the issue you are seeing is a result of
> > > > > > XFS seeing "data from the future" after a restore of a power failure
> > > > > > snapshot, because the scratch device is not a clean slate.
> > > > > > If I am right, then the correct solution is to wipe the journal
> > > > > > before starting to replay restore points.
> > > > > >
> > > > > > Am I misunderstanding what's going on?
> > > > > >
> > > > >
> > > > > Slightly. Wiping the journal will not help. I _think_ that a wipe of
> > > > > the broader filesystem before recovering from the initial FUA and
> > > > > replaying in order from there would mitigate the problem. Is there an
> > > > > easy way to test that theory? For example, would a mkfs of the scratch
> > > > > device before the replay sequence of generic/482 begins allow the test
> > > > > to still otherwise function correctly?
> > > > >
> > > >
> > > > FYI, I gave this a try and it didn't ultimately work because mkfs
> > > > didn't clear the device either. I ended up reproducing the problem,
> > > > physically zeroing the device, replaying the associated FUA and
> > > > observing the problem go away. From there, if I replay to the final
> > > > FUA mark and go back to the (originally) problematic FUA, the problem
> > > > is reintroduced.
> > > >
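To make that sequence concrete, it was roughly the following, using the
replay-log tool from fstests' src/log-writes (a sketch only: the device
paths and mark name are made up, and generic/482 itself steps through
FUA points with --limit rather than replaying to a named mark):

    # illustrative device names
    LOGWRITES_DEV=/dev/mapper/logwrites-test
    SCRATCH_DEV=/dev/sdb1

    # physically zero the scratch device so nothing "from the future"
    # survives a previous replay (mkfs alone doesn't achieve this)
    dd if=/dev/zero of=$SCRATCH_DEV bs=1M oflag=direct

    # replay the write log from the beginning up to the point of
    # interest
    src/log-writes/replay-log --log $LOGWRITES_DEV \
        --replay $SCRATCH_DEV --end-mark <mark>

    # at this point a mount + xfs_repair -n no longer report the
    # false positive corruption

The dd step is the important part: without returning the device to a
known state first, blocks that were relogged and written back later in
the original run are still sitting on the device when an earlier point
in time is replayed, which is exactly what trips the LSN check.
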
> > > Sorry guys, whenever I run log-writes on xfs I use my helper script here
> > >
> > > https://github.com/josefbacik/log-writes
> > >
> > > specifically replay-individual-faster.sh. This creates a snapshot at
> > > every replay point, mounts and checks the fs, and then destroys the
> > > snapshot and keeps going. This way you don't end up with the "new"
> > > data still being on the device. It's not super fast, but this is
> > > usually a fire-and-forget sort of thing. I could probably integrate
> > > this into xfstests for our log-writes tests; those tend not to
> > > generate large logs, so it wouldn't take super long. Does this fix the
> > > problem for you, Brian?
> > >
> >
> > Thanks Josef. From a glance at that script I'm not quite following how
> > this fits together. Are you taking a snapshot of the original device
> > before the workload being tested is run against dm-log-writes, then
> > replaying on top of that? In general, anything that puts the device
> > back into the state from before the workload ran and replays from there
> > should be enough to fix the problem, I think. As long as a replay
> > sequence runs in order, I don't think snapshots of each replay point
> > should technically be necessary (vs. a single replay snapshot, e.g.).
> >
>
> Well we do this so we can mount/unmount the fs to get the log replay to
> happen, and then run fsck. We want the snapshot so that we can roll back
> the changes of the log replay.
>
> > Note that I ended up testing generic/482 with a loop device as the
> > scratch device and that also worked around the problem, I suspect
> > because that allowed the aforementioned discards to actually reset the
> > underlying device via loop discard->hole punch (which doesn't occur
> > with my original, non-loop scratch dev). So it seems that another
> > option for more deterministic behavior in fstests could be to just
> > enforce the use of a loop device as the underlying target in these
> > particular tests. For example, have the log-writes init create a file
> > on the test device and loop mount that rather than using the scratch
> > device. Just a thought though; it's not clear to me that's any simpler
> > than what you have already implemented here..
> >
>
> I'm good either way. What I've done here isn't simple, it's more meant
> for my long-running tests. Doing loop and discarding the whole device
> would be cool, but I'm afraid of punch-hole bugs in the underlying
> filesystem making things weird, or of running on kernels where loop
> doesn't do punch hole or the fs doesn't support punch hole.

True..

> Each solution has its drawbacks. Adding a bunch of infrastructure to do
> snapshots isn't super awesome, but may be the most flexible. The loop
> solution has the added benefit of also testing punch hole/discard ;).

As yet another option, it looks like fstests already has thin pool
infrastructure in common/dmthin. I ran the same test using a manually
created thin volume for SCRATCH_DEV instead of a loop device. I see a
performance drop, but no failures so far, and the test still runs in the
low-to-mid 200s (seconds), which seems reasonable enough to me.
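For reference, the manual setup was along these lines (again a sketch:
device names and sizes are examples, and common/dmthin automates the
equivalent):

    META_DEV=/dev/sdb1            # example pool metadata device
    DATA_DEV=/dev/sdb2            # example pool data device

    # thin-pool table: <start> <len> thin-pool <meta> <data>
    #                  <data block size (sectors)> <low water mark>
    dmsetup create scratch-pool --table \
        "0 $(blockdev --getsz $DATA_DEV) thin-pool $META_DEV $DATA_DEV 128 32768"

    # allocate thin device id 0 in the pool, then activate a 10G
    # (20971520-sector) thin volume on top of it
    dmsetup message /dev/mapper/scratch-pool 0 "create_thin 0"
    dmsetup create scratch-thin \
        --table "0 20971520 thin /dev/mapper/scratch-pool 0"

    SCRATCH_DEV=/dev/mapper/scratch-thin

The nice property for this use case is that discards unmap blocks from
the thin volume and unmapped blocks read back as zeroes, so the replay
loop gets the same clean-slate behavior as loop + hole punch without
depending on punch-hole support in an underlying filesystem.
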
Brian

> Josef