On 2022/10/5 2:26, Darrick J. Wong wrote:
Notice this line in generic/470:
$XFS_IO_PROG -t -c "truncate $LEN" -c "mmap -S 0 $LEN" -c "mwrite 0 $LEN" \
-c "log_writes -d $LOGWRITES_NAME -m preunmap" \
-f $SCRATCH_MNT/test
The second xfs_io command creates a MAP_SYNC mmap of the
SCRATCH_MNT/test file, and the third command memcpy's bytes to the
mapping to invoke the write page fault handler.
The fourth command tells the dm-logwrites driver for $LOGWRITES_NAME
(aka the block device containing the mounted XFS filesystem) to create a
mark called "preunmap". This mark captures the exact state of the block
device immediately after the write faults complete, so that we can come
back to it later. There are a few things to note here:
(1) We did not tell the fs to persist anything;
(2) We can't use dm-snapshot here, because dm-snapshot will flush the
fs (I think?); and
(3) The fs is still mounted, so the state of the block device at the
mark reflects a dirty XFS with a log that must be replayed.
The next thing the test does is unmount the fs, remove the dm-logwrites
driver to stop recording, and check the fs:
_log_writes_unmount
_log_writes_remove
_dmthin_check_fs
This ensures that the post-umount fs is consistent. Now we want to roll
back to the place we marked to see if the mwrite data made it to pmem.
It *should* have, since we asked for a MAP_SYNC mapping on a fsdax
filesystem recorded on a pmem device:
# check pre-unmap state
_log_writes_replay_log preunmap $DMTHIN_VOL_DEV
_dmthin_mount
dm-logwrites can't actually roll backwards in time to a mark, since it
only records new disk contents. It /can/ however roll forward from
whatever point it began recording writes to the mark, so that's what it
does.
However -- remember note (3) from earlier. When we _dmthin_mount after
replaying the log to the "preunmap" mark, XFS will see the dirty XFS log
and try to recover the XFS log. This is where the replay problems crop
up. The XFS log records a monotonically increasing sequence number
(LSN) with every log update, and when updates are written into the
filesystem, that LSN is also written into the filesystem block. Log
recovery also replays updates into the filesystem, but with the added
behavior that it skips a block replay if the block's LSN is higher than
the transaction being replayed. IOWs, we never replay older block
contents over newer block contents.
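The skip rule can be sketched as a toy shell model. This is not XFS code: plain integers stand in for LSNs, the function name should_replay is hypothetical, and treating equal LSNs as replayable is an assumption of the sketch.

```shell
# Toy model of the skip rule described above; integers stand in for LSNs.
# Returns success (0) when the logged update should be replayed into the block.
should_replay() {
    # $1 = LSN stamped in the on-disk block, $2 = LSN of the transaction.
    # Skip only when the block's LSN is higher than the transaction's.
    [ "$1" -le "$2" ]
}

if should_replay 5 10; then
    echo "block 5, txn 10: replay"
else
    echo "block 5, txn 10: skip"
fi

if should_replay 20 10; then
    echo "block 20, txn 10: replay"
else
    echo "block 20, txn 10: skip"
fi
```

The second call is the interesting one: a block stamped with a higher LSN than the transaction is left untouched.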
For dm-logwrites this is a major problem, because there could be more
filesystem updates written to the XFS log after the mark is made. LSNs
will then be handed out like this:
mkfs_lsn                 preunmap_lsn            umount_lsn
  |                           |                      |
  |--------------------------||----------|-----------|
                             |           |
                          xxx_lsn     yyy_lsn
Let's say that a new metadata block "BBB" was created in update "xxx"
immediately before the preunmap mark was made. Per (1), we didn't flush
the filesystem before taking the mark, which means that the new block's
contents exist only in the log at this point.
Let us further say that the new block was again changed in update "yyy",
where preunmap_lsn < yyy_lsn <= umount_lsn. Clearly, yyy_lsn > xxx_lsn.
yyy_lsn is written to the block at unmount, because unmounting flushes
the log clean before it completes. This is the first time that BBB ever
gets written.
_log_writes_replay_log begins replaying the block device from mkfs_lsn
towards preunmap_lsn. When it's done, it will have a log that reflects
all the changes up to preunmap_lsn. Recall however that BBB isn't
written until after the preunmap mark, which means that dm-logwrites has
no record of BBB before preunmap_lsn, so dm-logwrites replay won't touch
BBB. At this point, the block header for BBB has a UUID that matches
the filesystem, but a LSN (yyy_lsn) that is beyond preunmap_lsn.
XFS log recovery starts up, and finds transaction xxx. It will read BBB
from disk, but then it will see that it has an LSN of yyy_lsn. This is
larger than xxx_lsn, so it concludes that BBB is newer than the log and
moves on to the next log item. No other log items touch BBB, so
recovery finishes, and now we have a filesystem containing one metadata
block (BBB) from the future. This is an inconsistent filesystem, and
has caused failures in the tests that use logwrites.
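The sequence above can be walked through with a toy shell model. Everything here is illustrative: the LSN values are made up, two shell variables stand in for block BBB, and recover_item is a hypothetical name, not an XFS or fstests function.

```shell
# Toy model only: integers stand in for LSNs and two variables stand in for
# metadata block BBB on the replayed device. The values are made up.
xxx_lsn=10      # update that created BBB, logged before the preunmap mark
yyy_lsn=20      # later update to BBB, logged after the mark

# State after _log_writes_replay_log: replay only rolls forward to the mark,
# so the copy of BBB written at unmount (stamped yyy_lsn) is still on disk.
bbb_lsn=$yyy_lsn
bbb_data="yyy"

# recover_item <txn_lsn> <data>: replay one log item into BBB unless the
# on-disk copy of BBB carries a higher LSN.
recover_item() {
    if [ "$bbb_lsn" -gt "$1" ]; then
        echo "txn $1: skipped (BBB carries LSN $bbb_lsn)"
    else
        bbb_lsn=$1
        bbb_data=$2
        echo "txn $1: replayed"
    fi
}

recover_item "$xxx_lsn" "xxx"
echo "BBB after recovery: data=$bbb_data lsn=$bbb_lsn"
```

In this model BBB keeps the "yyy" contents even though the log being recovered ends before yyy_lsn, which is the block-from-the-future situation.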
To work around this problem, all we really need to do is reinitialize
the entire block device to known contents at mkfs time. This can be
done expensively by writing zeroes to the entire block device, or it can
be done cheaply by (a) issuing DISCARD to the whole block device at
the start of the test and (b) ensuring that reads after a discard always
produce zeroes. mkfs.xfs already does (a), so the test merely has to
ensure (b).
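Property (b) can be spot-checked by comparing the start of a device against /dev/zero. The sketch below is an assumption-laden illustration: reads_as_zeroes is a hypothetical helper, it relies on GNU cmp and truncate, and a sparse regular file stands in for a freshly discarded device so that it runs without root.

```shell
# reads_as_zeroes <path> <nbytes>: succeed if the first nbytes of path are
# all zero. Uses "cmp -n", which limits the comparison to nbytes.
reads_as_zeroes() {
    cmp -s -n "$2" "$1" /dev/zero
}

# Demo on a sparse regular file standing in for a discarded block device.
demo=$(mktemp)
truncate -s 1M "$demo"
if reads_as_zeroes "$demo" 1048576; then
    echo "sparse file: reads as zeroes"
fi

# Write a single byte at offset 4096; the region no longer reads as zeroes.
printf 'x' | dd of="$demo" bs=1 seek=4096 conv=notrunc 2>/dev/null
if ! reads_as_zeroes "$demo" 1048576; then
    echo "after a write: not zeroes"
fi
rm -f "$demo"
```

On a real scratch device the same helper could be pointed at the device node after a discard, but that needs root and destroys the device contents.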
dm-thinp is the only software solution that provides (b), so that's why
this test layers dm-logwrites on top of dm-thinp on top of $SCRATCH_DEV.
This combination used to work, but with the pending pmem/blockdev
divorce, this strategy is no longer feasible.
Hi Darrick,
Thanks a lot for your detailed explanation.
Could you tell me if my understanding is correct? I think the issue is
that the log-writes log and the XFS log may record different states of
the block device. It is possible for the XFS log to contain more updates
than the log-writes log does. In this case, we can recover the block
device by replaying the log-writes log, but we will get an inconsistent
filesystem when mounting the block device, because the mount operation
will try to recover more updates for XFS on the block device from the
XFS log. We need to fix the issue by discarding the XFS log on the block
device. mkfs.xfs will try to discard the blocks, including the XFS log,
by calling ioctl(BLKDISCARD), but it will silently ignore the error when
the block device doesn't support ioctl(BLKDISCARD). Is discarding the
XFS log what you meant by "reinitialize the entire block device"?
I think the only way to fix this test is (a) revert all of Christoph's
changes so far and scuttle the divorce; or (b) change this test like so:
Sorry, I don't know which of Christoph's patches need to be reverted.
Could you give me the URL of Christoph's patches?
1. Create a large sparse file on $TEST_DIR and losetup that sparse
file. The resulting loop device will not have dax capability.
2. Set up the dmthin/dmlogwrites stack on top of this loop device.
3. Call mkfs.xfs with the SCRATCH_DEV (which hopefully is a pmem
device) as the realtime device, and set the daxinherit and rtinherit
flags on the root directory. The result is a filesystem with a data
section that the kernel will treat as a regular block device, a
realtime section backed by pmem, and the necessary flags to make
sure that the test file will actually get fsdax mode.
4. Acknowledge that we no longer have any way to test MAP_SYNC
functionality on ext4, which means that generic/470 has to move to
tests/xfs/.
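Steps 1-3 might look roughly like the dry run below. It only prints the commands, since the real ones need root and destroy data. The function name plan_rt_dax_stack, the 10G size, and all paths and device names are hypothetical, and the dm-thin/dm-log-writes setup is left as a comment because it depends on the fstests helpers.

```shell
# Dry run: prints the commands for steps 1-3 instead of executing them.
# All paths, sizes and device names are placeholders.
plan_rt_dax_stack() {
    # $1 = sparse backing file, $2 = pmem device, $3 = top of the dm stack
    echo "truncate -s 10G $1"
    echo "losetup -f --show $1"
    echo "# build dm-thin and dm-log-writes on the loop device, ending at $3"
    echo "mkfs.xfs -f -r rtdev=$2 -d rtinherit=1,daxinherit=1 $3"
    echo "mount -o rtdev=$2 $3 /mnt/scratch"
}

plan_rt_dax_stack /mnt/test/backing.img /dev/pmem0 /dev/mapper/logwrites-test
```

The mkfs.xfs line is the part that carries the idea: the data section lives on the non-dax dm stack, while rtdev points at the pmem device and the inherit flags steer new files to the realtime section in dax mode.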
Sorry, I don't understand why the above test change fixes the issue.
Could you tell me which step discards the XFS log?
In addition, I don't like your idea for the test change because it
would turn generic/470 into an XFS-specific test. Do you know if we
can fix the issue by changing the test in another way? blkdiscard -z can
fix the issue because it zero-fills rather than discards the block
device. However, blkdiscard -z will take a lot of time when the block
device is large.
Best Regards,
Xiao Yang
--D