On Wed, Jan 08, 2020 at 11:36:04AM +0000, Filipe Manana wrote:
> On Tue, Jan 7, 2020 at 5:57 PM Darrick J. Wong <darrick.wong@xxxxxxxxxx> wrote:
> >
> > On Tue, Jan 07, 2020 at 04:23:15PM +0000, Filipe Manana wrote:
> > > On Mon, Dec 16, 2019 at 6:28 PM <fdmanana@xxxxxxxxxx> wrote:
> > > >
> > > > From: Filipe Manana <fdmanana@xxxxxxxx>
> > > >
> > > > We always round down, to a multiple of the filesystem's block size, the
> > > > length to deduplicate at generic_remap_check_len(). However this is only
> > > > needed if an attempt to deduplicate the last block into the middle of the
> > > > destination file is requested, since that leads into a corruption if the
> > > > length of the source file is not block size aligned. When an attempt to
> > > > deduplicate the last block into the end of the destination file is
> > > > requested, we should allow it because it is safe to do it - there's no
> > > > stale data exposure and we are prepared to compare the data ranges for
> > > > a length not aligned to the block (or page) size - in fact we even do
> > > > the data compare before adjusting the deduplication length.
> > > >
> > > > After btrfs was updated to use the generic helpers from VFS (by commit
> > > > 34a28e3d77535e ("Btrfs: use generic_remap_file_range_prep() for cloning
> > > > and deduplication")) we started to have user reports of deduplication
> > > > not reflinking the last block anymore, and whence users getting lower
> > > > deduplication scores. The main use case is deduplication of entire
> > > > files that have a size not aligned to the block size of the filesystem.
> > > >
> > > > We already allow cloning the last block to the end (and beyond) of the
> > > > destination file, so allow for deduplication as well.
> > > >
> > > > Link: https://lore.kernel.org/linux-btrfs/2019-1576167349.500456@xxxxxxxxxxxxxx/
> > > > Signed-off-by: Filipe Manana <fdmanana@xxxxxxxx>
> > >
> > > Darrick, Al, any feedback?
> >
> > Is there a fstest to check for correct operation of dedupe at or beyond
> > source and destfile EOF? Particularly if one range is /not/ at EOF?
>
> Such as what generic/158 does already?

Urk, heh. :)

> > And that an mmap read of the EOF block will see zeroes past EOF before
> > and after the dedupe operation?
>
> Can you elaborate a bit more? Why an mmap read and not a buffered or a
> direct IO read before and after deduplication?
> Is there anything special for the mmap reads on xfs, is that your
> concern? Or is the idea to deduplicate while the file is mmap'ed?

I cite mmap reads past EOF specifically because unlike buffered/direct
reads where the VFS will stop reading exactly at EOF, a memory mapping
maps in an entire memory page, and the fs is supposed to ensure that the
bytes past EOF are zeroed.

Hm now that I look at g/158 it doesn't actually verify mmap reads. I
looked around and can't really see anything that checks mmap reads
before and after a dedupe operation at EOF.

> > If I fallocate a 16k file, write 'X' into the first 5000 bytes,
> > write 'X' into the first 66,440 bytes (60k + 5000) of a second file, and
> > then try to dedupe (first file, 0-8k) with (second file, 60k-68k),
> > should that work?
>
> You haven't mentioned the size of the second file, nor if the first
> file has a size of 16K which I assume (instead of fallocate with the
> keep size flag).

Er, sorry, yes. The first file is 16,384 bytes long; the second file is
66,440 bytes.

> Anyway, I assume you actually meant to dedupe the range 0 - 5000 from
> the first file into the range 60k - 60k + 5000 of the second file, and
> that the second file has a size of 60k + 5000.

Nope, I meant to say to dedupe the range (offset: 0, length: 8192) from
the first file into the second file (offset: 61440, length: 8192). The
source range is entirely below EOF, and the dest range ends right at EOF
in the second file.

> If so, that fails with -EINVAL because the source range is not block
> size aligned, and we already have generic fstests that test attempt to
> duplication and clone non-aligned ranges that don't end at eof.
> This patch doesn't change that behaviour, it only aims to allow
> deduplication of the eof block of the source file into the eof of the
> destination file.
>
>
> >
> > I'm convinced that we could support dedupe to EOF when the ranges of the
> > two files both end at the respective file's EOF, but it's the weirder
> > corner cases that I worry about...
>
> Well, we used to do that in btrfs before migrating to the generic code.
> Since I discovered the corruption due to deduplication of the eof
> block into the middle of a file in 2018's summer, the btrfs fix
> allowed deduplication of the eof block only if the destination end
> offset matched the eof of the destination file:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=de02b9f6bb65a6a1848f346f7a3617b7a9b930c0
>
> Since then no issues were found nor users reported any problems so far.

<nod> I'm ok with that one scenario, it's the "one range ends at eof,
the other doesn't" case that I'm picking on. :)

(Another way to shut me up would be to run generic/52[12] with
TIME_FACTOR=1000 (i.e. 1 billion fsx ops) and see what comes exploding
out. :))

> Any other specific test you would like to see?

No, just that. And mmap reads. :)

--D

> Thanks.
>
> >
> > --D
> >
> > > Thanks.
> > >
> > > > ---
> > > >  fs/read_write.c | 10 ++++------
> > > >  1 file changed, 4 insertions(+), 6 deletions(-)
> > > >
> > > > diff --git a/fs/read_write.c b/fs/read_write.c
> > > > index 5bbf587f5bc1..7458fccc59e1 100644
> > > > --- a/fs/read_write.c
> > > > +++ b/fs/read_write.c
> > > > @@ -1777,10 +1777,9 @@ static int remap_verify_area(struct file *file, loff_t pos, loff_t len,
> > > >   * else. Assume that the offsets have already been checked for block
> > > >   * alignment.
> > > >   *
> > > > - * For deduplication we always scale down to the previous block because we
> > > > - * can't meaningfully compare post-EOF contents.
> > > > - *
> > > > - * For clone we only link a partial EOF block above the destination file's EOF.
> > > > + * For clone we only link a partial EOF block above or at the destination file's
> > > > + * EOF. For deduplication we accept a partial EOF block only if it ends at the
> > > > + * destination file's EOF (can not link it into the middle of a file).
> > > >   *
> > > >   * Shorten the request if possible.
> > > >   */
> > > > @@ -1796,8 +1795,7 @@ static int generic_remap_check_len(struct inode *inode_in,
> > > >  	if ((*len & blkmask) == 0)
> > > >  		return 0;
> > > >
> > > > -	if ((remap_flags & REMAP_FILE_DEDUP) ||
> > > > -	    pos_out + *len < i_size_read(inode_out))
> > > > +	if (pos_out + *len < i_size_read(inode_out))
> > > >  		new_len &= ~blkmask;
> > > >
> > > >  	if (new_len == *len)
> > > > --
> > > > 2.11.0
> > > >
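
For anyone following the thread, the rounding decision in the second hunk
can be modelled in user space. This is only a sketch of the one condition
the diff touches, not the kernel function: remap_len_model() and all of its
parameters are hypothetical names made up for illustration, the block size
is assumed to be a power of two, and whatever generic_remap_check_len() does
once new_len differs from *len is outside the quoted hunk and is not
modelled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical user-space model of the rounding step changed by the patch.
 * Returns the length that would remain after the "new_len &= ~blkmask" step.
 *
 * pos_out   - destination offset of the request
 * len       - requested length
 * dest_size - size of the destination file (i_size_read(inode_out) upstream)
 * blksize   - filesystem block size, assumed to be a power of two
 * is_dedupe - true for a dedupe request, false for a clone request
 * patched   - true to model the condition after the patch, false for before
 */
static int64_t remap_len_model(int64_t pos_out, int64_t len, int64_t dest_size,
                               int64_t blksize, bool is_dedupe, bool patched)
{
    int64_t blkmask = blksize - 1;

    /* The mask trick only works for power-of-two block sizes. */
    assert(blksize > 0 && (blksize & blkmask) == 0);

    /* Block-aligned lengths are never shortened. */
    if ((len & blkmask) == 0)
        return len;

    /*
     * Before the patch every dedupe request was rounded down. After it,
     * only requests that end before the destination EOF are rounded down,
     * for clone and dedupe alike.
     */
    if ((!patched && is_dedupe) || pos_out + len < dest_size)
        return len & ~blkmask;

    return len;
}
```

With 4096-byte blocks, deduping a whole 5000-byte file onto another
5000-byte file (pos_out 0, len 5000, dest_size 5000) is cut to 4096 bytes
under the old condition and left at 5000 under the new one, which is the
"last block not reflinked" case from the changelog. The same request into a
16384-byte destination ends in the middle of the file and is still rounded
down to 4096 either way.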