Re: mv is massively slower on the host rather than in a nspawn chroot, regression somewhere?

Chris Murphy <lists@xxxxxxxxxxxxxxxxx> · Mon, 24 Jan 2022 14:56:09 -0700

On Mon, Jan 24, 2022 at 11:41 AM Robert-André Mauchin <zebob.m@xxxxxxxxx> wrote:
>
> On 1/24/22 05:14, Chris Murphy wrote:

> > What file system is being used in each case?
> >
>
> Everything is btrfs.
>
> > This is a bit obscure but... cp and mv use reflink=auto. On XFS and
> > Btrfs this means it'll make reflinks (copies metadata, doesn't
> > duplicate the data extents) if it can. Falling back to a full copy
> > (metadata and data extents).
> >
> But both the host and the nspawn container are using btrfs?

Should be true, and if this nspawn container is running on the host
then they should share the same btrfs file system. And even if nspawn
is creating separate subvolumes for the mock build (not sure if it
does), those would be nested subvolumes, not mounted ones, so there's
no mount point boundary to cross and you *do* get reflink copies
between subvolumes.

> > It might not be possible due to an obscure VFS rule that disallows
> > reflinks (for reasons I don't understand) when the copy or move
> > crosses mount point boundaries. This includes bind mounts of
> > directories. Bind mounts are also what are employed behind the scene
> > with 'mount -o subvol' mount option on Btrfs, which we use by default
> > in Fedora Workstation and Cloud Edition, and all the desktop spins.
> >
> > The nspawn container, I'm not super familiar with how it works. I
> > think on Btrfs, it will create nested subvolumes, i.e. they are not
> > mounted with the subvol mount option, hence no mount point boundary.
> > But on other file systems, I think nspawn creates a loop mounted file
> > system?
> >
> >
> I've got two subvol:
>
> UUID=ee9eec69-8710-4503-b389-e16fcde8a0a5 /      btrfs  subvol=root,compress=zstd:1 0 0
>
> UUID=d7e21336-6ac6-483a-b4f2-aaeecabd8f1f /home  btrfs  subvol=home,compress=zstd:1 0 0
>
> but when I do my tests there is no subvol crossing, everything happens
> on the root subvol?

It might be there's a nested subvolume created by nspawn (I'm not
sure), so maybe part of it happens in some other subvolume. But there
should still be an efficient (reflink) copy.

If cp or mv aren't literally invoked, and the copy is done by some
library, then we'd need to find out which ioctl is actually being
called. For example, upstream coreutils only recently cut a new
release, v9.0 (only in rawhide so far), in which cp defaults to
reflink=auto. The previous default was reflink=never, which is what's
used most everywhere else other than Fedora.

$ strace cp --reflink=always A B
...
ioctl(4, BTRFS_IOC_CLONE or FICLONE, 3) = 0

$ strace cp --reflink=never A B
...
fadvise64(3, 0, 0, POSIX_FADV_SEQUENTIAL) = 0
mmap(NULL, 139264, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7faf80f5e000
read(3, "2022/01/12 13:48:21 Starting Blu"..., 131072) = 1756
write(4, "2022/01/12 13:48:21 Starting Blu"..., 1756) = 1756
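
If the copy is done by a library rather than cp, the reflink=auto
behavior boils down to "try the clone ioctl, fall back to read/write".
Here is a minimal sketch of that idea in Python, assuming Linux. The
helper name copy_reflink_auto is made up for illustration, the FICLONE
value is hardcoded from linux/fs.h as an assumption, and this is not
meant to mirror what coreutils does internally.

```python
import fcntl
import shutil

# FICLONE from <linux/fs.h>: _IOW(0x94, 9, int). Hardcoded here as an
# assumption for Linux; newer Python versions may expose fcntl.FICLONE.
FICLONE = 0x40049409


def copy_reflink_auto(src, dst):
    """Copy src to dst; return "reflink" or "copy" (hypothetical helper)."""
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        try:
            # Ask the file system to share src's data extents with dst.
            fcntl.ioctl(fout.fileno(), FICLONE, fin.fileno())
            return "reflink"
        except OSError:
            # e.g. EXDEV, EOPNOTSUPP, ENOTTY: duplicate the data instead.
            shutil.copyfileobj(fin, fout)
            return "copy"
```

Calling copy_reflink_auto("A", "B") and checking the return value tells
you which path was taken, much like looking for the ioctl versus the
read/write pairs in the strace output.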

Sorry though if this is a goose chase. I can't tell if it's a factor
in what's going on. But maybe someone else will find this interesting
:D There is a mostly reliable way to determine if a file is a reflink
copy.

Before the copy, look at the file:

$ filefrag -v A
...
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..      14:   10103641..  10103655:     15:             last,encoded,eof
...

The key takeaway is the physical offset, 10103641. Let's copy the file
within the same directory:

$ cp A B
$ filefrag -v B
...
   0:        0..      14:   10103641..  10103655:     15:             last,encoded,shared,eof
...

Again 10103641. So the data extent location is the same, which is only
possible with a reflink copy; that's why reflinks also go by a more
technical name, shared extents. You also see "shared" in the flags
column. That flag is only there because both A and B exist. If I
remove A or B, so that only one file is using those extents, they're
no longer shared and the "shared" flag goes away. Hence my emphasis on
the address. There *is* logical block address reuse in Btrfs, but due
to COW an address isn't going to be reused in less than about a
minute.

$ cp --reflink=never A C
$ filefrag -v C
...
   0:        0..      14:   10398358..  10398372:     15:             last,encoded,eof

Different location because the data extents were duplicated, not shared.
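
If you'd rather not compare the numbers by eye, here is a rough sketch
in Python that pulls the physical offsets and flags out of filefrag -v
output. It assumes each extent row is on a single line, as filefrag
prints it in a terminal, and the function names extents and
shares_extents are made up for illustration. Comparing start offsets
is only a heuristic, for the address reuse reason mentioned above.

```python
import re

# Matches one extent row of `filefrag -v`, e.g.
#    0:        0..      14:   10103641..  10103655:     15:   last,eof
ROW = re.compile(
    r"^\s*\d+:\s*\d+\.\.\s*\d+:\s*(\d+)\.\.\s*(\d+):\s*\d+:(.*)$"
)


def extents(filefrag_output):
    """Return a list of (physical_start, physical_end, flags) tuples."""
    result = []
    for line in filefrag_output.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # header, summary, or "..." lines
        rest = m.group(3).split()
        # The optional "expected" column ends with ':'; flags do not.
        if rest and not rest[-1].endswith(":"):
            flags = set(rest[-1].split(","))
        else:
            flags = set()
        result.append((int(m.group(1)), int(m.group(2)), flags))
    return result


def shares_extents(out_a, out_b):
    """True if any extent starts at the same physical offset in both."""
    starts_a = {start for start, _, _ in extents(out_a)}
    starts_b = {start for start, _, _ in extents(out_b)}
    return bool(starts_a & starts_b)
```

The input strings can come from something like
subprocess.run(["filefrag", "-v", path], capture_output=True,
text=True).stdout, run once per file.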

This is the same on XFS too. The subtle differences maybe don't matter
much here. A btrfs subvolume does have its own st_dev, so things like
rsync -x and borg will not cross subvolume boundaries.
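
To see the st_dev difference for yourself, a tiny sketch in Python (the
helper name same_st_dev is made up for illustration). Note that a
differing st_dev can mean either another subvolume or another file
system, and that a bind mount of the same file system reports the same
st_dev, so this doesn't by itself tell you whether a reflink copy is
possible.

```python
import os


def same_st_dev(path_a, path_b):
    """True if both paths report the same st_dev (hypothetical helper)."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev
```

For example, same_st_dev("/", "/home") compares the two mount points
from the fstab quoted earlier.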

Chris Murphy