git mv fails to deduplicate blob objects on transfer

"git mv" followed by "git push" or "git pull"
can produce wasteful transfers of blob objects

these transfers are wasteful
because the blob object already exists in the destination repo
but "git push" or "git pull" fail to see that

This affects only some cases of "git mv":
in some cases, the deduplication works as expected;
in other cases, it fails.

The overhead is negligible for small files, but noticeable with large files.

In my case, I moved 5 GB of files (250 x 20 MB),
and I was surprised when "git push" wanted to transfer the full 5 GB
instead of a few hundred bytes for the new tree and commit objects.

To reproduce, see repro.sh in
https://github.com/milahu/git-bug-git-mv-wasteful-transfer
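
For reference, the reproduction boils down to something like the
following sketch (sizes, paths and ref names here are illustrative,
not the exact contents of repro.sh):

  set -e
  git init -q --bare remote.git
  git init -q work && cd work
  git remote add origin ../remote.git

  mkdir -p dir1/dir2
  # incompressible data, so pack size ~ blob size
  head -c 1M /dev/urandom > dir1/dir2/file
  git add . && git commit -q -m 'add file'
  git push origin HEAD    # transfers ~1 MiB, as expected

  git mv dir1/dir2/file file   # one of the failing move patterns below
  git commit -q -m 'move file'
  git push origin HEAD    # expected: a few hundred bytes; observed: ~1 MiB again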



Output of repro.sh. The first size is the transfer size before "git mv",
the second size is the transfer size after "git mv":

pass: 1.00 MiB != 288 bytes # path_a=file_a; path_b=file_b
pass: 1.00 MiB != 331 bytes # path_a=dir/file_a; path_b=dir/file_b
pass: 1.00 MiB != 286 bytes # path_a=dir_a/file; path_b=dir_b/file
pass: 1.00 MiB != 284 bytes # path_a=file; path_b=dir/file
pass: 1.00 MiB != 329 bytes # path_a=file; path_b=dir1/dir2/file
pass: 1.00 MiB != 373 bytes # path_a=file; path_b=dir1/dir2/dir3/file
pass: 1.00 MiB != 331 bytes # path_a=file_a; path_b=dir/file_b
pass: 1.00 MiB != 376 bytes # path_a=file_a; path_b=dir1/dir2/file_b
pass: 1.00 MiB != 420 bytes # path_a=file_a; path_b=dir1/dir2/dir3/file_b
pass: 1.00 MiB != 241 bytes # path_a=dir/file; path_b=file
FAIL: 1.00 MiB == 1.00 MiB # path_a=dir1/dir2/file; path_b=file
FAIL: 1.00 MiB == 1.00 MiB # path_a=dir1/dir2/dir3/file; path_b=file
FAIL: 1.00 MiB == 1.00 MiB # path_a=dir/file_a; path_b=file_b
FAIL: 1.00 MiB == 1.00 MiB # path_a=dir1/dir2/file_a; path_b=file_b
FAIL: 1.00 MiB == 1.00 MiB # path_a=dir1/dir2/dir3/file_a; path_b=file_b
FAIL: 1.00 MiB == 1.00 MiB # path_a=dir1a/dir2a/file; path_b=dir1b/dir2b/file
FAIL: 1.00 MiB == 1.00 MiB # path_a=dir1a/file_a; path_b=dir1b/file_b
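
Note that every failing case moves the file up the tree or between
sibling directories, although not every such move fails
(dir/file -> file and dir_a/file -> dir_b/file both pass).

The transfer size can also be measured locally, without touching the
network, by building the same kind of thin pack that a push would send
(a sketch; "origin/master" stands in for the remote's old tip, and the
byte count will not match a real push exactly, but it should show
whether the blob is being resent):

  printf '%s\n' HEAD ^origin/master |
      git pack-objects --revs --thin --stdout > push.pack
  wc -c < push.pack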



See also:

https://colabti.org/ircloggy/git/2024-02-24#l704

https://colabti.org/ircloggy/git/2024-02-25#l218

https://colabti.org/ircloggy/git/2024-02-25#l269

> I have a strong déjà vu about this also; I think we talked about this exact thing a while ago

> this, same: https://colabti.org/ircloggy/git/2023-09-13#l912

https://colabti.org/ircloggy/git/2024-02-25#l404

https://colabti.org/ircloggy/git/2024-02-25#l433

> reading that SO answer by jthill but can't quite get the whole picture from it -- is it saying that it's a trade-off in sending all the objects vs. spending resources trying to figure out what to send?

> I bet there's an opportunity for optimization here; Git could probably figure out a good balance based on how much data it is about to send

https://stackoverflow.com/questions/48228425/git-push-new-branch-with-same-files-uploads-all-files-again




