Re: Add a "Flattened Cache" to `git --clone`?

For a repo like git itself, the assertions about the way git
currently builds its data hold up: the current approach (in fact,
including the `checkout` portion) competes directly with the "cached
result" methodology! Holy shit guys, I'm impressed as hell.

tl;dr: The way I read the raw numbers, `git` ends up being as fast as
(or faster than) a "cache" of the .git folder. Without doing further
research, I'm inclined to agree that the previously mentioned bitmap
method is already effectively as efficient as (more efficient
than!?) a cache.


Methodology/Reasoning:
virtualized: tests ran in a virtualized environment; verified zero
network chatter on eth0 before and after each test.
tcpflow: to capture the bits for the entire transaction... the
listener was opened just before `git clone` started executing, and
closed just after execution ended. (not worrying about
protocols/overhead)
tar: to compare the size of the repository on disk with the tcpflow
results. (not worrying about compensating for
headers/metadata/overhead)
gzip: to (theoretically; I haven't verified this) compensate for
seemingly arbitrary size differences when downloading over HTTPS.
time: a (really) rough measure of execution time.


Commands used to generate files:
*.tcpflow: `sudo tcpflow -p -c -i eth0 > $filename.tcpflow`
*.tar: `tar cf $filename.tar .git`
*.gz: `gzip -9 $filename.tar`
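
Strung together, a single run looks roughly like the sketch below.
This is just an illustration of the procedure, not the exact script I
ran; the interface (eth0), the clone URL, and $name are placeholders
to adjust:

name=kernelorg_git

# open the capture on the interface just before the clone begins
sudo tcpflow -p -c -i eth0 > "$name.tcpflow" &
flow_pid=$!

# (really) rough timing of the clone itself
time git clone git://git.kernel.org/pub/scm/git/git.git

# close the listener just after the clone finishes
sudo kill "$flow_pid"

# snapshot the on-disk .git for comparison, then compress it
tar cf "$name.tar" -C git .git
gzip -9 "$name.tar"

ls -lh "$name".*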


Results:

75M kernelorg.tar
72M kernelorg.tar.gz
69M kernelorg_git.tcpflow
69M kernelorg_https.tcpflow

145M github.tar
143M github.tar.gz
143M github_git.tcpflow
142M github_https.tcpflow


Other Tests (sanity checks):

Cloned a gitea mirror of kernel.org's git:
69M gitea_git.tcpflow
69M gitea_https.tcpflow

Cloned a bitbucket mirror of kernel.org's git:
69M bitbucket_git.tcpflow
69M bitbucket_https.tcpflow

$ time git clone git://git.kernel.org/pub/scm/git/git.git
Cloning into 'git'...
remote: Enumerating objects: 15475, done.
remote: Counting objects: 100% (15475/15475), done.
remote: Compressing objects: 100% (861/861), done.
remote: Total 287977 (delta 14910), reused 14907 (delta 14610),
pack-reused 272502
Receiving objects: 100% (287977/287977), 66.09 MiB | 4.87 MiB/s, done.
Resolving deltas: 100% (217420/217420), done.

real    0m20.000s
user    0m15.414s
sys     0m1.606s

$ time wget https://calebgray.com/public/kernelorg.tar.gz
--2020-05-25 06:11:29--  https://calebgray.com/public/kernelorg.tar.gz
Resolving calebgray.com (calebgray.com)... 192.3.203.78
Connecting to calebgray.com (calebgray.com)|192.3.203.78|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 74593708 (71M) [application/octet-stream]
Saving to: ‘kernelorg.tar.gz’

kernelorg.tar.gz
100%[========================================================================================>]
 71.14M  4.81MB/s    in 19s

2020-05-25 06:11:48 (3.79 MB/s) - ‘kernelorg.tar.gz’ saved [74593708/74593708]

real 0m19.420s
user 0m0.030s
sys 0m0.280s


Thanks everyone for your input and time! I love git, you guys do great work!

P.S. I ran a few other benchmarks outside of these, and the timing
always worked out to be more or less the same: the reported transfer
rate (as confirmed by my router) lined up with the "real" time it
took to download, for both `git` and `wget`.

P.P.S. I haven't investigated the reason for the github repo being
nearly twice the size of the kernel.org-hosted copy. That one stands
out as potentially relevant to the proxy discussion, unless there's
actually a difference in the repo's data. Curiosity will likely get
the best of me eventually.
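
If curiosity does win out, a first pass would probably just be
comparing object counts and pack sizes between the two clones; a
minimal sketch, where "kernelorg" and "github" are placeholder names
for the local clone directories:

# placeholder directory names; point these at wherever the clones live
git -C kernelorg count-objects -v -H
git -C github count-objects -v -H

# and eyeball the packs themselves
ls -lh kernelorg/.git/objects/pack/ github/.git/objects/pack/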




On Mon, May 18, 2020 at 9:40 PM Konstantin Tokarev <annulen@xxxxxxxxx> wrote:
>
>
>
> 18.05.2020, 01:12, "Konstantin Ryabitsev" <konstantin@xxxxxxxxxxxxxxxxxxx>:
> > On Fri, May 15, 2020 at 09:42:57PM +0000, Eric Wong wrote:
> >>  That said, I'm not sure if any client-side caching proxies can
> >>  MITM HTTPS and save bandwidth with HTTPS everywhere, nowadays.
> >>  I seem to recall polipo being abandoned because of HTTPS.
> >>  Maybe there's a caching HTTPS MITM proxy out there...
> >
> > Right, this can't operate as a transparent proxy.
>
> AFAIK, Squid can do MITM, caching and operate transparently.
> In the past it was done via ssl_bump directive, but seems like syntax changed a bit
> in modern versions.



