Re: [RFC] Add --create-cache to repack

Shawn Pearce <spearce@xxxxxxxxxxx> · Mon, 31 Jan 2011 10:47:34 -0800

On Fri, Jan 28, 2011 at 17:32, Shawn Pearce <spearce@xxxxxxxxxxx> wrote:
>>> >
>>> >> This started because I was looking for a way to speed up clones coming
>>> >> from a JGit server.  Cloning the linux-2.6 repository is painful,
>
> Well, scratch the idea in this thread.  I think.

Nope, I'm back in favor of this after fixing JGit's thin pack
generation.  Here's why.

Take linux-2.6.git as of Jan 12th, with the cache root as of Dec 28th:

  $ git update-ref HEAD f878133bf022717b880d0e0995b8f91436fd605c
  $ git-repack.sh --create-cache \
      --cache-root=b52e2a6d6d05421dea6b6a94582126af8cd5cca2 \
      --cache-include=v2.6.11-tree
  $ git repack -a -d

  $ ls -lh objects/pack/
  total 456M
  1.4M pack-74af5edca80797736fe4de7279b2a81af98470a5.idx
  38M pack-74af5edca80797736fe4de7279b2a81af98470a5.pack

  49M pack-d3e77c8b3045c7f54fa2fb6bbfd4dceca1e2b9fa.idx
  89 pack-d3e77c8b3045c7f54fa2fb6bbfd4dceca1e2b9fa.keep
  368M pack-d3e77c8b3045c7f54fa2fb6bbfd4dceca1e2b9fa.pack

Our "recent history" is 38M, and our "cached pack" is 368M, for 406M
in total.  It's a bit more disk than is strictly necessary; this
should be ~380M.  Call it ~26M of wasted disk.  The "cached object
list" I proposed elsewhere in this thread would cost about 41M of disk
and is utterly useless except for initial clones.  Here we are wasting
about 26M of disk to have slightly shorter delta chains in the cached
pack (otherwise known as our ancient history).  So it's a slightly
smaller waste, and we get some (minor) benefit.
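
For anyone who wants to re-check the disk accounting, here is a
minimal sketch in Python.  The sizes are the rounded figures from the
ls output and the estimates quoted above; the variable and function
names are made up for illustration and are not part of git or JGit.

```python
# Sizes in MiB, as rounded by `ls -lh` or estimated in the text above.
recent_pack = 38            # the "recent history" pack
cached_pack = 368           # the "cached pack" (kept via .keep)
expected_total = 380        # the "~380M" figure quoted above
cached_object_list = 41     # estimated cost of the alternative proposal


def disk_overhead(recent, cached, expected):
    """Extra disk used by keeping two packs instead of the expected size."""
    return (recent + cached) - expected


overhead = disk_overhead(recent_pack, cached_pack, expected_total)
print(f"two packs: {recent_pack + cached_pack}M, overhead: {overhead}M")
print(f"cheaper than cached object list by: {cached_object_list - overhead}M")
```

The point of the comparison is only relative: both approaches cost
tens of megabytes, but the split pack is the smaller of the two.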

Clone without pack caching:

  $ time git clone --bare git://localhost/tmp_linux26_withTag tmp_in.git
  Cloning into bare repository tmp_in.git...
  remote: Counting objects: 1861830, done
  remote: Finding sources: 100% (1861830/1861830)
  remote: Getting sizes: 100% (88243/88243)
  remote: Compressing objects: 100% (88184/88184)
  Receiving objects: 100% (1861830/1861830), 376.01 MiB | 19.01 MiB/s, done.
  remote: Total 1861830 (delta 4706), reused 1851053 (delta 1553844)
  Resolving deltas: 100% (1564621/1564621), done.

  real	3m19.005s
  user	1m36.250s
  sys	0m10.290s

Clone with pack caching:

  $ time git clone --bare git://localhost/tmp_linux26_withTag tmp_in.git
  Cloning into bare repository tmp_in.git...
  remote: Counting objects: 1601, done
  remote: Counting objects: 1828460, done
  remote: Finding sources: 100% (50475/50475)
  remote: Getting sizes: 100% (18843/18843)
  remote: Compressing objects: 100% (7585/7585)
  remote: Total 1861830 (delta 2407), reused 1856197 (delta 37510)
  Receiving objects: 100% (1861830/1861830), 378.40 MiB | 31.31 MiB/s, done.
  Resolving deltas: 100% (1559477/1559477), done.

  real	2m2.938s
  user	1m35.890s
  sys	0m9.830s

Using the cached pack increased our total data transfer by 2.39 MiB,
but saved about 1m16s on server computation time.  If we go back and
look at our cached pack size (368M), the leading thin-pack should be
about 10.4 MiB (378.40M - 368M = 10.4M).  If I modify the tmp_in.git
client to have only the cached pack's tip and fetch using CGit, we see
the thin pack to bring ourselves current is 11.07 MiB (JGit does this
in 10.96 MiB):
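
The comparison between the two clones can be reproduced from the
transcripts with a few lines of Python.  This is just a sketch of the
arithmetic; the numbers are copied from the `time` and "Receiving
objects" lines above, and the names are illustrative only.

```python
def to_seconds(minutes, seconds):
    """Convert the m/s pair printed by `time` into seconds."""
    return minutes * 60 + seconds


# "Receiving objects" totals in MiB for each clone.
plain_mib = 376.01
cached_mib = 378.40

# Wall-clock ("real") times for each clone.
plain_real = to_seconds(3, 19.005)
cached_real = to_seconds(2, 2.938)

extra_transfer = cached_mib - plain_mib       # cost of pack caching
time_saved = plain_real - cached_real         # benefit of pack caching
thin_pack_estimate = cached_mib - 368         # transfer minus cached pack

print(f"extra transfer: {extra_transfer:.2f} MiB")
print(f"time saved: {time_saved:.1f} s")
print(f"estimated leading thin pack: {thin_pack_estimate:.1f} MiB")
```

Note the thin pack estimate is only as good as the 368M figure, which
ls has already rounded.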

  $ cd tmp_in.git
  $ git update-ref HEAD b52e2a6d6d05421dea6b6a94582126af8cd5cca2
  $ git repack -a -d  ; # yay we are at ~1 month ago

  $ time git fetch ../tmp_linux26_withTag
  remote: Counting objects: 60570, done.
  remote: Compressing objects: 100% (11924/11924), done.
  remote: Total 49804 (delta 42196), reused 44837 (delta 37231)
  Receiving objects: 100% (49804/49804), 11.07 MiB | 7.37 MiB/s, done.
  Resolving deltas: 100% (42196/42196), completed with 4956 local objects.
  From ../tmp_linux26_withTag
   * branch            HEAD       -> FETCH_HEAD

  real	0m35.083s
  user	0m25.710s
  sys	0m1.190s

The pack caching feature is *no worse* in transfer size than if the
client copied the pack from 1 month ago, and then did an incremental
fetch to bring themselves current.  Compared to the naive clone, it
saves an incredible amount of working set space and CPU time.  The
server only needs to keep track of the incremental thin pack, and can
completely ignore the ancient history objects.  It's a great
alternative for projects that want users to rsync/http dumb transport
down a large stable repository, then incremental fetch themselves
current.  Or busy mirror sites that are willing to trade some small
bandwidth for server CPU and memory.

In this particular example, there is ~11 MiB of data that cannot be
safely resumed, or the first 2.9%.  At 56 KiB/s, a client needs to get
through the first 3 minutes of transfer before they can reach the
resumable checkpoint (where the thin pack ends, and the cached pack
starts).  It would be better if we could resume anywhere in the
stream, but being able to resume the last 97% is infinitely better
than being able to resume nothing.  If someone wants to really go
crazy, this is where a "gittorrent" client could start up and handle
the remaining 97% of the transfer.  :-)
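
The resume numbers above can be sanity-checked the same way.  A rough
sketch, assuming the ~11 MiB thin pack and the 378.40 MiB total from
the cached clone, and a link that sustains exactly 56 KiB/s (names are
illustrative):

```python
thin_pack_mib = 11.0        # leading thin pack, not resumable
total_mib = 378.40          # total transfer for the cached clone
link_kib_per_s = 56         # assumed sustained client bandwidth

# Share of the stream that must arrive before the resumable checkpoint.
non_resumable_fraction = thin_pack_mib / total_mib

# Time to get through the thin pack at the assumed bandwidth.
seconds_to_checkpoint = thin_pack_mib * 1024 / link_kib_per_s

print(f"not resumable: {non_resumable_fraction * 100:.1f}%")
print(f"time to checkpoint: {seconds_to_checkpoint / 60:.1f} minutes")
```

So "3 minutes" is on the optimistic side of rounding; it is closer to
three and a third.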

I think this is worthwhile.  If we are afraid of the extra 2.39 MiB
data transfer this forces on the client when the repository owner
enables the feature, we should go back and improve our thin-pack code.
 Transferring 11 MiB to catch up a kernel from Dec 28th to Jan 12th
sounds like a lot of data, and any improvements in the general
thin-pack code would shrink the leading thin-pack, possibly getting us
that 2.39 MiB back.

-- 
Shawn.