On 7/18/2018 6:03 PM, Junio C Hamano wrote:
> * ds/multi-pack-index (2018-07-12) 23 commits
>  - midx: clear midx on repack
>  - packfile: skip loading index if in multi-pack-index
>  - midx: prevent duplicate packfile loads
>  - midx: use midx in approximate_object_count
>  - midx: use existing midx when writing new one
>  - midx: use midx in abbreviation calculations
>  - midx: read objects from multi-pack-index
>  - config: create core.multiPackIndex setting
>  - midx: write object offsets
>  - midx: write object id fanout chunk
>  - midx: write object ids in a chunk
>  - midx: sort and deduplicate objects from packfiles
>  - midx: read pack names into array
>  - multi-pack-index: write pack names in chunk
>  - multi-pack-index: read packfile list
>  - packfile: generalize pack directory list
>  - t5319: expand test data
>  - multi-pack-index: load into memory
>  - midx: write header information to lockfile
>  - multi-pack-index: add 'write' verb
>  - multi-pack-index: add builtin
>  - multi-pack-index: add format details
>  - multi-pack-index: add design document
>
>  When there are too many packfiles in a repository (which is not
>  recommended), looking up an object in these would require
>  consulting many pack .idx files; a new mechanism to have a single
>  file that consolidates all of these .idx files is introduced.
>
> What's the doneness of this one?  I vaguely recall that there was
> an objection against the concept as a whole (i.e. there is a way
> with less damage to gain the same object-abbrev performance); has
> it (and if anything else, they) been resolved in satisfactory
> fashion?
I believe you're talking about Ævar's patch series [1] on unconditional
abbreviation lengths. His patch gets similar speedups by eliminating the
abbreviation computation entirely, in favor of a relative increase in
the default length that is very likely to avoid collisions. While the
abbreviation speedups are the most dramatic measurable improvement from
the multi-pack-index feature, they are not its only benefit.
Lookup speeds improve in a multi-pack environment. While only the very
largest repos have trouble repacking into a single pack, there are many
scenarios where users disable auto-gc and do not repack frequently.
On-premises build machines are the case I know best: these machines run
24/7, performing incremental fetches against a remote and kicking off
builds. Admins frequently turn off GC so that build times are not
impacted. Eventually, performance does degrade due to the number of
packfiles, and the answer we give is to set up scheduled maintenance to
repack. These users don't need the space savings of a repack; they just
need consistent performance and high uptime. The multi-pack-index could
assist here (as long as we set up auto-computing the multi-pack-index
after a fetch).
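To make that concrete, here is a rough sketch of what such a maintenance
step could look like once this series lands, using the core.multiPackIndex
setting and the 'write' verb it introduces. This assumes a Git build that
ships the multi-pack-index builtin; the throwaway repo, file names, and
commit contents below are purely illustrative.

```shell
#!/bin/sh
set -e

# Throwaway repo standing in for a build machine's clone.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
git config core.multiPackIndex true
git config user.name test
git config user.email test@example.com

# Simulate incremental fetches with GC disabled: each 'git repack'
# packs only the new loose objects, so packfiles accumulate.
echo one >a.txt && git add a.txt && git commit -qm one && git repack -q
echo two >b.txt && git add b.txt && git commit -qm two && git repack -q

# The maintenance step: write one multi-pack-index covering all packs,
# restoring lookup performance without a full repack.
git multi-pack-index write
ls .git/objects/pack/multi-pack-index
```

The point of the sketch is the last step: it is cheap relative to
`git repack -ad`, so it could run after every fetch rather than on a
repack cadence.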
That's the best I can do to sell the feature as it stands now (plus the
'fsck' integration that would follow after this series is accepted).
I have mentioned the potential for the multi-pack-index to do the following:
* Store metadata about the packfiles, possibly replacing the .keep and
.promisor files, and allowing other extensions to inform repack algorithms.
* Store a stable object order, allowing the reachability bitmap to be
computed at a different cadence from repacking the packfiles.
I'm interested in these applications, but I will admit that they are not
on the top of my priority list at the moment. Right now, I'm focused on
reaching feature parity with the version of the MIDX we have in our GVFS
fork of Git, and then extending the feature to have incremental
multi-pack-index files to solve the "big write" problem.
Thanks,
-Stolee
[1] "[PATCH 00/20] unconditional O(1) SHA-1 abbreviation"
    https://public-inbox.org/git/20180608224136.20220-1-avarab@xxxxxxxxx/T/#u