On 7/18/2018 6:03 PM, Junio C Hamano wrote:
> * ds/multi-pack-index (2018-07-12) 23 commits
>  - midx: clear midx on repack
>  - packfile: skip loading index if in multi-pack-index
>  - midx: prevent duplicate packfile loads
>  - midx: use midx in approximate_object_count
>  - midx: use existing midx when writing new one
>  - midx: use midx in abbreviation calculations
>  - midx: read objects from multi-pack-index
>  - config: create core.multiPackIndex setting
>  - midx: write object offsets
>  - midx: write object id fanout chunk
>  - midx: write object ids in a chunk
>  - midx: sort and deduplicate objects from packfiles
>  - midx: read pack names into array
>  - multi-pack-index: write pack names in chunk
>  - multi-pack-index: read packfile list
>  - packfile: generalize pack directory list
>  - t5319: expand test data
>  - multi-pack-index: load into memory
>  - midx: write header information to lockfile
>  - multi-pack-index: add 'write' verb
>  - multi-pack-index: add builtin
>  - multi-pack-index: add format details
>  - multi-pack-index: add design document
>
>  When there are too many packfiles in a repository (which is not
>  recommended), looking up an object in these would require
>  consulting many pack .idx files; a new mechanism to have a single
>  file that consolidates all of these .idx files is introduced.
>
> What's the doneness of this one?  I vaguely recall that there was
> an objection against the concept as a whole (i.e. there is a way
> with less damage to gain the same object-abbrev performance); has
> it (and if anything else, they) been resolved in satisfactory
> fashion?
I believe you're talking about Ævar's patch series [1] on unconditional
abbreviation lengths. His patch gets similar speedups by eliminating the
abbreviation computation entirely, in favor of a relative increase in
the default length that is very likely to avoid collisions. While the
abbreviation speedups are the most dramatic measurable improvement from
the multi-pack-index feature, they are not its only benefit.
Lookup speeds improve in a multi-pack environment. While only the very
largest repos have trouble repacking into a single pack, there are many
scenarios where users disable auto-gc and do not repack frequently.
On-premises build machines are the case I know best: these machines run
24/7, performing incremental fetches against a remote and kicking off
builds. Admins frequently turn off GC so that build times are not
impacted. Eventually, performance does degrade due to the number of
packfiles, and the answer we give is to set up scheduled maintenance to
repack. These users don't need the space savings of a repack; they just
need consistent performance and high uptime. The multi-pack-index could
assist here (as long as we set up auto-computing the multi-pack-index
after a fetch).
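To make that concrete, here is a rough sketch of what such a maintenance
step could look like once this series lands, using the core.multiPackIndex
setting and the 'write' verb it introduces. This assumes a Git build that
ships the multi-pack-index builtin; the throwaway repo, file names, and
commit contents below are purely illustrative.

```shell
#!/bin/sh
set -e

# Throwaway repo standing in for a build machine's clone.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
git config core.multiPackIndex true
git config user.name test
git config user.email test@example.com

# Simulate incremental fetches with GC disabled: each 'git repack'
# packs only the new loose objects, so packfiles accumulate.
echo one >a.txt && git add a.txt && git commit -qm one && git repack -q
echo two >b.txt && git add b.txt && git commit -qm two && git repack -q

# The maintenance step: write one multi-pack-index covering all packs,
# restoring lookup performance without a full repack.
git multi-pack-index write
ls .git/objects/pack/multi-pack-index
```

The point of the sketch is the last step: it is cheap relative to
`git repack -ad`, so it could run after every fetch rather than on a
repack cadence.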
That's the best I can do to sell the feature as it stands now (plus the
'fsck' integration that would follow after this series is accepted).
I have mentioned the potential for the multi-pack-index to do the following:
* Store metadata about the packfiles, possibly replacing the .keep and
.promisor files, and allowing other extensions to inform repack algorithms.
* Store a stable object order, allowing the reachability bitmap to be
computed at a different cadence from repacking the packfiles.
I'm interested in these applications, but I will admit that they are not
on the top of my priority list at the moment. Right now, I'm focused on
reaching feature parity with the version of the MIDX we have in our GVFS
fork of Git, and then extending the feature to have incremental
multi-pack-index files to solve the "big write" problem.
Thanks,
-Stolee
[1] "[PATCH 00/20] unconditional O(1) SHA-1 abbreviation"
    https://public-inbox.org/git/20180608224136.20220-1-avarab@xxxxxxxxx/T/#u