On 8/29/2018 9:27 AM, Derrick Stolee wrote:
> On 8/29/2018 9:09 AM, Johannes Schindelin wrote:
>> What I meant was to leverage the midx code, not the .midx files.
>> My comment was motivated by my realizing that both the SHA-1 <-> SHA-256
>> mapping and the MIDX code have to look up (in a *fast* way) information
>> with hash values as keys. *And* this information is immutable. *And* the
>> amount of information should grow with new objects being added to the
>> database.
> I'm unsure what this means, as the multi-pack-index simply uses
> bsearch_hash() to find hashes in the list. The same method is used for
> IDX lookups.
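To make that kind of lookup concrete, here is a minimal sketch of a
fanout-assisted binary search over a sorted list of hashes. It is not
Git's bsearch_hash(): the names lookup_hash(), build_fanout() and
HASH_LEN are hypothetical, the hash width is assumed to be 20 bytes,
and the fanout table is kept in host byte order for simplicity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HASH_LEN 20 /* SHA-1 width; an assumption for this sketch */

/*
 * Hypothetical helper: fill fanout[] for `table`, a packed array of
 * `count` hashes sorted in ascending memcmp() order. Afterwards,
 * fanout[b] is the number of hashes whose first byte is <= b.
 */
static void build_fanout(const unsigned char *table, uint32_t count,
                         uint32_t fanout[256])
{
    uint32_t i;

    memset(fanout, 0, 256 * sizeof(uint32_t));
    for (i = 0; i < count; i++)
        fanout[table[(size_t)i * HASH_LEN]]++;
    for (i = 1; i < 256; i++)
        fanout[i] += fanout[i - 1];
}

/*
 * Hypothetical helper: find `key` in the sorted `table`. The fanout
 * entry for the first byte narrows the range, then a plain binary
 * search finishes the job. Returns 1 and sets *pos on a hit, 0 on a miss.
 */
static int lookup_hash(const unsigned char *key,
                       const uint32_t fanout[256],
                       const unsigned char *table,
                       uint32_t *pos)
{
    uint32_t lo = key[0] ? fanout[key[0] - 1] : 0;
    uint32_t hi = fanout[key[0]];

    assert(lo <= hi);
    while (lo < hi) {
        uint32_t mid = lo + (hi - lo) / 2;
        int cmp = memcmp(key, table + (size_t)mid * HASH_LEN, HASH_LEN);

        if (!cmp) {
            *pos = mid;
            return 1;
        }
        if (cmp < 0)
            hi = mid;
        else
            lo = mid + 1;
    }
    return 0;
}
```

The point is only that the lookup is a search over an immutable sorted
table, so any data keyed by hash could in principle share the approach.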
I talked with Johannes privately, and we found differences in our
understanding of the current multi-pack-index feature. Johannes thought
the feature was farther along than it is, specifically related to how
much we value the data in the multi-pack-index when adding objects to
pack-files or repacking. Some of this misunderstanding is due to how the
equivalent feature works in VSTS (where there is no IDX-file equivalent and
every object in the repo is tracked by a multi-pack-index).
I'd like to point out a few things about how the multi-pack-index works
now, and how we hope to extend it in the future.
Currently:
1. Objects are added to the multi-pack-index by adding a new set of
.idx/.pack file pairs. We scan the .idx file for the objects and offsets
to add.
2. We re-use the information in the multi-pack-index only to write the
new one without re-reading the .pack files that are already covered.
3. If a 'git repack' command deletes a pack-file, then we delete the
multi-pack-index. It must be regenerated by 'git multi-pack-index write'
later.
In the current world, the multi-pack-index is completely secondary to
the .idx files.
In the future, I hope these features exist in the multi-pack-index:
1. A stable object order. As objects are added to the multi-pack-index,
we assign a distinct integer value to each. As we add objects, those
integer values do not change. We can then pair the reachability bitmap
to the multi-pack-index instead of a specific pack-file (allowing repack
and bitmap computations to happen asynchronously). The data required to
store this object order is very similar to storing the bijection between
SHA-1 and SHA-256 hashes.
2. Incremental multi-pack-index: Currently, we have only one
multi-pack-index file per object directory. We can use a mechanism
similar to the split-index to keep a small number of multi-pack-index
files (at most 3, probably) such that the
'.git/objects/pack/multi-pack-index' file is small and easy to rewrite,
while it refers to larger '.git/objects/pack/*.midx' files that change
infrequently.
3. Multi-pack-index-aware repack: The repacker knows only enough about the
multi-pack-index to delete it. We could instead directly
manipulate the multi-pack-index during repack, and we could decide to do
more incremental repacks based on data stored in the multi-pack-index.
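As a rough illustration of the stable object order in item 1, the
sketch below assigns each hash a position the first time it is seen and
never changes it afterwards. Nothing here exists in Git: struct
stable_order, stable_order_add(), stable_order_find(), NO_POSITION and
HASH_LEN are hypothetical names, and the linear scan stands in for
whatever sorted index a real implementation would use.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define HASH_LEN 20            /* SHA-1 width; an assumption */
#define NO_POSITION UINT32_MAX /* "not found" / "allocation failed" */

/* Append-only list: the hash at index i has stable position i. */
struct stable_order {
    unsigned char *hashes;
    uint32_t nr, alloc;
};

/* Return the stable position of `hash`, or NO_POSITION if unknown. */
static uint32_t stable_order_find(const struct stable_order *so,
                                  const unsigned char *hash)
{
    uint32_t i;

    /* Linear scan for brevity only. */
    for (i = 0; i < so->nr; i++)
        if (!memcmp(so->hashes + (size_t)i * HASH_LEN, hash, HASH_LEN))
            return i;
    return NO_POSITION;
}

/*
 * Return the position of `hash`, appending it if it is new. Existing
 * positions are never renumbered, so data keyed by position (such as
 * a bitmap) stays valid as objects are added.
 */
static uint32_t stable_order_add(struct stable_order *so,
                                 const unsigned char *hash)
{
    uint32_t pos = stable_order_find(so, hash);

    if (pos != NO_POSITION)
        return pos;
    if (so->nr == so->alloc) {
        uint32_t alloc = so->alloc ? so->alloc * 2 : 16;
        unsigned char *p = realloc(so->hashes, (size_t)alloc * HASH_LEN);

        if (!p)
            return NO_POSITION;
        so->hashes = p;
        so->alloc = alloc;
    }
    memcpy(so->hashes + (size_t)so->nr * HASH_LEN, hash, HASH_LEN);
    return so->nr++;
}
```

Storing a second hash next to each position would give the kind of
SHA-1 <-> SHA-256 pairing mentioned above, which is why the two data
sets look so similar.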
In conclusion: please keep the multi-pack-index in mind as we implement
the transition plan. I'll continue building the feature as planned (the
next thing to do after the current series of cleanups is 'git
multi-pack-index verify') but am happy to look into other applications
as we need them.
Thanks,
-Stolee