Re: Questions about the hash function transition

On 8/29/2018 9:27 AM, Derrick Stolee wrote:
> On 8/29/2018 9:09 AM, Johannes Schindelin wrote:
>>
>> What I meant was to leverage the midx code, not the .midx files.
>>
>> My comment was motivated by my realizing that both the SHA-1 <-> SHA-256
>> mapping and the MIDX code have to look up (in a *fast* way) information
>> with hash values as keys. *And* this information is immutable. *And* the
>> amount of information should grow with new objects being added to the
>> database.
>
> I'm unsure what this means, as the multi-pack-index simply uses bsearch_hash() to find hashes in the list. The same method is used for IDX lookups.

I talked with Johannes privately, and we found differences in our understanding of the current multi-pack-index feature. Johannes thought the feature was farther along than it is, specifically in how much we rely on the data in the multi-pack-index when adding objects to pack-files or repacking. Some of this misunderstanding is due to how the equivalent feature works in VSTS, where there is no IDX-file equivalent and every object in the repo is tracked by a multi-pack-index.

I'd like to point out a few things about how the multi-pack-index works now, and how we hope to extend it in the future.

Currently:

1. Objects are added to the multi-pack-index by adding a new set of .idx/.pack file pairs. We scan the .idx file for the objects and offsets to add.

2. We re-use the information in the multi-pack-index only to write the new one without re-reading the .pack files that are already covered.

3. If a 'git repack' command deletes a pack-file, then we delete the multi-pack-index. It must be regenerated by 'git multi-pack-index write' later.

In the current world, the multi-pack-index is completely secondary to the .idx files.
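To make the current behavior concrete, here is a rough sketch in Python rather than Git's C. It is not the on-disk format or the real code: write_midx() and lookup() are names made up for illustration, the entries are plain (oid, pack, offset) tuples, and keeping the first pack seen for a duplicate object is an arbitrary assumption. Python's bisect stands in for the binary search that bsearch_hash() performs over the sorted hash list.

```python
from bisect import bisect_left

def write_midx(existing, new_idx_files):
    """Build a new sorted index from the old one plus new .idx contents.

    existing:      sorted list of (oid, pack_name, offset) tuples
    new_idx_files: dict mapping pack_name -> list of (oid, offset)
    Hypothetical layout, for illustration only.
    """
    # Re-use what the old index already knows instead of re-scanning packs.
    merged = {oid: (pack, off) for oid, pack, off in existing}
    for pack, entries in new_idx_files.items():
        for oid, off in entries:
            # Assumption: on duplicates, keep the copy we saw first.
            merged.setdefault(oid, (pack, off))
    return sorted((oid, pack, off) for oid, (pack, off) in merged.items())

def lookup(midx, oid):
    """Binary search the sorted list for oid; return (pack, offset) or None."""
    i = bisect_left(midx, (oid,))
    if i < len(midx) and midx[i][0] == oid:
        return midx[i][1], midx[i][2]
    return None

midx = write_midx([], {"pack-a": [("bb", 12), ("aa", 40)]})
midx = write_midx(midx, {"pack-b": [("cc", 7), ("aa", 99)]})
print(lookup(midx, "cc"))
```

Note that every write produces a whole new sorted list, which is why the position of an object in the list is not stable today.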

In the future, I hope these features exist in the multi-pack-index:

1. A stable object order. As objects are added to the multi-pack-index, we assign a distinct integer value to each. As we add objects, those integer values do not change. We can then pair the reachability bitmap to the multi-pack-index instead of a specific pack-file (allowing repack and bitmap computations to happen asynchronously). The data required to store this object order is very similar to storing the bijection between SHA-1 and SHA-256 hashes.

2. Incremental multi-pack-index: Currently, we have only one multi-pack-index file per object directory. We can use a mechanism similar to the split-index to keep a small number of multi-pack-index files (at most 3, probably) such that the '.git/objects/pack/multi-pack-index' file is small and easy to rewrite, while it refers to larger '.git/objects/pack/*.midx' files that change infrequently.

3. Multi-pack-index-aware repack: The repacker only knows about the multi-pack-index enough to delete it. We could instead directly manipulate the multi-pack-index during repack, and we could decide to do more incremental repacks based on data stored in the multi-pack-index.
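As a rough illustration of the stable object order in item 1, here is a small Python sketch. None of this exists in Git: StableOrder and reachable() are hypothetical names, and the bitmap is just a Python integer where bit i stands for the object at position i. The point is only that positions are handed out in arrival order and never move, so a bitmap computed earlier stays valid after more objects are added.

```python
class StableOrder:
    """Append-only numbering of object ids (hypothetical, for illustration)."""

    def __init__(self):
        self._position = {}  # oid -> stable integer
        self._oids = []      # stable integer -> oid

    def add(self, oid):
        """Return the position of oid, assigning the next integer if it is new."""
        if oid not in self._position:
            self._position[oid] = len(self._oids)
            self._oids.append(oid)
        return self._position[oid]

    def oid_at(self, position):
        return self._oids[position]

    def __len__(self):
        return len(self._oids)

def reachable(order, bitmap):
    """List the oids whose bit is set in bitmap."""
    return [order.oid_at(i) for i in range(len(order)) if (bitmap >> i) & 1]

order = StableOrder()
a = order.add("aa")
b = order.add("bb")
bitmap = (1 << a) | (1 << b)   # computed before the next object arrives
order.add("cc")                # later addition does not renumber "aa" or "bb"
print(reachable(order, bitmap))
```

A second column stored at the same positions (for example, the other hash of each object) would give a two-way mapping with the same append-only property, which is the similarity to the SHA-1 <-> SHA-256 table mentioned above.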

In conclusion: please keep the multi-pack-index in mind as we implement the transition plan. I'll continue building the feature as planned (the next thing to do after the current series of cleanups is 'git multi-pack-index verify') but am happy to look into other applications as we need it.

Thanks,

-Stolee



