ds/multi-pack-index (was Re: What's cooking in git.git (Jul 2018, #02; Wed, 18))

On 7/18/2018 6:03 PM, Junio C Hamano wrote:
* ds/multi-pack-index (2018-07-12) 23 commits
  - midx: clear midx on repack
  - packfile: skip loading index if in multi-pack-index
  - midx: prevent duplicate packfile loads
  - midx: use midx in approximate_object_count
  - midx: use existing midx when writing new one
  - midx: use midx in abbreviation calculations
  - midx: read objects from multi-pack-index
  - config: create core.multiPackIndex setting
  - midx: write object offsets
  - midx: write object id fanout chunk
  - midx: write object ids in a chunk
  - midx: sort and deduplicate objects from packfiles
  - midx: read pack names into array
  - multi-pack-index: write pack names in chunk
  - multi-pack-index: read packfile list
  - packfile: generalize pack directory list
  - t5319: expand test data
  - multi-pack-index: load into memory
  - midx: write header information to lockfile
  - multi-pack-index: add 'write' verb
  - multi-pack-index: add builtin
  - multi-pack-index: add format details
  - multi-pack-index: add design document

  When there are too many packfiles in a repository (which is not
  recommended), looking up an object in these would require
  consulting many pack .idx files; a new mechanism to have a single
  file that consolidates all of these .idx files is introduced.

  What's the doneness of this one?  I vaguely recall that there was
  an objection against the concept as a whole (i.e. there is a way
  with less damage to gain the same object-abbrev performance); has
  it (and if anything else, they) been resolved in satisfactory
  fashion?

I believe you're talking about Ævar's patch series [1] on unconditional abbreviation lengths. His series gets similar speedups by eliminating the abbreviation computation entirely in favor of a relative increase in the abbreviation length that is very likely to avoid collisions. While abbreviation speedups are the most dramatic measurable improvement from the multi-pack-index feature, they are not its only important benefit.
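
To make the cost being discussed concrete, here is a minimal Python sketch of an abbreviation computation. It is an illustration under stated assumptions, not Git's implementation: the function names and the `minimum` parameter are hypothetical, and each pack index is modeled as a sorted list of hex object ids. The point is only that with many packs the neighbor check is repeated once per pack, while one consolidated sorted list needs a single check.

```python
from bisect import bisect_left


def common_prefix_len(a, b):
    """Number of leading characters shared by two hex object ids."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def unique_abbrev_len(sorted_oids, oid, minimum=7):
    """Shortest prefix length of `oid` that no other id in the list shares.

    Only the two sorted neighbors can share the longest prefix, so one
    binary search plus two comparisons is enough. `minimum` is an
    assumed floor for this sketch.
    """
    i = bisect_left(sorted_oids, oid)
    longest = 0
    if i > 0:
        longest = max(longest, common_prefix_len(sorted_oids[i - 1], oid))
    # Skip over the id itself if it is present in this list.
    j = i + 1 if i < len(sorted_oids) and sorted_oids[i] == oid else i
    if j < len(sorted_oids):
        longest = max(longest, common_prefix_len(sorted_oids[j], oid))
    return max(minimum, longest + 1)


def abbrev_across_packs(pack_indexes, oid, minimum=7):
    """Without a consolidated index, every pack index must be consulted."""
    return max(unique_abbrev_len(oids, oid, minimum) for oids in pack_indexes)
```

Merging the per-pack lists into one sorted list and calling `unique_abbrev_len` once gives the same answer as `abbrev_across_packs`, with one binary search instead of one per pack.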

Lookup speeds improve in a multi-pack environment. While only the very largest repos have trouble repacking into a single pack, there are many scenarios where users disable auto-gc and do not repack frequently. On-premises build machines are the ones I know the most about: these machines run 24/7 to perform incremental fetches against a remote and kick off a build. Admins frequently turn off GC so the build times are not impacted. Eventually, their performance does degrade due to the number of packfiles. The answer we give them is to set up scheduled maintenance to repack. These users don't need the space savings of a repack; they just need consistent performance and high uptime. The multi-pack-index could assist here (as long as we set up auto-computing the multi-pack-index after a fetch).
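
The lookup difference can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the on-disk format or Git's code: each pack index is a sorted list of hex object ids, the function names are hypothetical, and the choice to keep the first pack containing a duplicated object is an assumption made only for this example.

```python
from bisect import bisect_left


def lookup_per_pack(pack_indexes, oid):
    """Probe each pack's sorted index in turn.

    Returns (pack_id, probes); the number of probes grows with the
    number of packs when the object is in a late pack or missing.
    """
    probes = 0
    for pack_id, oids in enumerate(pack_indexes):
        probes += 1
        i = bisect_left(oids, oid)
        if i < len(oids) and oids[i] == oid:
            return pack_id, probes
    return None, probes


def build_multi_index(pack_indexes):
    """Merge all per-pack indexes into one sorted, deduplicated list.

    Returns parallel lists (oids, pack_ids). For duplicates, the first
    pack that contains the object is kept (an assumption of this sketch).
    """
    first_pack = {}
    for pack_id, oids in enumerate(pack_indexes):
        for oid in oids:
            first_pack.setdefault(oid, pack_id)
    merged = sorted(first_pack)
    return merged, [first_pack[o] for o in merged]


def lookup_multi_index(multi_index, oid):
    """A single binary search, regardless of how many packs exist."""
    oids, pack_ids = multi_index
    i = bisect_left(oids, oid)
    if i < len(oids) and oids[i] == oid:
        return pack_ids[i]
    return None
```

For a build machine that accumulates one small pack per fetch, `lookup_per_pack` does up to one binary search per pack, while `lookup_multi_index` stays at one search as long as the consolidated list is rebuilt after new packs arrive.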

That's the best I can do to sell the feature as it stands now (plus the 'fsck' integration that would follow after this series is accepted).

I have mentioned the potential for the multi-pack-index to do the following:

* Store metadata about the packfiles, possibly replacing the .keep and .promisor files, and allowing other extensions to inform repack algorithms.

* Store a stable object order, allowing the reachability bitmap to be computed at a different cadence from repacking the packfiles.

I'm interested in these applications, but I will admit that they are not at the top of my priority list at the moment. Right now, I'm focused on reaching feature parity with the version of the MIDX we have in our GVFS fork of Git, and then extending the feature to have incremental multi-pack-index files to solve the "big write" problem.

Thanks,

-Stolee

[1] https://public-inbox.org/git/20180608224136.20220-1-avarab@xxxxxxxxx/T/#u

     [PATCH 00/20] unconditional O(1) SHA-1 abbreviation



