[PATCH v2 0/5] Improve abbreviation disambiguation

Derrick Stolee <dstolee@xxxxxxxxxxxxx> · Mon, 25 Sep 2017 05:54:47 -0400

Thanks for the feedback on my v1 patch. Thanks also to Jeff Hostetler
for helping me with this v2 patch, which includes an extra performance
improvement in commit 5.

Changes since last version:

* Added helper program test-list-objects to construct a list of
  existing object ids.

* test-abbrev now disambiguates a list of OIDs from stdin.

* p0008-abbrev.sh now has two tests:
    * 0008.1 tests 100,000 known OIDs
    * 0008.2 tests 100,000 missing OIDs

* Added a third performance improvement that uses binary search within
  packfiles to inspect at most two object ids per packfile.
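
To illustrate why two object ids per packfile are enough: in a sorted
list, the entries sharing the longest prefix with a given id sit
directly before and after its position. The following is only a sketch
under simplifying assumptions, not the code in patch 5. Object ids are
equal-length lowercase hex strings in a sorted in-memory array standing
in for a pack index, and the names `shared_hex_prefix`, `lower_bound`
and `pack_min_abbrev_len` are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Number of leading hex characters that a and b have in common. */
size_t shared_hex_prefix(const char *a, const char *b)
{
	size_t i = 0;

	while (a[i] && a[i] == b[i])
		i++;
	return i;
}

/* Index of the first entry that is not less than target. */
size_t lower_bound(const char *target, const char *const *sorted, size_t nr)
{
	size_t lo = 0, hi = nr;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (strcmp(sorted[mid], target) < 0)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo;
}

/*
 * Hypothetical stand-in for one packfile: after the binary search,
 * only the predecessor and the successor of target are compared.
 */
size_t pack_min_abbrev_len(const char *target, const char *const *sorted,
			   size_t nr, size_t min_len)
{
	size_t pos = lower_bound(target, sorted, nr);
	size_t next = pos;
	size_t len = min_len;
	size_t shared;

	/* If target itself is stored at pos, its successor is pos + 1. */
	if (pos < nr && !strcmp(sorted[pos], target))
		next = pos + 1;

	if (pos > 0) {
		shared = shared_hex_prefix(target, sorted[pos - 1]);
		if (shared >= len)
			len = shared + 1;
	}
	if (next < nr) {
		shared = shared_hex_prefix(target, sorted[next]);
		if (shared >= len)
			len = shared + 1;
	}
	return len;
}
```

With several packfiles, the overall answer would be the maximum of the
per-pack results, which also works for ids that are not present at all.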

Thanks,
 Stolee

---

When displaying object ids, we frequently want to see an abbreviation
for easier typing. That abbreviation must be unambiguous among all
object ids.

The current implementation of find_unique_abbrev() loops over
abbreviation lengths, checking each one for ambiguity until it finds one
that works. This causes multiple round-trips to the disk when starting
with the default abbreviation length (usually 7) but needing up to 12
characters for an unambiguous short-sha. For very large repos, this
effect is pronounced and causes issues with several commands, from the
obvious consumers `status` and `log` to less obvious commands such as
`fetch` and `push`.
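
The retry pattern described above can be modelled roughly as follows.
This is a sketch and not the code in sha1_name.c. It assumes object ids
are equal-length lowercase hex strings held in memory, and that each
call to the hypothetical `prefix_is_ambiguous` stands in for one full
lookup against the object store. The `lookups` counter exists only to
make the repeated work visible.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for one object-store lookup: does any id other
 * than target start with the first len characters of target?
 */
int prefix_is_ambiguous(const char *target, size_t len,
			const char *const *oids, size_t nr)
{
	size_t i;

	for (i = 0; i < nr; i++)
		if (strcmp(oids[i], target) && !strncmp(oids[i], target, len))
			return 1;
	return 0;
}

/* Try min_len, min_len + 1, ... and stop at the first unambiguous one. */
size_t abbrev_len_by_retry(const char *target, const char *const *oids,
			   size_t nr, size_t min_len, size_t *lookups)
{
	size_t full = strlen(target);
	size_t len = min_len;

	*lookups = 0;
	while (len < full) {
		(*lookups)++;
		if (!prefix_is_ambiguous(target, len, oids, nr))
			break;
		len++;
	}
	return len;
}
```

An id that needs 12 characters with a starting length of 7 costs six
lookups in this model, one per length tried.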

This series improves performance by iterating over objects matching the
short abbreviation only once, inspecting each object id, and reporting
the minimum length of an unambiguous abbreviation.
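
As a rough sketch of the single-pass idea (not the actual change to
find_unique_abbrev_r()): for each candidate, record how many leading
characters it shares with the target, and report one more than the
longest match. The sketch assumes object ids are equal-length lowercase
hex strings in a plain array; `common_prefix_len` and
`min_unique_abbrev_len` are hypothetical names.

```c
#include <stddef.h>
#include <string.h>

/* Number of leading hex characters that a and b have in common. */
size_t common_prefix_len(const char *a, const char *b)
{
	size_t i = 0;

	while (a[i] && a[i] == b[i])
		i++;
	return i;
}

/*
 * One pass over the candidates: the result is min_len, or one more
 * than the longest prefix shared with any other id, whichever is
 * larger.
 */
size_t min_unique_abbrev_len(const char *target, const char *const *oids,
			     size_t nr, size_t min_len)
{
	size_t len = min_len;
	size_t i;

	for (i = 0; i < nr; i++) {
		size_t shared;

		if (!strcmp(oids[i], target))
			continue; /* the object itself never conflicts */
		shared = common_prefix_len(target, oids[i]);
		if (shared >= len)
			len = shared + 1;
	}
	return len;
}
```

Each candidate is inspected exactly once, regardless of how long the
final abbreviation turns out to be.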

A performance helper `test-abbrev` and a performance test
`p0008-abbrev.sh` are added to demonstrate performance improvements in
this area. A second helper program, `test-list-objects`, outputs a
sampling of object ids, which we shuffle using `sort -R` before using
them as input to the performance test.

I include performance test numbers in the commit messages for each
change, but I also include the difference between the baseline and the
final change here:

p0008.1: find_unique_abbrev() for existing objects
--------------------------------------------------

For 10 repeated tests, each checking 100,000 known objects, we find the
following results when running in a Linux VM:

|       | Pack  | Packed  | Loose  | Base   | New    |        |
| Repo  | Files | Objects | Objects| Time   | Time   | Rel%   |
|-------|-------|---------|--------|--------|--------|--------|
| Git   |     1 |  230078 |      0 | 0.12 s | 0.05 s | -58.3% |
| Git   |     5 |  230162 |      0 | 0.25 s | 0.15 s | -40.0% |
| Git   |     4 |  154310 |  75852 | 0.18 s | 0.11 s | -38.9% |
| Linux |     1 | 5606645 |      0 | 0.32 s | 0.10 s | -68.8% |
| Linux |    24 | 5606645 |      0 | 2.77 s | 2.00 s | -27.8% |
| Linux |    23 | 5283204 | 323441 | 2.86 s | 1.62 s | -43.4% |
| VSTS  |     1 | 4355923 |      0 | 0.27 s | 0.09 s | -66.7% |
| VSTS  |    32 | 4355923 |      0 | 2.41 s | 2.01 s | -16.6% |
| VSTS  |    31 | 4276829 |  79094 | 4.22 s | 3.02 s | -28.4% |

For the Windows repo running in Windows Subsystem for Linux:

    Pack Files: 50
Packed Objects: 22,385,898
 Loose Objects: 492
     Base Time: 5.69 s
      New Time: 4.09 s
         Rel %: -28.1%

p0008.2: find_unique_abbrev() for missing objects
-------------------------------------------------

For 10 repeated tests, each checking 100,000 missing objects, we find
the following results when running in a Linux VM:

|       | Pack  | Packed  | Loose  | Base   | New    |        |
| Repo  | Files | Objects | Objects| Time   | Time   | Rel%   |
|-------|-------|---------|--------|--------|--------|--------|
| Git   |     1 |  230078 |      0 | 0.61 s | 0.04 s | -93.4% |
| Git   |     5 |  230162 |      0 | 1.30 s | 0.15 s | -88.5% |
| Git   |     4 |  154310 |  75852 | 1.07 s | 0.12 s | -88.8% |
| Linux |     1 | 5606645 |      0 | 0.65 s | 0.05 s | -92.3% |
| Linux |    24 | 5606645 |      0 | 7.12 s | 1.28 s | -82.0% |
| Linux |    23 | 5283204 | 323441 | 6.33 s | 0.96 s | -84.8% |
| VSTS  |     1 | 4355923 |      0 | 0.64 s | 0.05 s | -92.2% |
| VSTS  |    32 | 4355923 |      0 | 7.77 s | 1.36 s | -82.5% |
| VSTS  |    31 | 4276829 |  79094 | 7.76 s | 1.36 s | -82.5% |

For the Windows repo running in Windows Subsystem for Linux:

    Pack Files: 50
Packed Objects: 22,385,898
 Loose Objects: 492
     Base Time: 38.9 s
      New Time:  2.7 s
         Rel %: -93.1%

Derrick Stolee (5):
  test-list-objects: List a subset of object ids
  p0008-abbrev.sh: Test find_unique_abbrev() perf
  sha1_name: Unroll len loop in find_unique_abbrev_r
  sha1_name: Parse less while finding common prefix
  sha1_name: Minimize OID comparisons during disambiguation

 Makefile                     |   2 +
 sha1_name.c                  | 128 ++++++++++++++++++++++++++++++++++++++-----
 t/helper/.gitignore          |   2 +
 t/helper/test-abbrev.c       |  19 +++++++
 t/helper/test-list-objects.c |  85 ++++++++++++++++++++++++++++
 t/perf/p0008-abbrev.sh       |  22 ++++++++
 6 files changed, 243 insertions(+), 15 deletions(-)
 create mode 100644 t/helper/test-abbrev.c
 create mode 100644 t/helper/test-list-objects.c
 create mode 100755 t/perf/p0008-abbrev.sh

-- 
2.14.1.538.g56ec8fc98.dirty