[PATCH v2 0/5] Optimization batch 13: partial clone optimizations for merge-ort

This series optimizes blob downloading in merges for partial clones. It
applies on top of master and is independent of ort-perf-batch-12.

Changes since v1:

 * Incorporated the suggestions from Stolee on patch 2.

This series has a minor conflict with jt/partial-clone-submodule-1. I asked
about this previously and it was decided to just submit these topics
independently[1]. The conflict is that both topics add a "repo" argument to
fetch_objects(), but jt/partial-clone-submodule-1 also makes additional
nearby changes.

[1]
https://lore.kernel.org/git/20210609045804.2329079-1-jonathantanmy@xxxxxxxxxx/

=== Basic optimization idea ===

merge-ort was designed to minimize the computation needed to complete a
merge, and much of that (particularly the "irrelevant rename"
determinations) also dramatically reduced the amount of data needed for the
merge. Reducing the amount of data needed to do computations ought to
benefit partial clones as well by enabling them to download less
information. However, my previous series didn't modify the prefetch()
function in diffcore-rename to take advantage of these reduced data
requirements. This series changes that.

Further, although diffcore-rename batched downloads of objects for rename
detection, the merge machinery did not do the same for three-way content
merges of files. This series adds batch downloading of objects within
merge-ort to correct that.

=== Modified performance measurement method ===

The testcases I've been using so far to measure performance were not run in
a partial clone, so they aren't directly usable for comparison. Further,
partial clone performance depends on network speed, which can be highly
variable. So I want to modify one of the existing testcases slightly and
focus on two different but more stable metrics:

 1. Number of git fetch operations during rebase
 2. Number of objects fetched during rebase

The first of these should already be decent due to Jonathan Tan's work to
batch fetching of missing blobs during rename detection (see commit
7fbbcb21b1 ("diff: batch fetching of missing blobs", 2019-04-05)), so we are
mostly looking to optimize the second but would like to also decrease the
first if possible.

The testcase we will look at will be a modification of the mega-renames
testcase from commit 557ac0350d ("merge-ort: begin performance work;
instrument with trace2_region_* calls", 2020-10-28). In particular, we
change

$ git clone \
    git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git


to

$ git clone --sparse --filter=blob:none \
    https://github.com/github/linux


(The change in clone URL is just to get a server that supports the filter
predicate.)

We otherwise keep the test the same (in particular, we do not add any calls
to "git sparse-checkout {set,add}", which means that the resulting
repository will only have 7 total blobs from files in the toplevel directory
before starting the rebase).
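
For anyone wanting to collect the object-count metric by hand, here is a
minimal sketch built on the same grep/cut/tr pipeline that t6421 applies to
a GIT_TRACE2_PERF file. The helper name summarize_fetches is made up for
illustration and is not part of the series. It assumes the fetch_count data
events sit in the ninth "|"-separated field, as they do in that test, and it
reports how many fetch_count events were logged plus the sum of their values.

```shell
# Summarize the fetch_count data events found in a trace2 perf file.
# Illustrative helper only; assumes field 9 holds "....fetch_count:N".
summarize_fetches () {
	grep fetch_count "$1" |
	cut -d "|" -f 9 |
	tr -d " ." |
	awk -F: '
		# Each remaining line looks like "fetch_count:N".
		{ batches++; objects += $2 }
		END { printf "batches=%d objects=%d\n", batches, objects }
	'
}
```

After running the rebase with GIT_TRACE2_PERF="$(pwd)/trace.output" set,
calling "summarize_fetches trace.output" should print a single line of the
form batches=N objects=M.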

=== Results ===

For the mega-renames testcase noted above (which rebases 35 commits across
an upstream with ~26K renames in a partial clone), I found the following
results for our metrics of interest:

     Number of `git fetch` ops during rebase

                     Before Series   After Series
merge-recursive:          62              63
merge-ort:                30              20


     Number of objects fetched during rebase

                     Before Series   After Series
merge-recursive:         11423          11423
merge-ort:               11391             63


So, we have a significant reduction (factor of ~3 relative to
merge-recursive) in the number of git fetch operations that have to be
performed in a partial clone to complete the rebase, and a dramatic
reduction (factor of ~180) in the number of objects that need to be fetched.

=== Summary ===

It's worth pointing out that merge-ort after the series needs only ~1.8
blobs per commit being transplanted to complete this particular rebase.
Essentially, this reinforces the fact that the optimization work so far has
taken rename detection from often being an overwhelmingly costly portion of
a merge (leading many to just capitulate on it), to what I have observed in
my experience so far as being just a minor cost for merges.
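
The ratios quoted above can be rechecked from the two tables with a little
arithmetic. The sketch below uses a hypothetical helper, ratio, that simply
divides two numbers with awk; it is only a sanity check and not part of the
series.

```shell
# ratio A B: print A/B to two decimal places (illustrative helper).
ratio () {
	awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

# git fetch operations after the series: merge-recursive vs merge-ort
ratio 63 20
# objects fetched after the series: merge-recursive vs merge-ort
ratio 11423 63
# objects fetched by merge-ort after the series, per commit rebased (35)
ratio 63 35
```

These work out to roughly 3.15, 181.32 and 1.80, matching the ~3, ~180 and
~1.8 figures given in the text.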

Elijah Newren (5):
  promisor-remote: output trace2 statistics for number of objects
    fetched
  t6421: add tests checking for excessive object downloads during merge
  diffcore-rename: allow different missing_object_cb functions
  diffcore-rename: use a different prefetch for basename comparisons
  merge-ort: add prefetching for content merges

 diffcore-rename.c              | 149 ++++++++---
 merge-ort.c                    |  50 ++++
 promisor-remote.c              |   7 +-
 t/t6421-merge-partial-clone.sh | 439 +++++++++++++++++++++++++++++++++
 4 files changed, 611 insertions(+), 34 deletions(-)
 create mode 100755 t/t6421-merge-partial-clone.sh


base-commit: 6de569e6ac492213e81321ca35f1f1b365ba31e3
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-969%2Fnewren%2Fort-perf-batch-13-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-969/newren/ort-perf-batch-13-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/969

Range-diff vs v1:

 1:  ad9b451d6823 = 1:  04f5ebdabe14 promisor-remote: output trace2 statistics for number of objects fetched
 2:  6462bb63310d ! 2:  0f786cfb4c95 t6421: add tests checking for excessive object downloads during merge
     @@ t/t6421-merge-partial-clone.sh (new)
      +
      +		git checkout -q origin/A &&
      +
     -+		GIT_TRACE2_PERF="$(pwd)/trace.output" git -c merge.directoryRenames=true merge --no-stat --no-progress origin/B-single &&
     ++		GIT_TRACE2_PERF="$(pwd)/trace.output" git \
     ++			-c merge.directoryRenames=true merge --no-stat \
     ++			--no-progress origin/B-single &&
      +
      +		# Check the number of objects we reported we would fetch
      +		cat >expect <<-EOF &&
     -+		 ..........fetch_count:2
     -+		 ......fetch_count:1
     ++		fetch_count:2
     ++		fetch_count:1
      +		EOF
     -+		grep fetch_count trace.output | cut -d "|" -f 9 >actual &&
     ++		grep fetch_count trace.output | cut -d "|" -f 9 | tr -d " ." >actual &&
      +		test_cmp expect actual &&
      +
      +		# Check the number of fetch commands exec-ed
     @@ t/t6421-merge-partial-clone.sh (new)
      +
      +		git checkout -q origin/A &&
      +
     -+		GIT_TRACE2_PERF="$(pwd)/trace.output" git -c merge.directoryRenames=true merge --no-stat --no-progress origin/B-dir &&
     ++		GIT_TRACE2_PERF="$(pwd)/trace.output" git \
     ++			-c merge.directoryRenames=true merge --no-stat \
     ++			--no-progress origin/B-dir &&
      +
      +		# Check the number of objects we reported we would fetch
      +		cat >expect <<-EOF &&
     -+		 ..........fetch_count:6
     ++		fetch_count:6
      +		EOF
     -+		grep fetch_count trace.output | cut -d "|" -f 9 >actual &&
     ++		grep fetch_count trace.output | cut -d "|" -f 9 | tr -d " ." >actual &&
      +		test_cmp expect actual &&
      +
      +		# Check the number of fetch commands exec-ed
     @@ t/t6421-merge-partial-clone.sh (new)
      +
      +		git checkout -q origin/A &&
      +
     -+		GIT_TRACE2_PERF="$(pwd)/trace.output" git -c merge.directoryRenames=true merge --no-stat --no-progress origin/B-many &&
     ++		GIT_TRACE2_PERF="$(pwd)/trace.output" git \
     ++			-c merge.directoryRenames=true merge --no-stat \
     ++			--no-progress origin/B-many &&
      +
      +		# Check the number of objects we reported we would fetch
      +		cat >expect <<-EOF &&
     -+		 ..........fetch_count:12
     -+		 ..........fetch_count:5
     -+		 ..........fetch_count:3
     -+		 ......fetch_count:2
     ++		fetch_count:12
     ++		fetch_count:5
     ++		fetch_count:3
     ++		fetch_count:2
      +		EOF
     -+		grep fetch_count trace.output | cut -d "|" -f 9 >actual &&
     ++		grep fetch_count trace.output | cut -d "|" -f 9 | tr -d " ." >actual &&
      +		test_cmp expect actual &&
      +
      +		# Check the number of fetch commands exec-ed
 3:  c4b3109c3b08 = 3:  9f2a8ed8d61f diffcore-rename: allow different missing_object_cb functions
 4:  f4ade3996d3f = 4:  f753f8035564 diffcore-rename: use a different prefetch for basename comparisons
 5:  ca3b2a743b8e = 5:  317bcc7f56cb merge-ort: add prefetching for content merges

-- 
gitgitgadget


