[PATCH 00/12] Integrating line-log and changed-path Bloom filters

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



The best way to learn how a feature works is to build on it. That's also the
best way to find defects.

I started on this patch series with two goals:

 1. Understand and submit Szeder's incremental line-log RFC [1].
 2. Integrate changed-path Bloom filters with line-log.

[1] https://lore.kernel.org/git/20190818182801.7580-1-szeder.dev@xxxxxxxxx/

Since I knew a lot about the Bloom filters and not much about line-log.c, I
expected to learn a lot that could contribute to improving Szeder's patches.
Instead, I discovered some deficiencies in the Bloom filter code, including
a few meaningful bugs that should be fixed before the topic is merge to
master.

This series is organized as follows:

 * Patches 1-4 are minor Bloom filter cleanups, including a documentation
   bug.
   
   
 * Patch 5 fixes a problem where the Bloom filters use much larger filters
   than expected.
   
   
 * Patch 6 fixes a bug related to the short-circuit of large diffs from
   e369698016 (diff: halt tree-diff early after max_changes, 2020-03-30).
   The condition for halting the diff computation is different than the
   check in bloom.c to see if that short-circuit was hit, which leads to
   some commits with large diffs getting an incomplete Bloom filter. This
   happened on the root commit of the Linux kernel repo, for example.
   
   
 * Patches 7-11 are Szeder's RFC, with my sign-off added. I have not changed
   the code from those submissions because I didn't find anything wrong with
   them as I was testing. However, I will take responsibility to respond to
   the feedback in this series.
   
   
 * Patch 12 integrates Bloom filters with the line-log code.
   
   

I organized these patches so patches 1-6 could be split into its own topic,
if beneficial to take earlier than the line-log changes.

The end result of this combined effort is the following: git log -L commands
feel much more responsive to a terminal user because Szeder's changes make
the computation incremental, and they have better end-to-end performance
because of the integration with Bloom filters to reduce time spent running
tree diffs for TREESAME commits.

The following table is also in patch 12, but it summarizes the results of
this series. These are timings for running git log -L 1,10:$path for these
recently-edited paths. The "Entire History" columns wait for the full
command to complete. The "First Result" columns add "-n 1" to the command,
so we see the benefits of the incremental algorithm. Thus, the performance
improvements from "Entire History" to "First Result" are due entirely to
Szeder's patches. The performance improvements from "Before" to "After" are
due to the changed-path Bloom filters.

|                              | Entire History  | First Result    |
| Path                         | Before | After  | Before | After  |
|------------------------------|--------|--------|--------|--------|
| MAINTAINERS                  | 4.26 s | 3.87 s | 0.41 s | 0.39 s |
| fs/namei.c                   | 1.99 s | 0.99 s | 0.42 s | 0.21 s |
| arch/x86/kvm/cpuid.c         | 5.28 s | 1.12 s | 0.16 s | 0.09 s |
| fs/io_uring.c                | 4.34 s | 0.99 s | 0.94 s | 0.27 s |
| arch/x86/kvm/vmx/vmx.c       | 5.01 s | 1.34 s | 0.21 s | 0.12 s |
| arch/x86/kvm/x86.c           | 2.24 s | 1.18 s | 0.21 s | 0.14 s |
| fs/btrfs/disk-io.c           | 1.82 s | 1.01 s | 0.06 s | 0.05 s |
| Documentation/scsi/index.rst | 3.30 s | 0.89 s | 1.46 s | 0.03 s |

Some splashy numbers include a 110x speedup for finding the first result for
Documentation/scsi/index.rst. That's certainly an outlier, but the rest have
an at least 10x speedup.

Thanks, -Stolee

Derrick Stolee (7):
  bloom: fix whitespace around tab length
  test-bloom: fix usage typo
  Documentation: changed-path Bloom filters use byte words
  bloom: de-duplicate directory entries
  bloom: parse commit before computing filters
  bloom: use num_changes not nr for limit detection
  line-log: integrate with changed-path Bloom filters

SZEDER Gábor (5):
  completion: offer '--(no-)patch' among 'git log' options
  line-log: remove unused fields from 'struct line_log_data'
  t4211-line-log: add tests for parent oids
  line-log: more responsive, incremental 'git log -L'
  line-log: try to use generation number-based topo-ordering

 .../technical/commit-graph-format.txt         |  8 +--
 bloom.c                                       | 61 ++++++++++++-----
 bloom.h                                       |  5 +-
 contrib/completion/git-completion.bash        |  1 +
 line-log.c                                    | 43 +++++++++++-
 line-log.h                                    |  5 +-
 revision.c                                    | 43 ++++++++++--
 t/helper/test-bloom.c                         |  2 +-
 t/t0095-bloom.sh                              |  6 +-
 t/t4211-line-log.sh                           | 68 +++++++++++++++++++
 10 files changed, 201 insertions(+), 41 deletions(-)


base-commit: 1b4c57fa87e121f155863f898dc39d06cf4a1d99
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-622%2Fderrickstolee%2Flog-l-on-bloom-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-622/derrickstolee/log-l-on-bloom-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/622
-- 
gitgitgadget



[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux