The best way to learn how a feature works is to build on it. That's also the best way to find defects. I started on this patch series with two goals: 1. Understand and submit Szeder's incremental line-log RFC [1]. 2. Integrate changed-path Bloom filters with line-log. [1] https://lore.kernel.org/git/20190818182801.7580-1-szeder.dev@xxxxxxxxx/ Since I knew a lot about the Bloom filters and not much about line-log.c, I expected to learn a lot that could contribute to improving Szeder's patches. Instead, I discovered some deficiencies in the Bloom filter code, including a few meaningful bugs that should be fixed before the topic is merge to master. This series is organized as follows: * Patches 1-4 are minor Bloom filter cleanups, including a documentation bug. * Patch 5 fixes a problem where the Bloom filters use much larger filters than expected. * Patch 6 fixes a bug related to the short-circuit of large diffs from e369698016 (diff: halt tree-diff early after max_changes, 2020-03-30). The condition for halting the diff computation is different than the check in bloom.c to see if that short-circuit was hit, which leads to some commits with large diffs getting an incomplete Bloom filter. This happened on the root commit of the Linux kernel repo, for example. * Patches 7-11 are Szeder's RFC, with my sign-off added. I have not changed the code from those submissions because I didn't find anything wrong with them as I was testing. However, I will take responsibility to respond to the feedback in this series. * Patch 12 integrates Bloom filters with the line-log code. I organized these patches so patches 1-6 could be split into its own topic, if beneficial to take earlier than the line-log changes. The end result of this combined effort is the following: git log -L commands feel much more responsive to a terminal user because Szeder's changes make the computation incremental, and they have better end-to-end performance because of the integration with Bloom filters to reduce time spent running tree diffs for TREESAME commits. The following table is also in patch 12, but it summarizes the results of this series. These are timings for running git log -L 1,10:$path for these recently-edited paths. The "Entire History" columns wait for the full command to complete. The "First Result" columns add "-n 1" to the command, so we see the benefits of the incremental algorithm. Thus, the performance improvements from "Entire History" to "First Result" are due entirely to Szeder's patches. The performance improvements from "Before" to "After" are due to the changed-path Bloom filters. | | Entire History | First Result | | Path | Before | After | Before | After | |------------------------------|--------|--------|--------|--------| | MAINTAINERS | 4.26 s | 3.87 s | 0.41 s | 0.39 s | | fs/namei.c | 1.99 s | 0.99 s | 0.42 s | 0.21 s | | arch/x86/kvm/cpuid.c | 5.28 s | 1.12 s | 0.16 s | 0.09 s | | fs/io_uring.c | 4.34 s | 0.99 s | 0.94 s | 0.27 s | | arch/x86/kvm/vmx/vmx.c | 5.01 s | 1.34 s | 0.21 s | 0.12 s | | arch/x86/kvm/x86.c | 2.24 s | 1.18 s | 0.21 s | 0.14 s | | fs/btrfs/disk-io.c | 1.82 s | 1.01 s | 0.06 s | 0.05 s | | Documentation/scsi/index.rst | 3.30 s | 0.89 s | 1.46 s | 0.03 s | Some splashy numbers include a 110x speedup for finding the first result for Documentation/scsi/index.rst. That's certainly an outlier, but the rest have an at least 10x speedup. Thanks, -Stolee Derrick Stolee (7): bloom: fix whitespace around tab length test-bloom: fix usage typo Documentation: changed-path Bloom filters use byte words bloom: de-duplicate directory entries bloom: parse commit before computing filters bloom: use num_changes not nr for limit detection line-log: integrate with changed-path Bloom filters SZEDER Gábor (5): completion: offer '--(no-)patch' among 'git log' options line-log: remove unused fields from 'struct line_log_data' t4211-line-log: add tests for parent oids line-log: more responsive, incremental 'git log -L' line-log: try to use generation number-based topo-ordering .../technical/commit-graph-format.txt | 8 +-- bloom.c | 61 ++++++++++++----- bloom.h | 5 +- contrib/completion/git-completion.bash | 1 + line-log.c | 43 +++++++++++- line-log.h | 5 +- revision.c | 43 ++++++++++-- t/helper/test-bloom.c | 2 +- t/t0095-bloom.sh | 6 +- t/t4211-line-log.sh | 68 +++++++++++++++++++ 10 files changed, 201 insertions(+), 41 deletions(-) base-commit: 1b4c57fa87e121f155863f898dc39d06cf4a1d99 Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-622%2Fderrickstolee%2Flog-l-on-bloom-v1 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-622/derrickstolee/log-l-on-bloom-v1 Pull-Request: https://github.com/gitgitgadget/git/pull/622 -- gitgitgadget