"Garima Singh via GitGitGadget" <gitgitgadget@xxxxxxxxx> writes: > Hey! > > The commit graph feature brought in a lot of performance improvements across > multiple commands. However, file based history continues to be a performance > pain point, especially in large repositories. > > Adopting changed path bloom filters has been discussed on the list before, > and a prototype version was worked on by SZEDER Gábor, Jonathan Tan and Dr. > Derrick Stolee [1]. This series is based on Dr. Stolee's approach [2] and > presents an updated and more polished RFC version of the feature. It is nice to have this picked up for upstream, finally. The proof of concept works[1][2] were started more than a year ago. On the other hand slow and steady adoption of commit-graph serialization and then extending it (generation numbers, topological sort, incremental update) feels like a good approach. > Performance Gains: We tested the performance of 'git log -- <path>' on the git > repo, the linux repo and some internal large repos, with a variety of paths > of varying depths. > > On the git and linux repos: We observed a 2x to 5x speed up. > > On a large internal repo with files seated 6-10 levels deep in the tree: We > observed 10x to 20x speed ups, with some paths going up to 28 times faster. Could you provide some more statistics about this internal repository, such as number of files, number of commits, perhaps also number of all objects? Thanks in advance. I wonder why such large difference in performance 2-5x vs 10-20x. Is it about the depth of the file hierarchy? How would the numbers look for files seated closer to the root in the same large repository, like 3-5 levels deep in the tree? > Future Work (not included in the scope of this series): > > 1. Supporting multiple path based revision walk I wonder if it would ever be possible to support globbing, e.g. '*.c' > 2. Adopting it in git blame logic. What about 'git log --follow <path>'? > 3. Interactions with line log git log -L > > This series is intended to start the conversation and many of the commit > messages include specific call outs for suggestions and thoughts. > > Cheers! Garima Singh > > [1] https://lore.kernel.org/git/20181009193445.21908-1-szeder.dev@xxxxxxxxx/ > [2] https://lore.kernel.org/git/61559c5b-546e-d61b-d2e1-68de692f5972@xxxxxxxxx/ > > Garima Singh (9): > commit-graph: add --changed-paths option to write This summary is not easy to understand on first glance. Maybe: commit-graph: add --changed-paths option to the write subcommand or commit-graph: add --changed-paths option to 'git commit-graph write' would be better? > commit-graph: write changed paths bloom filters > commit-graph: use MAX_NUM_CHUNKS > commit-graph: document bloom filter format > commit-graph: write changed path bloom filters to commit-graph file. > commit-graph: test commit-graph write --changed-paths > commit-graph: reuse existing bloom filters during write. > revision.c: use bloom filters to speed up path based revision walks > commit-graph: add GIT_TEST_COMMIT_GRAPH_BLOOM_FILTERS test flag > > Documentation/git-commit-graph.txt | 5 + > .../technical/commit-graph-format.txt | 17 ++ > Makefile | 1 + > bloom.c | 257 +++++++++++++++++ > bloom.h | 51 ++++ > builtin/commit-graph.c | 9 +- > ci/run-build-and-tests.sh | 1 + > commit-graph.c | 116 +++++++- > commit-graph.h | 9 +- > revision.c | 67 ++++- > revision.h | 5 + > t/README | 3 + > t/helper/test-read-graph.c | 4 + > t/t4216-log-bloom.sh | 77 ++++++ > t/t5318-commit-graph.sh | 2 + > t/t5324-split-commit-graph.sh | 1 + > t/t5325-commit-graph-bloom.sh | 258 ++++++++++++++++++ > 17 files changed, 875 insertions(+), 8 deletions(-) > create mode 100644 bloom.c > create mode 100644 bloom.h > create mode 100755 t/t4216-log-bloom.sh > create mode 100755 t/t5325-commit-graph-bloom.sh > > > base-commit: b02fd2accad4d48078671adf38fe5b5976d77304 > Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-497%2Fgarimasi514%2FcoreGit-bloomFilters-v1 > Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-497/garimasi514/coreGit-bloomFilters-v1 > Pull-Request: https://github.com/gitgitgadget/git/pull/497