[PoC PATCH 00/34] An alternative modified path Bloom filters implementation

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Sigh...  but better late than never, right?

I experimented quite a bit with modified path Bloom filters a year and
more ago, and got quite far...  but my disappointment in the
inadequacy of all double hashing schemes, the arrival of split
commit-graphs, and, well, life in general has put the whole thing on
the back burner, and I haven't touched it for a couple of releases.

Now I finally managed to take a closer look at the current changed
paths Bloom filters implementation, and saw that it has some of the
same issues that I had stumbled upon and that it missed some
optimization opportunities.  Unfortunately, fixing those issues and
performing those optimizations do require a thorough format change.

So here is my proof of concept version, in all its incompleteness,
with the following benefits:

  - Better understanding of the problem it tries to optimize.
  - Better understanding of the issues with many small Bloom filters.
  - Better hashing scheme (though it should be better still).
  - Orders of magnitude lower average false positive rate.
  - Consequently, faster pathspec-limited revision walks.
  - Faster processing of the tree-diff output and lower memory usage
    while computing Bloom filters (from scratch...).
  - Optional support for storing Bloom filters for all parents of
    merge commits.
  - Deduplicates Bloom filters.
  - Supports multiple pathspecs right from the start.
  - Supports some wildcards in pathspecs.
  - Handles as many commits as the commit-graph format can.
  - It has the right name :)  The diff machinery and all its frontends
    report "modified" paths with the letter 'M', not "changed".
  - More cleanups, more bugfixes.
  - Consistent output with and without modified path Bloom filters for
    over 80k random paths in 16 repositories, even with submodules in
    them.  Well, at least on my machine, if nowhere else...

Alas, the drawbacks are significant:

  - No tests whatsoever.
  - Computes all modified path Bloom filters from scratch when
    writing, no matter what.
  - Doesn't work with split commit-graphs.
  - Basically if anything works besides 'git commit-graph write
    --reachable' it's a miracle.
  - Not a single test.
  - Many BUG()s, which should rather be graceful errors...  though I
    have to admit that at this point they are indeed bugs.
  - Many TODOs, both in commit messages and code, some incomplete
    commit messages, crappy subject lines, even missing signoffs.
  - Some ridiculously long variable, function, macro and config
    variable names.
  - It's based on v2.25.0 (no technical reason, but that's the version
    I used to run the baseline benchmarks the last time, which takes
    days...)
  - I'm pretty sure that there are more...
  - Oh, did I mention that there are no tests?


The first 14 patches are preparatory fixes and cleanups:

  01/34 tree-walk.c: don't match submodule entries for 'submod/anything'

This fix or something similar is necessary to have consistent output
with and without modified path Bloom filters for paths crossing
submodule boundary.

  02/34 commit-graph: fix parsing the Chunk Lookup table

The minimal (though not the best) fix for a bug which, I think, is as
old as the commit-graph.  I don't know how to test this.

  03/34 commit-graph-format.txt: all multi-byte numbers are in network byte order
  04/34 commit-slab: add a function to deep free entries on the slab
  05/34 diff.h: drop diff_tree_oid() & friends' return value
  06/34 commit-graph: clean up #includes

A couple of minor cleanups.

  07/34 commit-graph: simplify parse_commit_graph() #1
  08/34 commit-graph: simplify parse_commit_graph() #2

These two would be the right, though not minimal fix for the parsing
bug above.

  09/34 commit-graph: simplify write_commit_graph_file() #1
  10/34 commit-graph: simplify write_commit_graph_file() #2
  11/34 commit-graph: allocate the 'struct chunk_info' array dinamically

I think these three cleanup patches are a better alternative of
3be7efcafc (commit-graph: define and use MAX_NUM_CHUNKS, 2020-03-30),
because...

  12/34 commit-graph: unify the signatures of all write_graph_chunk_*() functions
  13/34 commit-graph: simplify write_commit_graph_file() #3
  14/34 commit-graph: check chunk sizes after writing

... they laid the ground work for this patch.

  15/34 commit-graph-format.txt: document the modified path Bloom filter chunks

This is the most important one, specifying and _justifying_ the new
chunk formats.

Do grab a cup or pint of your favourite beverage and get comfy before
reading this one.  You have been warned.

  16/34 Add a generic and minimal Bloom filter implementation
  17/34 Import a streaming-capable Murmur3 hash function implementation
  18/34 commit-graph: write "empty" Modified Path Bloom Filter Index chunk
  19/34 commit-graph: add commit slab for modified path Bloom filters
  20/34 commit-graph: fill the Modified Path Bloom Filter Index chunk

This shows a more efficient approach to process the tree-diff output
into Bloom filters.

  21/34 commit-graph: load and use the Modified Path Bloom Filter Index chunk
  22/34 commit-graph: write the Modified Path Bloom Filters chunk
  23/34 commit-graph: load and use the Modified Path Bloom Filters chunk
  24/34 commit-graph: check all leading directories in modified path Bloom filters

This was a good lightbulb moment.  It is essential to try to maintain
reasonable performance in repositories where the vast majority of
changes are concentrated to a single directory.

  25/34 commit-graph: check embedded modified path Bloom filters with a mask
  26/34 commit-graph: deduplicate modified path Bloom filters
  27/34 commit-graph: load modified path Bloom filters for merge commits
  28/34 commit-graph: write Modified Path Bloom Filter Merge Index chunk
  29/34 commit-graph: extract init and free write_commit_graph_context
  30/34 commit-graph: move write_commit_graph_reachable below write_commit_graph
  31/34 t7007-show: make the first test compatible with the next patch
  32/34 PoC commit-graph: use revision walk machinery for '--reachable'

Once upon a time I thought this was the greatest idea ever, but as
time goes by I get more and more concerned that this is a really dumb
idea, though don't yet know why.

  33/34 commit-graph: write modified path Bloom filters in "history order"
  34/34 commit-graph: use modified path Bloom filters with wildcards, if possible

Finally a cherry on top.



SZEDER Gábor (34):
  tree-walk.c: don't match submodule entries for 'submod/anything'
  commit-graph: fix parsing the Chunk Lookup table
  commit-graph-format.txt: all multi-byte numbers are in network byte
    order
  commit-slab: add a function to deep free entries on the slab
  diff.h: drop diff_tree_oid() & friends' return value
  commit-graph: clean up #includes
  commit-graph: simplify parse_commit_graph() #1
  commit-graph: simplify parse_commit_graph() #2
  commit-graph: simplify write_commit_graph_file() #1
  commit-graph: simplify write_commit_graph_file() #2
  commit-graph: allocate the 'struct chunk_info' array dinamically
  commit-graph: unify the signatures of all write_graph_chunk_*()
    functions
  commit-graph: simplify write_commit_graph_file() #3
  commit-graph: check chunk sizes after writing
  commit-graph-format.txt: document the modified path Bloom filter
    chunks
  Add a generic and minimal Bloom filter implementation
  Import a streaming-capable Murmur3 hash function implementation
  commit-graph: write "empty" Modified Path Bloom Filter Index chunk
  commit-graph: add commit slab for modified path Bloom filters
  commit-graph: fill the Modified Path Bloom Filter Index chunk
  commit-graph: load and use the Modified Path Bloom Filter Index chunk
  commit-graph: write the Modified Path Bloom Filters chunk
  commit-graph: load and use the Modified Path Bloom Filters chunk
  commit-graph: check all leading directories in modified path Bloom
    filters
  commit-graph: check embedded modified path Bloom filters with a mask
  commit-graph: deduplicate modified path Bloom filters
  commit-graph: load modified path Bloom filters for merge commits
  commit-graph: write Modified Path Bloom Filter Merge Index chunk
  commit-graph: extract init and free write_commit_graph_context
  commit-graph: move write_commit_graph_reachable below
    write_commit_graph
  t7007-show: make the first test compatible with the next patch
  PoC commit-graph: use revision walk machinery for '--reachable'
  commit-graph: write modified path Bloom filters in "history order"
  commit-graph: use modified path Bloom filters with wildcards, if
    possible

 Documentation/config/core.txt                 |   19 +
 .../technical/commit-graph-format.txt         |  127 +-
 Makefile                                      |    2 +
 bloom-filter.c                                |   91 ++
 bloom-filter.h                                |   47 +
 commit-graph.c                                | 1239 +++++++++++++++--
 commit-graph.h                                |   24 +-
 commit-slab-decl.h                            |    1 +
 commit-slab-impl.h                            |   13 +
 commit-slab.h                                 |   10 +
 compat/PMurHash.c                             |  291 ++++
 compat/PMurHash.h                             |   62 +
 diff.h                                        |   10 +-
 pathspec.c                                    |   10 +
 pathspec.h                                    |   13 +
 revision.c                                    |   27 +-
 shallow.c                                     |   14 +-
 t/t4010-diff-pathspec.sh                      |    4 +-
 t/t5318-commit-graph.sh                       |    3 +-
 t/t7007-show.sh                               |    7 +-
 tree-diff.c                                   |   21 +-
 tree-walk.c                                   |    9 +-
 22 files changed, 1872 insertions(+), 172 deletions(-)
 create mode 100644 bloom-filter.c
 create mode 100644 bloom-filter.h
 create mode 100644 compat/PMurHash.c
 create mode 100644 compat/PMurHash.h

-- 
2.27.0.rc1.431.g5c813f95dc




[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux