On 10/9/2018 3:34 PM, SZEDER Gábor wrote:
> To keep the ball rolling, here is my proof of concept in a somewhat
> cleaned-up form, with still plenty of rough edges.
>
> You can play around with it like this:
>
>   $ GIT_USE_POC_BLOOM_FILTER=$((8*1024*1024*8)) git commit-graph write
>   Computing commit graph generation numbers: 100% (52801/52801), done.
>   Computing bloom filter: 100% (52801/52801), done.
>   # Yeah, I even added a progress indicator! :)
>   $ GIT_TRACE_BLOOM_FILTER=2 GIT_USE_POC_BLOOM_FILTER=y git rev-list --count --full-history HEAD -- t/valgrind/valgrind.sh
>   886
>   20:40:24.783699 revision.c:486 bloom filter total queries: 66095 definitely not: 64953 maybe: 1142 false positives: 256 fp ratio: 0.003873
>
> The value of $GIT_USE_POC_BLOOM_FILTER only really matters when
> writing the Bloom filter, and it specifies the number of bits in the
> filter's bitmap, IOW the above command creates an 8MB Bloom filter.
> To make use of the filter, the variable can be anything non-empty.
>
> Writing the Bloom filter is very slow as it is (yeah, that's why I
> bothered with the progress indicator ;). I wrote about it in patch
> 2's commit message: the cause for about half of the slowness is
> rather obvious, but I don't (yet) know what's responsible for the
> other half.
>
> Not a single test... but I've run loops over all files in git.git
> comparing the output of 'git rev-list HEAD -- $file' with and without
> the Bloom filter, and, surprisingly, they match. My quick'n'dirty
> experiments usually don't fare this well...
>
> It's also available at:
>
>   https://github.com/szeder/git bloom-filter-experiment
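
As a quick sanity check of the numbers quoted above, here is a minimal shell sketch. It assumes that the "fp ratio" in the trace is false positives divided by total queries; that is inferred from the printed values, not taken from the patch. The variable names are made up for illustration.

```shell
#!/bin/sh
# Filter size as passed in GIT_USE_POC_BLOOM_FILTER: a number of bits.
bits=$((8*1024*1024*8))
bytes=$((bits / 8))
mib=$((bytes / 1024 / 1024))
echo "filter bitmap: $bits bits = $bytes bytes = $mib MiB"

# Counters copied from the GIT_TRACE_BLOOM_FILTER line in the quoted message.
total=66095
definitely_not=64953
maybe=1142
false_positives=256

# Every query is answered either "definitely not" or "maybe".
if test $((definitely_not + maybe)) -eq "$total"
then
    echo "counters add up"
fi

# "maybe" answers that were not false positives.
true_positives=$((maybe - false_positives))
echo "true positives: $true_positives"

# Assumption: fp ratio = false positives / total queries.
# The shell has no floating point, so awk does the division.
fp_ratio=$(awk -v fp="$false_positives" -v n="$total" \
    'BEGIN { printf "%.6f", fp / n }')
echo "fp ratio: $fp_ratio"
```

The "maybe" answers that are not false positives come to 886, which is the same number that 'git rev-list --count' printed, and the computed ratio agrees with the 0.003873 in the trace.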

Thanks! I will take a close look at this tomorrow and start playing
with it.

-Stolee