Re: [PATCH 0/4] Bloom filter experiment

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 10/9/2018 3:34 PM, SZEDER Gábor wrote:
To keep the ball rolling, here is my proof of concept in a somewhat
cleaned-up form, with still plenty of rough edges.

Peff, Szeder, and Jonathan,

Thanks for giving me the kick in the pants to finally write a proof of concept for my personal take on how this should work. My implementation borrows things from both Szeder and Jonathan's series. You can find my commits for all of the versions on GitHub (it's a bit too messy to share as a patch series right now, I think):

Repo: https://github.com/derrickstolee/git
Branches: bloom/* (includes bloom/stolee, bloom/peff, bloom/szeder, and bloom/tan for the respective implementations, and bloom/base as the common ancestor)

My implementation uses the following scheme:

1. Bloom filters are computed and stored on a commit-by-commit basis.

2. The filters are sized according to the number of changes in each commit, with a minimum of one 64-bit word.

3. The filters are stored in the commit-graph using two new optional chunks: one stores a single 32-bit integer for each commit that provides the end of its Bloom filter in the second "data" chunk. The data chunk also stores the magic constants (hash version, num hash keys, and num bits per entry).

4. We fill the Bloom filters as (const char *data, int len) pairs as "struct bloom_filter"s in a commit slab.

5. In order to evaluate containment, we need the struct bloom_filter, but also struct bloom_settings (stores the magic constants in one place), and struct bloom_key (stores the _k_ hash values). This allows us to hash a path once and test the same path against many Bloom filters.

6. When we compute the Bloom filters, we don't store a filter for commits whose first-parent diff has more than 512 paths.

7. When we compute the commit-graph, we can re-use the pre-existing filters without needing to recompute diffs. (Caveat: the current implementation will re-compute diffs for the commits with diffs that were too large.)

You can build the Bloom filters in my implementation this way:

GIT_TEST_BLOOM_FILTERS=1 ./git commit-graph write --reachable

You can play around with it like this:

   $ GIT_USE_POC_BLOOM_FILTER=$((8*1024*1024*8)) git commit-graph write
   Computing commit graph generation numbers: 100% (52801/52801), done.
   Computing bloom filter: 100% (52801/52801), done.
   # Yeah, I even added progress indicator! :)
   $ GIT_TRACE_BLOOM_FILTER=2 GIT_USE_POC_BLOOM_FILTER=y git rev-list --count --full-history HEAD -- t/valgrind/valgrind.sh
   886
   20:40:24.783699 revision.c:486          bloom filter total queries: 66095 definitely not: 64953 maybe: 1142 false positives: 256 fp ratio: 0.003873

Jonathan used this same test, so will I. Here is a summary table:

| Implementation | Queries | Maybe | FP # | FP %  |
|----------------|---------|-------|------|-------|
| Szeder         | 66095   | 1142  | 256  | 0.38% |
| Jonathan       | 66459   | 107   | 89   | 0.16% |
| Stolee         | 53025   | 492   | 479  | 0.90% |

(Note that we must have used different starting points, which is why my "Queries" is so much smaller.)

The increase in false-positive percentage is expected in my implementation. I'm using the correct filter sizes to hit a <1% FP ratio. This could be lowered by changing the settings, and the size would dynamically grow. For my Git repo (which contains git-for-windows/git and microsoft/git) this implementation grows the commmit-graph file from 5.8 MB to 7.3 MB (1.5 MB total, compared to Szeder's 8MB filter). For 105,260 commits, that rounds out to less than 20 bytes per commit (compared to Jonathan's 256 bytes per commit).

Related stats for my Linux repo: 781,756 commits, commit-graph grows from 43.8 to 55.6 MB (~12 MB additional, ~16 bytes per commit).

I haven't done a side-by-side performance test for these implementations, but it would be interesting to do so.

Despite writing a lot of code in a short amount of time, there is a lot of work to be done before this is submittable:

1. There are three different environment variables right now. It would be better to have one GIT_TEST_ variable and rely on existing tracing for logs (trace2 values would be good here).

2. We need config values for writing and consuming bloom filters, but also to override the default settings.

3. My bloom.c/bloom.h is too coupled to the commit-graph. I want to harden that interface to let Bloom filters live as their own thing, but that the commit-graph could load a bloom filter from the file instead of from the slab.

4. Tests, tests, and more tests.

We'll see how much time I have to do this polish, but I think the benefit is proven.

Thanks,
-Stolee



[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux