Re: [PATCH 0/9] [RFC] Changed Paths Bloom Filters

Jakub Narebski <jnareb@xxxxxxxxx> · Sun, 02 Feb 2020 00:32:30 +0100

Emily Shaffer <emilyshaffer@xxxxxxxxxx> writes:

[...]
> Speaking of making sure I understand the change right, I will also
> summarize my own understanding in the hopes that I can be corrected and
> others can learn too ;)
>
>  - The general idea is that we can write down a hint at the tree level
>    saying "something I own did change this commit"; when we look for an
>    object later, we can skip the commit where it looks like that path
>    didn't change.

Or, to be more exact, we write hint about all the files and directories
changed in the commit at the commit level.

Say, for example, that changed files are 'README' and 'subdir/file'.
We store hint that 'README', 'subdir/file' and 'subdir' paths have
changes in them.

>  - The change is in two pieces: first, to generate the hints per tree
>    (which happens during commit-graph); second, to use those hints to
>    optimize a rev walk (which happens in revision.c patch 8)

Right.

>  - When we calculate the hints during commit-graph, we check the diff of
>    each tree compared to its recent ancestor to see if there was a
>    change;

s/recent ancestor/first parent/

Right, though commits without any changes with respect to first-parent
(or null tree in case of root i.e. parentless commit) should be rare.

>           if so we calculate a hash for each path and use that as a key
>    for a map from hash to path.

Yes, and no.  Here we enter the details how Bloom filter is constructed.
We don't store paths in Bloom filter -- it would take too much space.

The hashmap is used as an implementation of mathematical set, a
temporary structure used during Bloom filter construction.  We could
have used string_list, but then we could waste time trying to add
intermediate directories multiple times (for example if 'foo/bar' and
'foo/baz' files changed, we need to add 'foo' path only once to Bloom
filter).

You can think of Bloom filter as a compact (and probabilistic, see
below) representation of set of changed paths.

>                                After we look through everything changed
>    in the diff, we can add it to a cumulative bloom filter list (one
>    filter per commit) so we have a handy in-memory idea of which paths
>    changed in each commit.

Yes.

>  - When it's time to do the rev walk, we ask for the bloom filter for
>    each commit and check if that commit's map contains the path to the
>    object we're worried about; if so, then it's OK to unpack the tree
>    and check if the path we are interested in actually did get changed
>    during that commit.

>From the point of view of rev walk, we ask for the Bloom filter for each
commit walked, and check if the (sub)set of changed paths includes given
path.

Bloom filter can answer "no" -- then we can skip the commit simplifying
history, or it can answer "maybe" -- then we need to check if file was
actually changed unpacking the trees (there is around 1% probability
that Bloom filter will say "maybe" if the path is not actually changed).

>From the point of view of Bloom filter, if the path was actually changed
the filter will always answer "maybe".  If the path was not changed,
then in most cases the filter will answer "no" but there is 1% of chance
that it will answer "maybe".

I hope that helps,
--
Jakub Narębski