Re: [PATCH 0/9] [RFC] Changed Paths Bloom Filters

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 1/21/2020 6:40 PM, Emily Shaffer wrote:>> Garima Singh (9):
>>   [1/9] commit-graph: add --changed-paths option to write
> I wonder if this can be combined with 2; without 2 actually the
> documentation is wrong for this one, right? Although I suppose you also
> mentioned 2 perhaps being too long :)
> 

True. The ordering of these commits has been a very subjective discussion. 
Leaving this commit isolated the way it is, does help separate the option
from the bloom filter computation and commit graph write step. 

Also for clarity, I have changed the message for 2/9 to:
`commit-graph: compute Bloom filters for changed paths`

 
>>   [2/9] commit-graph: write changed paths bloom filters
> As I understand it, this one always regenerates the bloom filter pieces,
> and doesn't write it down in the commit-graph file. How much longer does
> that take now than before? I don't have a great feel for how often 'git
> commit-graph' is run, or whether I need to be invoking it manually.
> 

Yes. Computation and writing the bloom filters to the commit-graph file
(2/9 and 5/9) are ideally one time operations per commit. The time taken
depends on the shape and size of the repo since computation involves 
running a full diff. If you look at the discussions and patches exchanged
by Dr. Stolee and Peff: it has improved greatly since this RFC patch. 
My next submission will carry these patches and more concrete numbers for
time and memory. 

Also, `git commit-graph write` is not run very frequently and is ideally
incremental. A full rewrite is usually only done in case there is a 
corruption caught by `git commit-graph verify` in which case, you delete
and rewrite. These are both manual operations. 
See the docs here: 
https://git-scm.com/docs/git-commit-graph

Note: that fetch.writeCommitGraph is now on by default in which case 
new computations happen automatically for newly fetched commits. But 
I am not adding the --changed-paths option to that yet. So computing, 
writing and using bloom filters will still be an opt-in feature 
that users need to manually run. 
 
>>   [7/9] commit-graph: reuse existing bloom filters during write.
> I saw an option to give up if there wasn't an existing bloom filter, but
> I didn't see an option here to force recalculating. 

The way to force recalculation at the moment would be to delete the commit
graph file and write it again. 

Is there a scenario when that would be useful? 

Yes, if there are algorithm or hash version changes in the changed paths logic
we would need to rewrite. Each of the cases I can think of would involve
triggering a recalculation by deleting and rewriting.

What's the mitigation path if:
>  - I have a commit-graph with v0 of the bloom index piece, but update to
>    Git which uses v1?     
   
     Take a look at 5/9, when we are parsing the commit graph: if the code 
     is expected to work with a particular version (hash_version = 1 in the 
     current version), and the commit graph has a different version, we just 
     ignore it. In the future, this is where we could extend this to support 
     multiple versions.

>  - My commit-graph file is corrupted in a way that the bloom filter
>    results are incorrect and I am missing a blob change (and therefore
>    not finding it during the walk)?
         The mitigation for any commit graph corruption is to delete and 
     rewrite. 

     If however, we are confident that the bloom filter computation itself
     is wrong, the immediate mitigations would be to deleting the commit
     graph file and rewriting without the --changed-paths option; and ofc
     report the bug so it can be investigated and fixed. :) 

> Speaking of making sure I understand the change right, I will also
> summarize my own understanding in the hopes that I can be corrected and
> others can learn too ;)
> 
>  - The general idea is that we can write down a hint at the tree level
>    saying "something I own did change this commit"; when we look for an
>    object later, we can skip the commit where it looks like that path
>    didn't change.

The hint we are storing is a bloom filter which answers "No" or "Maybe"
to the question "Did file A change in commit c"
If the answer is No, we can ignore walking that commit. Else, we fall
back to the diff algorithm like before to confirm if the file changed or
not. 

>  - The change is in two pieces: first, to generate the hints per tree
>    (which happens during commit-graph); second, to use those hints to
>    optimize a rev walk (which happens in revision.c patch 8)

Yes. 

>  - When we calculate the hints during commit-graph, we check the diff of
>    each tree compared to its recent ancestor to see if there was a
>    change; if so we calculate a hash for each path and use that as a key
>    for a map from hash to path. After we look through everything changed
>    in the diff, we can add it to a cumulative bloom filter list (one
>    filter per commit) so we have a handy in-memory idea of which paths
>    changed in each commit.
>  - When it's time to do the rev walk, we ask for the bloom filter for
>    each commit and check if that commit's map contains the path to the
>    object we're worried about; if so, then it's OK to unpack the tree
>    and check if the path we are interested in actually did get changed
>    during that commit.

Essentially yes. There are a few implementation specifics this description
is glossing over, but I understand that is the intention. 

 
> Thanks.
>  - Emily

Cheers! 
Garima Singh



[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux