Derrick Stolee <stolee@xxxxxxxxx> writes: > On 4/7/2018 2:40 PM, Jakub Narebski wrote: >> Derrick Stolee <dstolee@xxxxxxxxxxxxx> writes: >> >> [...] >>> On the Linux repository, performance tests were run for the following >>> command: >>> >>> git log --graph --oneline -1000 >>> >>> Before: 0.92s >>> After: 0.66s >>> Rel %: -28.3% >>> >>> Adding '-- kernel/' to the command requires loading the root tree >>> for every commit that is walked. There was no measureable performance >>> change as a result of this patch. >> >> In the "Git Merge contributor summit notes" [1] one can read that: >> >>> - VSTS adds bloom filters to know which paths have changed on the commit >>> - tree-same check in the bloom filter is fast; speeds up file history checks >>> - if the file history is _very_ sparse, then bloom filter is useful >> >> Could this method speed up also the second case mentioned here? Can >> anyone explain how this "path-changed bloom filter" works in VSTS? >> > > The idea is simple: for every commit, store a Bloom filter containing > the list of paths that are not TREESAME against the first parent. (A > slight detail: have a max cap on the number of paths, and store simply > "TOO_BIG" for commits with too many diffs.) Isn't Bloom filter with all bits set to 1 equivalent of "too big"? It would answer each query with "possibly in set, check it". > When performing 'git log -- path' queries, the most important detail > for considering how to advance the walk is whether the commit is > TREESAME to its first parent. For a deep path in a large repo, this is > almost always true. When a Bloom filter says "TREESAME" (i.e. "this > path is not in my set") it is always correct, so we can set the > treesame bit and continue without walking any trees. When a Bloom > filter says "MAYBE NOT TREESAME" (i.e. "this path is probably in my > set") you only need to do the same work as before: walk the trees to > compare against your first parent. > > If a Bloom filter has a false-positive rate of X%, then you can > possibly drop your number of tree comparisons by (100-X)%. This is > very important for large repos where some paths were changed only ten > times or so, the full graph needs to be walked and it is helpful to > avoid parsing too many trees. So this works only for exact file pathnames, or does checking for subdirectory also work? I guess it cannot work for globbing patterns (like "git log -- '*.c'"). I guess that it speeds up "git log -- <pathname>", but also "git blame <pathname>". Sidenote: I have stumbled upon https://github.com/efficient/ffbf project (fast-forward Bllom filters - for exact matching of large number of patterns), where they invent cache-efficient version of Bloom filter algorithm. The paper that the project is based on also might be interesting, if only because it points to other improvements of classic Bloom filters. >> Could we add something like this to the commit-graph file? > > I'm not sure if it is necessary for client-side operations, but it is > one of the reasons the commit-graph file has the idea of an "optional > chunk". It could be added to the file format (without changing version > numbers) and be ignored by clients that don't understand it. I could > also be gated by a config setting for computing them. My guess is that > only server-side operations will need the added response time, and can > bear the cost of computing them when writing the commit-graph > file. Clients are less likely to be patient waiting for a lot of diff > calculations. So the problem is performing large amount of diff calculations. I wonder if it would be possible to add Bloom filters incrementally, when for some reason we need to calculate diff anyway. > > If we add commit-graph file downloads to the protocol, then the server > could do this computation and send the data to all clients. But that > would be "secondary" information that maybe clients want to verify, > which is as difficult as computing it themselves. Can you tell us how much work (how much time) must spend the server to calculate those per-commit "files changed" Bloom fillters? Best, -- Jakub Narębski