On 9 April 2013 18:01, Jeff King <peff@xxxxxxxx> wrote:
> On Tue, Apr 09, 2013 at 08:03:24AM +0200, Johannes Sixt wrote:
>> If A mentions B (think of cherry-pick -x), then you must ensure that the
>> branch containing B was traversed first.
>
> Yeah, you're right. Multiple passes are necessary to get it
> completely right. And because each pass may change more commit id's, you
> have to recurse to pick up those changes, and keep going until you have
> a pass with no changes.

Just to give some context on how the BFG handles this (without doing
multiple passes):

The BFG makes a design choice (based on its intended use-case of
annihilating unwanted data) that a specific tree or blob will always be
cleaned in exactly the same way - because when you're trying to get rid
of large blobs or private data, you most likely /don't care/ where it
is, what commit it belongs to, or how old it is. The id for a cleaned
tree or blob is always the same no matter where it came from, and so the
BFG maintains an in-memory mapping of 'dirty' to 'clean' object ids
while cleaning a repo - whenever an object (commit, tag, tree, blob) is
cleaned, these values are stored in the map:

dirty-id -> clean-id
clean-id -> clean-id

(in terms of memory overhead, this amounts to only ~ 128MB for even
quite a large repo like the Linux kernel, so I don't spend much time
worrying about it)

The map memoises the cleaning functions on all objects, so an object
(particularly a tree) never gets cleaned more than once, which is one of
the things that makes the BFG fast.
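
To make the shape of that map concrete, here is a minimal Python sketch
of a memoised cleaner. It is only an illustration, not the BFG's actual
code: the class name, the `calls` counter and the toy cleaning function
are all made up for the example, and the only thing assumed about the
real cleaning step is that it is deterministic.

```python
from typing import Callable, Dict


class MemoisedCleaner:
    """Hypothetical sketch: caches dirty-id -> clean-id for any object."""

    def __init__(self, clean_fn: Callable[[str], str]) -> None:
        self._clean_fn = clean_fn        # expensive, deterministic cleaning step
        self._memo: Dict[str, str] = {}  # holds both kinds of entry (see below)
        self.calls = 0                   # counts real (non-cached) cleanings

    def clean(self, object_id: str) -> str:
        cached = self._memo.get(object_id)
        if cached is not None:
            return cached                # already cleaned, or already clean
        self.calls += 1
        clean_id = self._clean_fn(object_id)
        self._memo[object_id] = clean_id  # dirty-id -> clean-id
        self._memo[clean_id] = clean_id   # clean-id -> clean-id
        return clean_id


def toy_clean(object_id: str) -> str:
    # Stand-in for real cleaning: ids starting with "d" are "dirty".
    return "c" + object_id[1:] if object_id.startswith("d") else object_id


cleaner = MemoisedCleaner(toy_clean)
first = cleaner.clean("d1")      # does the real work
second = cleaner.clean("d1")     # served from the map
already = cleaner.clean("c1")    # a clean id maps to itself, no extra work
```

Storing the clean-id -> clean-id entry means that asking for an
already-cleaned object costs nothing more than a map lookup.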

Having these memoised functions makes cleaning commit messages fairly
easy - the message is grepped for hex strings more than a few characters
in length, and if a matched string resolves uniquely to an object id in
the repo, the clean() method is called on it to get the cleaned id. That
call will either return immediately with a previously calculated result
or, if the id came from a different branch, trigger a cascade of more
cleaning, eventually returning the required cleaned id.

In the case of git-filter-branch, the user has a lot more freedom to
change the tree-structure of commits on a commit-by-commit basis, so
memoising tree-cleaning is out of the question, but I guess it might be
possible to do memoisation of just the commit ids to short-cut the
multiple-pass problem.

- Roberto Tyley
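
For illustration, the message-cleaning step described above could be
sketched in Python roughly as follows. This is not the BFG's code. The
7-character minimum, the lowercase-only match, the linear prefix scan
and the choice to keep the abbreviation length of the original reference
are all assumptions made for the example, and `clean` stands in for
whatever memoised cleaning function is available.

```python
import re
from typing import Callable, Collection, Optional

# Assumption: "more than a few characters" is taken to mean 7 to 40 hex digits.
HEX_RUN = re.compile(r"\b[0-9a-f]{7,40}\b")


def resolve_unique(prefix: str, known_ids: Collection[str]) -> Optional[str]:
    """Return the single known id starting with prefix, or None if 0 or 2+."""
    matches = [oid for oid in known_ids if oid.startswith(prefix)]
    return matches[0] if len(matches) == 1 else None


def rewrite_message(message: str,
                    known_ids: Collection[str],
                    clean: Callable[[str], str]) -> str:
    """Replace every uniquely-resolving hex string with its cleaned id."""
    def substitute(match: "re.Match[str]") -> str:
        text = match.group(0)
        full_id = resolve_unique(text, known_ids)
        if full_id is None:
            return text                      # not an object id: leave untouched
        # Assumption: keep the same abbreviation length as the original text.
        return clean(full_id)[:len(text)]
    return HEX_RUN.sub(substitute, message)


# Toy data: one dirty commit id and its cleaned replacement.
DIRTY = "1234567" + "0" * 33
CLEAN = "abcdef0" + "1" * 33
mapping = {DIRTY: CLEAN}

message = "Fix deadbeef typo\n\n(cherry picked from commit 1234567)"
rewritten = rewrite_message(message, [DIRTY], lambda oid: mapping.get(oid, oid))
```

Here "deadbeef" looks like hex but resolves to nothing in the toy repo,
so it is left alone, while the cherry-pick reference is rewritten.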