Re: filter-branch performance

On 9 December 2014 at 18:59, Jeff King <peff@xxxxxxxx> wrote:
> On Tue, Dec 09, 2014 at 07:52:33PM +0100, Henning Moll wrote:
>> I assume that there is a lot of process forking going on. Could that be the
>> cause?
>
> Yes. filter-branch is a shell script, and it is probably running
> multiple git commands per commit it is filtering.
>
>> Any ideas how to further improve?

Depending on how much time you can sink into improving the performance
(versus just allowing the process to run to completion), you could
also look into a non-forking solution, as well as not bothering to
load the commit trees. To me non-forking means putting everything into
the JVM by using JGit, like the BFG does, though libgit2 might also be
an option.

Changing the BFG's code to do the transformation in your script is
absolutely trivial - define a commit-node cleaner like this:

object SetCommitterToAuthor extends CommitNodeCleaner {
  // PersonIdent class holds name, email & time
  override def fixer(kit: CommitNodeCleaner.Kit) = c => c.copy(committer = c.author)
}

...trivial if you don't mind compiling Scala with SBT, that is, and I'm
sure some people do! A DSL for non-Scala people to define their own
BFG scripts would be good; I must get on that some day.
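
For anyone who wants to try the idea without building the BFG, here is
a minimal standalone sketch of the same transformation. Ident and Node
are hypothetical stand-ins I made up for illustration; they are not the
real JGit PersonIdent or the BFG's commit-node type, and the email
address is a placeholder.

```scala
// Hypothetical stand-ins for illustration only (not the JGit/BFG classes).
final case class Ident(name: String, email: String, epochSeconds: Long, tz: String)
final case class Node(author: Ident, committer: Ident, message: String)

object SetCommitterToAuthorSketch {
  // Same shape as the cleaner above: copy the commit, replacing the
  // committer identity (name, email and time) with the author's.
  val fixer: Node => Node = c => c.copy(committer = c.author)
}
```

Because case classes are immutable, copy returns a new value and leaves
the input untouched, which is why the one-liner is safe to apply to
every commit in turn.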

The BFG is generally faster than filter-branch for 3 reasons:

1. No forking - everything stays in the JVM process
2. Embarrassingly parallel algorithm makes good use of multi-core machines
3. Memoization means no Git object (file or folder) is cleaned more than once
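
To illustrate point 3, here is a rough sketch of memoizing a cleaning
function by object id. MemoizedCleaner is a hypothetical name and this
is not the BFG's actual implementation, just the general technique.

```scala
import scala.collection.concurrent.TrieMap

// Hypothetical helper: wraps a cleaning function so that the result for
// each id is cached and reused on later lookups.
final class MemoizedCleaner[Id, Out](clean: Id => Out) {
  private val cache = TrieMap.empty[Id, Out]

  // Returns the cached result if present, otherwise computes and stores it.
  // Under heavy contention two threads may both compute the same entry,
  // but only one result is kept in the map.
  def apply(id: Id): Out = cache.getOrElseUpdate(id, clean(id))
}
```

A file or folder that appears unchanged in thousands of commits has the
same id every time, so it hits the cache on every visit after the first.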

In the case of your problem, only the first factor will be noticeably
helpful. Unfortunately, commits do need to be cleaned sequentially, as
their hashes depend on the hashes of their parents, so parallelism
can't help here. And filter-branch already cleans each /commit/ only
once (unlike files or folders, which it re-cleans for every commit), so
memoization gives no advantage either. The last 2 reasons in the list
won't be significant.
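
The dependency on parent hashes can be sketched in a few lines. This
assumes the usual Git object layout (SHA-1 over a "commit <length>"
header, a NUL byte, then the commit body); CommitIdSketch and its
parameters are hypothetical names, and it ignores extra headers such as
encoding or gpgsig.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.security.MessageDigest

object CommitIdSketch {
  // Computes a commit id from its fields. The parent ids are part of the
  // hashed bytes, so a rewritten parent always yields a new child id.
  def commitId(tree: String, parents: Seq[String], author: String,
               committer: String, message: String): String = {
    val lines = Seq(s"tree $tree") ++ parents.map(p => s"parent $p") ++
      Seq(s"author $author", s"committer $committer", "", message)
    val body = lines.mkString("\n").getBytes(UTF_8)
    // Header is "commit <byte length>" followed by a single NUL byte.
    val header = s"commit ${body.length}".getBytes(UTF_8) :+ 0.toByte
    val digest = MessageDigest.getInstance("SHA-1").digest(header ++ body)
    digest.map(b => "%02x".format(b)).mkString
  }
}
```

Changing the committer of a root commit changes its id, which changes
the parent line of every child, and so on down the history. That is why
a child can't be rewritten until its parents have been.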

For your specific use case, though, the fact that the BFG doesn't load
the file tree at all unless it needs to clean it will also help.

I decided to knock up an egregious hack in the BFG to see what
performance would be like. I ran it against a fairly large repo
(https://github.com/bfg-repo-cleaner-demos/intellij-community-original)
with 100k commits, stored in /dev/shm, using the SetCommitterToAuthor
code above. The BFG run completed in 31.7 seconds; you can see the
resulting repo here:

https://github.com/rtyley/intellij-community-set-committer-to-author

I started running the same test some time ago using filter-branch, but
unfortunately that test has not completed yet, so the BFG appears to be
substantially faster.

Before:
$ git cat-file -p b02bf46c4e93c2e8570910cdd68eb6f4ce21ff81
tree 7a412e49ecdbd966d7efe5fe746ff3ea3b6067d1
parent 8794219e3e84aed3cc8af926ffd74beafa51fb6b
author peter <peter@xxxxxxxxxxxxx> 1370854045 +0200
committer peter <peter@xxxxxxxxxxxxx> 1370854098 +0200

After:
$ git cat-file -p 3adb7b2a5c87320a5a028b6a59a7132c75a6e91c
tree 7a412e49ecdbd966d7efe5fe746ff3ea3b6067d1
parent 5efcdb551789b0d0bb541de9325f09521c5fbcb6
author peter <peter@xxxxxxxxxxxxx> 1370854045 +0200
committer peter <peter@xxxxxxxxxxxxx> 1370854045 +0200 <- time fixed

The relevant code is in:
https://github.com/rtyley/bfg-repo-cleaner/compare/set-committer-to-author