Significant performance waste in git-svn and friends

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi,

Being a pervert abusing the way subversion doesn't deal with branches
and tags, I'm actually not a user of git-svn or git-svnimport, because
they just can't deal easily with my perversion. So I'm writing a script
to do the conversion for me, and since I also like to learn new things
when I'm coding, I'm writing it in ruby.

Anyways, one of the things I'm trying to convert is my svk repository
for debian packaging of xulrunner (so, a significant subset of the
mozilla tree), which doesn't involve a lot of revisions (around 280,
because I only imported releases or CVS snapshots), but involves a lot
of files (roughly 20k).

The first thing I noticed when twisting around the svk repo so that
git-svn could somehow import it a while ago, is that running git-svn
was in my case significantly slower than svnadmin dump | svnadmin load
(more than 2 times slower).

And now, with my own script, I got the same kind of "slowdown". So I
investigated it, and it didn't take long to realize that replacing
git-hash-object by a simple reimplementation in ruby was *way* faster.
git-hash-object being more than probably what you do the most when you
import a remote repository, it is not much of a surprise that forking
thousands of times is a huge performance waste.

So, just for the record, I did a lame hack of git-svn to see what kind
of speedup could happen in git-svn. You can find this lame hack as a
patch below. I did some tests (with a 1.5.2.1 release) and here are the
results, importing only the trunk (192 revisions), with no checkout, and
redirecting stdout to /dev/null:

original git-svn:
real    25m1.871s
user    8m51.593s
sys     12m31.659s

patched git-svn:
real    14m45.870s
user    7m31.928s
sys     4m1.047s

Some notes about the patch:
- I've not looked at the rest of the code to see if there was a way to
  get the size of the file so that SHA-1 sum and compression could be
  done in one pass and without copying the whole file in memory.
- The object creation in the .git/objects directory is not as safe as
  what git-hash-object does.

Some notes about git-svn:
- A few lines above the patched zone, the file is already read once to
  do the MD5 sum. It should be possible to do SHA-1, MD5 sums and
  deflate in just one pass.
- It might be worth testing if git-cat-file is called a lot. If so,
  implementing a simple git-cat-file equivalent that would work for
  unpacked objects could improve speed.

The same things obviously apply to git-cvsimport and other scripts
calling git-hash-object a lot.

Mike

diff --git a/git-svn.perl b/git-svn.perl
index d3c8cd0..202c228 100755
--- a/git-svn.perl
+++ b/git-svn.perl
@@ -2417,6 +2417,8 @@ use warnings;
 use Carp qw/croak/;
 use IO::File qw//;
 use Digest::MD5;
+use Digest::SHA1;
+use Compress::Zlib;
 
 # file baton members: path, mode_a, mode_b, pool, fh, blob, base
 sub new {
@@ -2603,15 +2605,26 @@ sub close_file {
 			$buf eq 'link ' or die "$path has mode 120000",
 			                       "but is not a link\n";
 		}
-		defined(my $pid = open my $out,'-|') or die "Can't fork: $!\n";
-		if (!$pid) {
-			open STDIN, '<&', $fh or croak $!;
-			exec qw/git-hash-object -w --stdin/ or croak $!;
+		my $size = 0;
+		my $buf = "";
+		while (my $read = sysread $fh, my $tmp, 4096) {
+			$size += $read;
+			$buf .= $tmp;
 		}
-		chomp($hash = do { local $/; <$out> });
-		close $out or croak $!;
+		my $sha1 = Digest::SHA1->new;
+		$sha1->add("blob $size\0");
+		$sha1->add($buf);
+		$hash = $sha1->hexdigest;
 		close $fh or croak $!;
 		$hash =~ /^[a-f\d]{40}$/ or die "not a sha1: $hash\n";
+		my $blob_dir = "$ENV{GIT_DIR}/objects/" . substr($hash, 0, 2);
+		my $blob_file = $blob_dir . "/" . substr($hash, 2);
+		if (! -f $blob_file) {
+			mkdir $blob_dir unless -d $blob_dir;
+			open BLOB, ">$blob_file";
+			print BLOB compress("blob $size\0" . $buf);
+			close BLOB;
+		}
 		close $fb->{base} or croak $!;
 	} else {
 		$hash = $fb->{blob} or die "no blob information\n";
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux