[PATCH] really plug memory leaks in git-svnimport

Stefan Sperling <stsp@xxxxxxxx> · Fri, 27 Jul 2007 15:09:41 +0200

Hello,

trying to convert the rather large  subversion repo of the DSLinux project
I hit memory leaks in git-svnimport.

The repo has *lots* of files, a workspace is about 2GB in size.
URLs: www.dslinux.org, svnweb at http://dslinux.gits.kiev.ua

Trace showing the program running out of memory:

#0  0x28285f57 in kill () from /lib/libc.so.6
#1  0x28285ef6 in raise () from /lib/libc.so.6
#2  0x282849bb in abort () from /lib/libc.so.6
#3  0x283bd9ea in abort_on_pool_failure (retcode=12)
    at subversion/libsvn_subr/pool.c:46
#4  0x285c8f1f in apr_palloc (pool=0x2801a018, size=53840)
    at memory/unix/apr_pools.c:618
#5  0x28621519 in rep_read_contents (baton=0x2800d388, buf=0x28026018 "",
    len=0xbfbfdb38) at subversion/libsvn_fs_fs/fs_fs.c:1881


I know about the patch at http://marc.info/?l=git&m=114345884526971&w=2
but that is already applied to the version I have here and does not
help because it does not really solve the core of the problem.

The following is rather lengthy, but I guess people applying
my patch want to understand what they're doing so I might just
as well provide all the info upfront.

The core of the problem is that the subversion perl bindings
don't seem to suceed at abstracting dynamic memory handling
away from perl scripts. Ripping out all explicit dynamic memory
handling from git-svnimport (done via the SVN::Pool API) does
not make the problem go away.

I am still not sure whether the right place to fix this
are the subversion perl bindings or the script.
Handling dynamic memory explicitly in a scripting language
is certainly weird. But I cannot get the subversion perl bindings
from current subversion trunk to build on my system to try to find
out if it is possible to fix the problem there
(see http://svn.haxx.se/users/archive-2007-07/0784.shtml).

So I had to try to solve the problem in the script itself.
I was able to fix the script up to the point where I was
able to import the entire DSLinux repository in a single
go without the script running out of memory.

To understand the fix you need to understand how subversion
handles dynamic memory:

Basically, every function in subversion that needs temporary
scratch space gets handed a pointer to a "memory pool".

It can allocate memory from this pool and does not need to care
about freeing anything because only the scope that created a pool
is responsible for destroying it.

So in C this looks a bit like:

svn_error_t* some_func(struct my_struct **result, apr_pool_t *pool)
{
	/* allocate space for result from pool */
	struct my_struct *res = apr_palloc(pool, sizeof(struct my_struct));
	
	/* compute result and pass pool further down if needed */
	
	...

	/* don't care about freeing the pool */
	*result = res;
	return SVN_NO_ERROR;
}

The caller will know when it does not need the result any longer
and destroy the pool at that moment.

Loops are interesting, since we can also "clear" a pool which does
not deallocate the memory it holds but marks the memory ready for re-use.

So with loops the following idiom is used:

	apr_pool_t *subpool = svn_pool_create(pool);		

	while (condition) {
		svn_pool_clear(subpool);

		/* do stuff and pass subpool further down if needed */

		...
	}
	
	svn_pool_destroy(subpool);


So how do these pools work in perl?

Well, the SVN::Core man page states the following:

  The perl bindings significantly simplify the usage of pools, while
  still being manually adjustable.

Great. And what does mean exactly?

  Functions requiring pool as the last argument (which are, almost all of
  the subversion functions), the pool is optionally[sic]. The default pool
  is used if it is omitted. If default pool is not set, a new root pool will
  be created and set as default automatically when the first function
  requiring a default pool is called.

So there's a "default" pool that is used if the caller does not
specify a $pool argument. We can take more control via the way
we create pools in perl:

  Methods
  
  new ([$parent])
     Create a new pool. The pool is a root pool if $parent is not sup-
     plied.
  
  new_default ([$parent])
     Create a new pool. The pool is a root pool if $parent is not sup-
     plied.  Set the new pool as default pool.
  
  new_default_sub
     Create a new subpool of the current default pool, and set the
     resulting pool as new default pool.
  
  clear
     Clear the pool.
  
  destroy
     Destroy the pool. If the pool is the default pool, restore the pre-
     vious default pool as default. This is normally called automati-
     cally when the SVN::Pool object is no longer used and destroyed by
     the perl garbage collector.
  
So pools in perl get destroyed when the garbage collector runs.
The man page gives an example that implies that the garbage collector
runs when the pool falls out of scope:

  # create a root pool and set it as default pool for later use
  my $pool = SVN::Pool->new_default;

  sub something {
      # create a subpool of the current default pool
      my $pool = SVN::Pool->new_default_sub;
      # some svn operations...

      # $pool gets destroyed and the previous default pool
      # is restored when $pool's lexical scope ends
  }

git-svmimport uses the repository access (RA) layer of the subversion
library to talk to the repository. The SVN::Ra man page states
the following about the 'pool' argument of the SVN::Ra constructor:

  The pool for the ra session to use, and also the member functions
  will be called with this pool. Default to a newly created root
  pool.

The SVN::Ra man page fails to mention what the SVN::Core man page implies.
If a function gets passed an explicit pool, or if we change the
current default pool, our pool will be used instead of the one
specified to the constructor of the RA layer!

So the patch below does the following:

  * Create an explicit one-and-only root pool.

  * Override the default pool SVN::RA is using with our root pool.
    Since the connection to the repo must stay open during the whole
    script, the RA layer will depend on the pool staying alive throughout
    the whole program, so it might as well use the root pool.

  * Closely follow the example in the SVN::Core man page.

    Before calling a subversion function, create a subpool of our
    root pool and make it the new default pool. Then call the
    function without passing $pool argument. The function will use
    our subpool anyway because we made it the current default pool.
    The previous default pool will become default pool again when
    the subpool falls out of scope.

  * A major problem seemed to be that the script keeps creating
    new root pools with

      my $pool = SVN::Pool->new();
    
    inside a loop in *global* scope instead of using the loop idiom
    described above. These pools are never, ever destroyed.

    So create a subpool for the loop to use and clear (i.e. mark for
    reuse, don't decallocate) the subpool at the start of the loop
    instead of allocating new memory with each iteration.


With this patch, the script never exceeded the memory usage of firefox
to a large extend while converting the DSLinux repo :-)
I hope the patch will help even more stupid and ugly people switch to git.

diff --git a/git-svnimport.perl b/git-svnimport.perl
index b73d649..53526f4 100755
--- a/git-svnimport.perl
+++ b/git-svnimport.perl
@@ -54,6 +54,7 @@ my $branch_name = $opt_b || "branches";
 my $project_name = $opt_P || "";
 $project_name = "/" . $project_name if ($project_name);
 my $repack_after = $opt_R || 1000;
+my $root_pool = SVN::Pool->new_default;
 
 @ARGV == 1 or @ARGV == 2 or usage();
 
@@ -132,7 +133,7 @@ sub conn {
 	my $auth = SVN::Core::auth_open ([SVN::Client::get_simple_provider,
 			  SVN::Client::get_ssl_server_trust_file_provider,
 			  SVN::Client::get_username_provider]);
-	my $s = SVN::Ra->new(url => $repo, auth => $auth);
+	my $s = SVN::Ra->new(url => $repo, auth => $auth, pool => $root_pool);
 	die "SVN connection to $repo: $!\n" unless defined $s;
 	$self->{'svn'} = $s;
 	$self->{'repo'} = $repo;
@@ -147,11 +148,10 @@ sub file {
 
 	print "... $rev $path ...\n" if $opt_v;
 	my (undef, $properties);
-	my $pool = SVN::Pool->new();
 	$path =~ s#^/*##;
+	my $subpool = SVN::Pool::new_default_sub;
 	eval { (undef, $properties)
-		   = $self->{'svn'}->get_file($path,$rev,$fh,$pool); };
-	$pool->clear;
+		   = $self->{'svn'}->get_file($path,$rev,$fh); };
 	if($@) {
 		return undef if $@ =~ /Attempted to get checksum/;
 		die $@;
@@ -185,6 +185,7 @@ sub ignore {
 
 	print "... $rev $path ...\n" if $opt_v;
 	$path =~ s#^/*##;
+	my $subpool = SVN::Pool::new_default_sub;
 	my (undef,undef,$properties)
 	    = $self->{'svn'}->get_dir($path,$rev,undef);
 	if (exists $properties->{'svn:ignore'}) {
@@ -202,6 +203,7 @@ sub ignore {
 sub dir_list {
 	my($self,$path,$rev) = @_;
 	$path =~ s#^/*##;
+	my $subpool = SVN::Pool::new_default_sub;
 	my ($dirents,undef,$properties)
 	    = $self->{'svn'}->get_dir($path,$rev,undef);
 	return $dirents;
@@ -358,10 +360,9 @@ open BRANCHES,">>", "$git_dir/svn2git";
 
 sub node_kind($$) {
 	my ($svnpath, $revision) = @_;
-	my $pool=SVN::Pool->new;
 	$svnpath =~ s#^/*##;
-	my $kind = $svn->{'svn'}->check_path($svnpath,$revision,$pool);
-	$pool->clear;
+	my $subpool = SVN::Pool::new_default_sub;
+	my $kind = $svn->{'svn'}->check_path($svnpath,$revision);
 	return $kind;
 }
 
@@ -889,7 +890,7 @@ sub commit_all {
 	# Recursive use of the SVN connection does not work
 	local $svn = $svn2;
 
-	my ($changed_paths, $revision, $author, $date, $message, $pool) = @_;
+	my ($changed_paths, $revision, $author, $date, $message) = @_;
 	my %p;
 	while(my($path,$action) = each %$changed_paths) {
 		$p{$path} = [ $action->action,$action->copyfrom_path, $action->copyfrom_rev, $path ];
@@ -925,14 +926,14 @@ print "Processing from $current_rev to $opt_l ...\n" if $opt_v;
 my $from_rev;
 my $to_rev = $current_rev - 1;
 
+my $subpool = SVN::Pool::new_default_sub;
 while ($to_rev < $opt_l) {
+	$subpool->clear;
 	$from_rev = $to_rev + 1;
 	$to_rev = $from_rev + $repack_after;
 	$to_rev = $opt_l if $opt_l < $to_rev;
 	print "Fetching from $from_rev to $to_rev ...\n" if $opt_v;
-	my $pool=SVN::Pool->new;
-	$svn->{'svn'}->get_log("/",$from_rev,$to_rev,0,1,1,\&commit_all,$pool);
-	$pool->clear;
+	$svn->{'svn'}->get_log("/",$from_rev,$to_rev,0,1,1,\&commit_all);
 	my $pid = fork();
 	die "Fork: $!\n" unless defined $pid;
 	unless($pid) {
Attachment:
pgp9z5Zew4bjN.pgp

Description: PGP signature