Re: git-svn: .git/svn disk usage

Eric Wong <normalperson@xxxxxxxx> · Wed, 5 Dec 2007 00:54:52 -0800

David Voit <david.voit@xxxxxxxxx> wrote:
> Ollie Wild <aaw <at> google.com> writes:
> 
> > 
> > Hi,
> > 
> > I've been using git-svn to mirror the gcc repository at
> > svn://gcc.gnu.org/svn/gcc.  Recently, I noticed that my .git directory
> > is consuming 11GB of disk space.  Digging further, I discovered that
> > 9.8GB of this is attributable to the .git/svn directory (which
> > includes 200 branches and 2,588 tags).  Given that my .git/objects
> > directory is 652MB, it seems that it ought to be possible to store
> > this information in a more compact form.
> > 
> > I'm curious if other developers have run into this issue.  If so, are
> > there any proposals / plans for improving the storage of git-svn
> > metadata?
> > 
> > Thanks,
> > Ollie
> > 
> 
> Hi all,
> 
> I've seen the same effect, so I tried to reduce the size of the revdb and made a
> new format:
> First, in the bin files the SHA1s are stored as raw binary values rather than
> as ASCII hex; this reduces a single SHA1 from 41 bytes to 20.
> Second, only the non-zero commits are saved; that's what the idx files are used for.
> An idx file has three 32-bit integers per entry.
> The first integer represents the first zero svn revision, the second the last
> zero revision and the last integer is the position of the next non-zero revision
> in the bin.
> 
> Example:
> Revisions 0-373006 are zero revisions, 373007 is the first actually used revision,
> and 373008-373623 are again zero revisions.
> The idx has the following content:
> 0 373006 0
> 373007 373007 1
> 
> and the bin only saves
> 59037b8043268c9ca0d87ba86519ed0b5358c8a1
> eef3f7e25993a46e3c4242aa502d93e909b08c57

I'd very much like rev_db to be smaller, but I find the idea of the data
relying on a separate index too fragile and difficult to recover
from if corruption occurs (mainly for --no-metadata users).

The rev_db is simply a lookup for mapping SVN revision numbers to
git commit SHA1s.

I have an idea for a more compact .rev_db format:

  All records are 24 bytes:
    4 bytes for a 32-bit integer representing the SVN revision
    20 bytes for the git commit SHA1

  rev_db is an append-only format, so the 32-bit integer will be
  monotonically increasing over time, which allows:

  Lookups by revision number via binary search, which in turn
  means empty revisions never need to be entered anymore.
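
To make the record layout and lookup concrete, here is a minimal sketch in
Python (not git-svn's Perl, and not an actual implementation). The byte order
(big-endian), the helper names append_record and lookup, and the second
revision number in the usage lines are my assumptions for illustration; only
the 4 + 20 byte layout and the binary search come from the description above.

```python
import struct

# One record: 4-byte unsigned revision (big-endian is an assumption),
# followed by the 20 raw bytes of the git commit SHA1. No padding.
RECORD = struct.Struct(">I20s")  # RECORD.size == 24


def append_record(db: bytearray, rev: int, sha1_hex: str) -> None:
    """Append one (revision, SHA1) record, enforcing increasing revisions."""
    if len(db) >= RECORD.size:
        last_rev, _ = RECORD.unpack_from(db, len(db) - RECORD.size)
        if rev <= last_rev:
            raise ValueError("revisions must be strictly increasing")
    db.extend(RECORD.pack(rev, bytes.fromhex(sha1_hex)))


def lookup(db, rev: int):
    """Binary-search the records; return the SHA1 as hex, or None if absent."""
    lo, hi = 0, len(db) // RECORD.size
    while lo < hi:
        mid = (lo + hi) // 2
        mid_rev, sha1 = RECORD.unpack_from(db, mid * RECORD.size)
        if mid_rev == rev:
            return sha1.hex()
        if mid_rev < rev:
            lo = mid + 1
        else:
            hi = mid
    return None


# Usage: empty revisions are simply never written.
db = bytearray()
append_record(db, 373007, "59037b8043268c9ca0d87ba86519ed0b5358c8a1")
append_record(db, 373700, "eef3f7e25993a46e3c4242aa502d93e909b08c57")  # made-up revision
print(len(db), lookup(db, 373007), lookup(db, 373008))
```

Because every record carries its own revision number, a damaged file can
still be scanned record by record without any separate index.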

Of course there needs to be a migration strategy for existing
repositories (mainly the ones using --no-metadata), too.
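
A migration could be roughly as simple as the following Python sketch. It
assumes the current file is a sequence of 41-byte records (40 hex characters
plus a newline) where the record position is the SVN revision and an all-zero
SHA1 marks an empty revision; the function name convert and the big-endian
byte order are hypothetical, chosen only for illustration.

```python
import struct

OLD_RECORD_LEN = 41                  # 40 hex chars + newline (assumed layout)
ZERO_SHA1 = "0" * 40                 # assumed placeholder for empty revisions
NEW_RECORD = struct.Struct(">I20s")  # 4-byte revision + 20-byte raw SHA1


def convert(old: bytes) -> bytes:
    """Convert an old-style rev_db image into 24-byte records."""
    if len(old) % OLD_RECORD_LEN:
        raise ValueError("old rev_db length is not a multiple of 41 bytes")
    out = bytearray()
    for rev in range(len(old) // OLD_RECORD_LEN):
        start = rev * OLD_RECORD_LEN
        sha1_hex = old[start:start + 40].decode("ascii")
        if sha1_hex != ZERO_SHA1:    # skip empty revisions entirely
            out.extend(NEW_RECORD.pack(rev, bytes.fromhex(sha1_hex)))
    return bytes(out)


# Usage: three empty revisions followed by one real commit.
old = (("0" * 40 + "\n") * 3
       + "59037b8043268c9ca0d87ba86519ed0b5358c8a1\n").encode("ascii")
new = convert(old)
print(len(old), "->", len(new), "bytes")
```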

Users not using --no-metadata (nor the option for svk metadata) can just
remove their .rev_db* files and git-svn will automatically recreate them
as needed.

> The format currently used produces a file of 373624*41 bytes.
> 
> Used on a git-svn clone here, the results are:
> old:
> 1,1G    hadoop (1004M   svn/)
> new:
> 47M     hadoop (5,9M    svn/)

Very nice reduction!

> Here is some example source code to test this idea:
> 
> I'll try to integrate this into git-svn this week.
> 
> NOTE: I'm not a Perl hacker, so use at your own risk.
> 
> Bye David
> P.S.: I'm not a member of this list, so please reply directly to me.

If you don't have time, I'll try to implement my ideas sometime this
week or weekend (assuming I have time, too).

-- 
Eric Wong