git-svn has a _lot_ of metadata

Karl Hasselström <kha@xxxxxxxxxxx> · Tue, 16 Oct 2007 12:22:59 +0200

I just imported an svn repository with about 120 tags and 140
branches, and with some repacking got the pack file down to a
comfortable 80 MB. However, .git is over 600 MB, owing to about 520 MB
of git-svn metadata. (This wasn't a problem when I only tracked a
handful of branches, since they're only a few megs apiece.)

There appear to be two kinds of metadata that take up a significant
fraction of the space.

  * An index file is saved for each branch and tag. I presume this
    corresponds to the branch head, and is used to speed up importing
    of new revisions to that branch. However, recreating an index with
    git-read-tree is very fast, so I don't think these need to be
    saved between git-svn runs.

  * A "rev_db" file is saved for each branch and tag. This is a text
    file with one sha1 per line -- I seem to remember that line X of
    this file is the commit sha1 of svn revision X. For revisions that
    didn't touch this branch/tag, there's a line of 40 zeros. And
    since a revision typically touches just one branch, the file is
    almost all zeros unless the number of branches is very small.

    This could probably be stored _much_ more efficiently. Just
    gzipping it with the standard options shrinks it by between a
    factor of 4 (for one of the busiest branches) and 300 (for a tag,
    which is written just once). But I understand that we need quick
    random access here?
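
To make the rev_db layout described above concrete, here is a minimal
Python sketch. It assumes the format is as recalled in this message:
fixed-width text records of 40 hex digits plus a newline, with the
record for svn revision X starting at byte offset X * 41. The function
names (write_rev_db, lookup, gzip_ratio) are hypothetical and are not
git-svn's own code.

```python
import gzip

RECORD_LEN = 41          # 40 hex digits plus "\n" (assumed layout)
NULL_SHA1 = "0" * 40     # placeholder for revisions that skipped this ref


def write_rev_db(path, commits, max_rev):
    """Write one line per revision 0..max_rev; untouched revisions get zeros."""
    with open(path, "w", newline="\n") as f:
        for rev in range(max_rev + 1):
            f.write(commits.get(rev, NULL_SHA1) + "\n")


def lookup(path, rev):
    """Return the sha1 recorded for rev, or None if absent or all zeros."""
    with open(path, "rb") as f:
        f.seek(rev * RECORD_LEN)       # fixed-width records allow a direct seek
        record = f.read(RECORD_LEN)
    if len(record) < RECORD_LEN:       # revision is past the end of the file
        return None
    sha1 = record[:40].decode("ascii")
    return None if sha1 == NULL_SHA1 else sha1


def gzip_ratio(path):
    """Return raw size divided by gzip-compressed size (default options)."""
    with open(path, "rb") as f:
        raw = f.read()
    return len(raw) / len(gzip.compress(raw))
```

The fixed record width is what makes lookup a single seek, and it is
also why each file grows with the total number of revisions in the
repository rather than with the number of commits on that branch.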

The index files should be easy enough to erase between runs, if they
indeed just correspond to the branch head. The rev_db files are
trickier; exactly what kind of lookups are required? Could it perhaps
be done with just one file, instead of one per branch/tag?
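
One possible shape for a single shared file, offered only as a sketch
under stated assumptions: store a record only for revisions that
actually produced a commit, keep the records sorted by (revision, ref),
and find them by binary search instead of by direct offset. The record
layout, the numeric ref id, and the names pack_records and find_commit
are all hypothetical; whether this satisfies the lookups git-svn really
needs is exactly the open question above.

```python
import struct

# Hypothetical record: svn revision (uint32), ref id (uint16), raw sha1.
RECORD = struct.Struct(">IH20s")


def pack_records(entries):
    """Pack (rev, ref_id, hex_sha1) tuples into bytes sorted by (rev, ref_id)."""
    rows = sorted((rev, ref_id, bytes.fromhex(sha1))
                  for rev, ref_id, sha1 in entries)
    return b"".join(RECORD.pack(*row) for row in rows)


def find_commit(data, rev, ref_id):
    """Binary-search the packed records; return the hex sha1 or None."""
    lo, hi = 0, len(data) // RECORD.size
    while lo < hi:
        mid = (lo + hi) // 2
        r, b, sha = RECORD.unpack_from(data, mid * RECORD.size)
        if (r, b) < (rev, ref_id):
            lo = mid + 1
        elif (r, b) > (rev, ref_id):
            hi = mid
        else:
            return sha.hex()
    return None
```

With this layout each commit costs 26 bytes once, instead of every
revision costing 41 bytes in every branch and tag file. The trade-off
is a logarithmic number of probes per lookup rather than one seek.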

-- 
Karl Hasselström, kha@xxxxxxxxxxx
      www.treskal.com/kalle