Keith Packard <keithp@xxxxxxxxxx> wrote:
> parsecvs scans every ,v file and creates a blob for every revision of
> every file right up front. Once these are created, it discards the
> actual file contents and deals solely with the hash values.
>
> The problem is that while this is going on, the repository consists
> solely of disconnected objects, and I can't make git-repack put those
> into pack objects. This leaves the directories bloated and operations
> within the tree quite sluggish. I'm importing a project with 30000
> files and 30000 revisions (the CVS repository is about 700MB), and
> after scanning the files and constructing a complete revision history
> in memory, the actual construction of the commits happens at about 2
> per second, with about 70% of that time spent in the kernel,
> presumably playing around in the repository.
>
> I'm assuming that if I could get these disconnected blobs all neatly
> tucked into a pack object, things might go a bit faster.

What about running git-update-index with .git/objects as the current
working directory, adding all the files in ??/* to the index, then
running git-write-tree on that index and git-commit-tree on the
resulting tree?  When you are done you have a bunch of orphan trees and
an orphan commit, but these shouldn't be very big, and I'd guess they
would prune out with a repack as long as you don't hold a ref to the
orphan commit.
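Concretely, something like this untested sketch ought to do it.  It
assumes every loose object at this point is a blob, which parsecvs
guarantees above, and it uses --cacheinfo so nothing gets re-hashed:
the fan-out path "ab/cdef..." already names object "abcdef...".  The
scratch index file and the "pack-helper" ref name are placeholders:

    # Run from the top of the repository, with a scratch index so
    # the real one is left untouched.
    GIT_INDEX_FILE=.git/pack-helper-index
    export GIT_INDEX_FILE

    for f in .git/objects/??/*
    do
        [ -f "$f" ] || continue
        # The path "ab/cdef..." encodes the object name "abcdef..."
        rel=${f#.git/objects/}
        sha1=$(echo "$rel" | tr -d /)
        # Register the existing blob in the index by its hash.
        git update-index --add --cacheinfo 100644 "$sha1" "$sha1"
    done

    tree=$(git write-tree)
    commit=$(echo 'pack loose blobs' | git commit-tree "$tree")
    git update-ref refs/heads/pack-helper "$commit"

    # Everything reachable from the ref lands in a pack and the
    # loose copies are deleted.
    git repack -a -d
    rm -f "$GIT_INDEX_FILE"

With 30000+ objects you would want to feed a single
"git update-index --index-info" invocation on stdin rather than exec
git once per object, and once the real import holds refs that reach
the blobs you can delete pack-helper: the orphan tree and commit then
fall out on a later repack, as described above.

-- 
Shawn.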