Re: Repacking many disconnected blobs

On Wed, 14 Jun 2006 00:17:58 -0700 Keith Packard wrote:

> parsecvs scans every ,v file and creates a blob for every revision of
> every file right up front. Once these are created, it discards the
> actual file contents and deals solely with the hash values.
> 
> The problem is that while this is going on, the repository consists
> solely of disconnected objects, and I can't make git-repack put those
> into pack objects. This leaves the directories bloated, and operations
> within the tree quite sluggish. I'm importing a project with 30000 files
> and 30000 revisions (the CVS repository is about 700MB), and after
> scanning the files, and constructing (in memory) a complete revision
> history, the actual construction of the commits is happening at about 2
> per second, and about 70% of that time is in the kernel, presumably
> playing around in the repository.
> 
> I'm assuming that if I could get these disconnected blobs all neatly
> tucked into a pack object, things might go a bit faster.

git-repack.sh basically does:

  git-rev-list --objects --all | git-pack-objects .tmp-pack

When you have only disconnected blobs, obviously the first part does
not work: git-rev-list only walks objects reachable from refs, so it
cannot find these blobs.  However, you can generate the object list
yourself - e.g., when you add a blob, do:

  fprintf(list_file, "%s %s\n", sha1, path);

(path should be a relative path in the repo without ",v" or "Attic" -
it is used for delta packing optimization, so getting it wrong will
not cause any corruption, but the pack may become significantly
larger).  You may output some duplicate sha1 values, but
git-pack-objects should handle duplicates correctly.
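
A minimal sketch of the list-writing step, assuming the importer has the
blob's hex sha1 and the CVS master file path at hand. The helper names
cvs_path_to_repo_path() and write_list_entry() are hypothetical and are not
part of parsecvs or git; the only thing taken from the text above is the
"sha1 path" line format and the advice to drop ",v" and "Attic":

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: turn a CVS master path such as
 * "src/Attic/foo.c,v" into the repo-relative name "src/foo.c".
 * Returns 0 on success, -1 if the result does not fit in out.
 */
static int cvs_path_to_repo_path(const char *rcs, char *out, size_t outlen)
{
    size_t len = strlen(rcs);
    size_t o = 0;
    const char *p = rcs;
    const char *end;

    if (outlen == 0)
        return -1;

    /* Drop a trailing ",v" if present. */
    if (len >= 2 && strcmp(rcs + len - 2, ",v") == 0)
        len -= 2;
    end = rcs + len;

    while (p < end) {
        /* Find the end of the current path component. */
        const char *slash = memchr(p, '/', (size_t)(end - p));
        size_t clen = slash ? (size_t)(slash - p) : (size_t)(end - p);
        size_t n = clen + (slash ? 1 : 0);

        /* Skip a directory component named exactly "Attic". */
        if (slash && clen == 5 && memcmp(p, "Attic", 5) == 0) {
            p = slash + 1;
            continue;
        }

        /* Copy the component plus its trailing '/' (if any). */
        if (o + n + 1 > outlen)
            return -1;
        memcpy(out + o, p, n);
        o += n;
        p += n;
    }
    out[o] = '\0';
    return 0;
}

/*
 * Hypothetical helper: append one "sha1 path" line to the object list.
 * Returns 0 on success, -1 on error.
 */
static int write_list_entry(FILE *list_file, const char *sha1,
                            const char *rcs_path)
{
    char path[4096];

    if (cvs_path_to_repo_path(rcs_path, path, sizeof(path)) != 0)
        return -1;
    return fprintf(list_file, "%s %s\n", sha1, path) < 0 ? -1 : 0;
}
```

Since the path is only a hint for delta selection, an imperfect
normalization here costs pack size, not correctness.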

Then just invoke "git-pack-objects --non-empty .tmp-pack < list_file";
it will print the sha1 of the resulting pack to stdout.  Then you need to
move the pack (and its index) into place and call git-prune-packed, which
does not use object lists, so it should work even with unreachable objects.
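
The "move into place" step could be sketched as below. This assumes
git-pack-objects has written "<base>-<sha1>.pack" and "<base>-<sha1>.idx"
for the base name it was given, and that packs live under a directory such
as ".git/objects/pack" named "pack-<sha1>.pack" / "pack-<sha1>.idx", which
is how git-repack.sh arranges things. install_pack() is a hypothetical
helper, not a git function:

```c
#include <stdio.h>

/*
 * Hypothetical helper: move "<base>-<sha1>.pack" and "<base>-<sha1>.idx"
 * to "<packdir>/pack-<sha1>.pack" and "<packdir>/pack-<sha1>.idx".
 * Returns 0 on success, -1 on error.
 */
static int install_pack(const char *base, const char *sha1,
                        const char *packdir)
{
    static const char *const ext[] = { "pack", "idx" };
    char from[4096], to[4096];
    size_t i;

    /* Move the .pack first so an .idx never appears without its pack. */
    for (i = 0; i < sizeof(ext) / sizeof(ext[0]); i++) {
        int n = snprintf(from, sizeof(from), "%s-%s.%s",
                         base, sha1, ext[i]);
        int m = snprintf(to, sizeof(to), "%s/pack-%s.%s",
                         packdir, sha1, ext[i]);

        if (n < 0 || (size_t)n >= sizeof(from) ||
            m < 0 || (size_t)m >= sizeof(to))
            return -1;

        /* rename() only works within a single filesystem. */
        if (rename(from, to) != 0)
            return -1;
    }
    return 0;
}
```

After a successful install_pack(".tmp-pack", sha1, ".git/objects/pack"),
running git-prune-packed removes the loose copies of the packed blobs.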

You may even want to repack more than once during the import;
probably the simplest way to do it is to truncate list_file after
each repack and use "git-pack-objects --incremental".
