Re: Creating objects manually and repack

On 8/5/06, Shawn Pearce <spearce@xxxxxxxxxxx> wrote:
I'm almost done with what I'm calling `git-fast-import`.  It takes
a stream of blobs on STDIN and writes the pack to a file, printing
SHA1s in hex format to STDOUT.  The basic format for STDIN is a 4
byte length (native format) followed by that many bytes of blob data.
It prints the SHA1 for that blob to STDOUT, then waits for another
length.
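The framing described above — a 4-byte native-order length followed by the blob bytes, with a hex SHA1 printed per blob — could be sketched like this (a hypothetical Python sketch; the function names are mine, and the SHA1 shown is git's standard blob hash of "blob <size>\0" plus the contents):

```python
import hashlib
import struct

def frame_blob(data: bytes) -> bytes:
    # One stdin record for the prototype git-fast-import: a 4-byte
    # length in native byte order, followed by that many blob bytes.
    return struct.pack("=I", len(data)) + data

def git_blob_sha1(data: bytes) -> str:
    # The SHA1 the tool would print for each blob: git hashes the
    # header "blob <size>\0" followed by the raw contents.
    h = hashlib.sha1()
    h.update(b"blob %d\x00" % len(data))
    h.update(data)
    return h.hexdigest()

# A driver would write frame_blob(rev) for each revision, newest to
# oldest within each ,v file, then close stdin to finalize the pack.
```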

It naively deltas each object against the prior object, thus it
would be best to feed it one ,v file at a time working from the most
recent revision back to the oldest revision.  This works well for
an RCS file as that's the natural order to process the file in.  :-)

I am already doing this.

When done you close STDIN and it'll rip through and update the pack
object count and the trailing checksum.  This should let you pack
the entire repository in delta format using only two passes over the
data: one to write out the pack file and one to compute its checksum.

Thinking about this some more, the existing repack code could be made
to work with minor changes. I would like to feed repack 1M revisions
sorted by file and then from newest to oldest. The problem is that my
expanded revs take up 12GB of disk space.

How about adding a flag to repack that simply tells it to delete the
objects when done with them? I'd still create all of the objects on
disk. Repack would assume they have at least been sorted by filename.
So repack could read in object names until it sees a change in the
file name, sort that batch by size, deltify, write out the pack, and
then delete the objects from the batch. Then it would repeat the
process for the next file name on stdin.
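The batching step in that proposal might look something like this (a sketch only; the `(filename, sha1)` input format and the function name are hypothetical, and the deltify/pack/delete steps are left as comments):

```python
import itertools

def batch_by_filename(pairs):
    """Group (filename, object-sha1) pairs, already sorted by filename,
    into per-file batches -- the unit the proposed repack flag would
    sort by size, deltify, pack, and then delete.  Hypothetical sketch."""
    for fname, group in itertools.groupby(pairs, key=lambda p: p[0]):
        yield fname, [sha for _, sha in group]

def repack_and_delete(pairs):
    for fname, shas in batch_by_filename(pairs):
        # sort shas by on-disk object size, deltify, write to the pack,
        # then unlink the loose objects so the fs can reuse the blocks
        pass
```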

I'm making two assumptions: first, that blocks from a deleted file
don't get written to disk; and second, that after deleting the file,
the file system will reuse the same blocks over and over. If those
assumptions are close to being true, then the cache shouldn't thrash.
They don't have to be totally true; close is good enough.

Of course, eliminating the files altogether would be even faster.

--
Jon Smirl
jonsmirl@xxxxxxxxx