On 8/5/06, Shawn Pearce <spearce@xxxxxxxxxxx> wrote:
I'm almost done with what I'm calling `git-fast-import`. It takes a stream of blobs on STDIN and writes the pack to a file, printing SHA1s in hex format to STDOUT. The basic format for STDIN is a 4-byte length (native format) followed by that many bytes of blob data. It prints the SHA1 for that blob to STDOUT, then waits for another length. It naively deltas each object against the prior object, so it would be best to feed it one ,v file at a time, working from the most recent revision back to the oldest revision. This works well for an RCS file, as that's the natural order to process the file in. :-)
I am already doing this.
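For reference, here is a rough sketch of the framing described above: a 4-byte length followed by that many bytes of blob data. The function names are made up for illustration, and reading "native format" as native byte order with a fixed 4-byte size is an assumption. The SHA1 helper uses the standard git blob hash (the SHA1 of a "blob <size>\0" header plus the data), which is presumably what would be printed back.

```python
import hashlib
import io
import struct


def write_blob(stream, data):
    """Frame one blob: 4-byte length (native byte order, assumed), then the data."""
    stream.write(struct.pack("=I", len(data)))
    stream.write(data)


def read_blobs(stream):
    """Yield blobs from a framed stream until it is exhausted."""
    while True:
        header = stream.read(4)
        if len(header) < 4:
            return  # end of stream, i.e. the writer closed its end
        (length,) = struct.unpack("=I", header)
        yield stream.read(length)


def blob_sha1(data):
    """Git object name for a blob: SHA1 over 'blob <size>\\0' followed by the data."""
    return hashlib.sha1(b"blob %d\0" % len(data) + data).hexdigest()


if __name__ == "__main__":
    # Feed revisions newest first, as suggested for a single ,v file.
    buf = io.BytesIO()
    for rev in (b"newest revision\n", b"older revision\n"):
        write_blob(buf, rev)
    buf.seek(0)
    for blob in read_blobs(buf):
        print(blob_sha1(blob))
```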
When done you close STDIN and it'll rip through and update the pack object count and the trailing checksum. This should let you pack the entire repository in delta format using only two passes over the data: one to write out the pack file and one to compute its checksum.
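A sketch of what that finishing step might look like, not the actual git-fast-import code. It assumes the pack was written with the usual 12-byte header ("PACK", a 4-byte version, and a 4-byte object count, both in network byte order) using a placeholder count, and that the 20-byte SHA1 trailer has not been appended yet. `finalize_pack` is a hypothetical name.

```python
import hashlib
import struct


def finalize_pack(path, object_count):
    """Patch the object count into the header, then append the SHA1 trailer."""
    with open(path, "r+b") as f:
        # The count lives at offset 8, after "PACK" and the version word.
        f.seek(8)
        f.write(struct.pack(">I", object_count))

        # Second pass over the data: checksum everything written so far.
        f.seek(0)
        sha = hashlib.sha1()
        for chunk in iter(lambda: f.read(1 << 16), b""):
            sha.update(chunk)

        # The trailer is the raw 20-byte digest of all preceding bytes.
        f.seek(0, 2)
        f.write(sha.digest())
    return sha.hexdigest()
```

The checksum has to be computed after the header is patched, since the header bytes are part of what gets hashed.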
Thinking about this some more, the existing repack code could be made to work with minor changes. I would like to feed repack 1M revisions that are sorted by file and then newest to oldest. The problem is that my expanded revs take up 12GB of disk space. How about adding a flag to repack that simply says to delete the objects when done with them? I'd still create all of the objects on disk. Repack would assume that they have at least been sorted by file name. So repack could read in object names until it sees a change in the file name, sort them by size, deltify, write out the pack, and then delete the objects from that batch. Then repeat this process for the next file name on stdin. I'm making two assumptions: first, that blocks from a deleted file don't get written to disk, and second, that after a delete the file system will use the same blocks over and over. If those assumptions are close to being true then the cache shouldn't thrash. They don't have to be totally true; close is good enough. Of course, eliminating the files altogether will be even faster.
--
Jon Smirl
jonsmirl@xxxxxxxxx
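The batching idea above could be sketched roughly as follows. This is only an outline under stated assumptions: the input is taken to be (file name, object path) pairs already sorted by file name, `pack_batch` is a hypothetical callback standing in for the deltify-and-write step, and sorting largest first is a guess at the direction, since only "sort them by size" is specified.

```python
import os
from itertools import groupby
from operator import itemgetter


def repack_in_batches(entries, pack_batch):
    """Process object files one source file name at a time, deleting as we go.

    entries    -- iterable of (file_name, object_path), sorted by file_name
    pack_batch -- hypothetical callback that deltifies and writes one batch
    """
    for name, group in groupby(entries, key=itemgetter(0)):
        paths = [path for _, path in group]
        # Sort the batch by size (largest first is an assumption).
        paths.sort(key=os.path.getsize, reverse=True)
        pack_batch(name, paths)
        # Done with this batch: remove the loose objects so their blocks
        # can be reused before they are ever flushed to disk.
        for path in paths:
            os.remove(path)
```

Because `groupby` only merges adjacent entries, this relies on the input really being sorted by file name, which is the same assumption repack would have to make.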