On 4/9/07, Nicolas Pitre <nico@xxxxxxx> wrote:
> On Mon, 9 Apr 2007, Dana How wrote:
>
> > Wouldn't the following address the "object count unknown
> > at the start of sequential pack writing" problem:
> > Write 0 for object count in the header. This is a flag to look for
> > another header of same format just before the final SHA-1 which
> > has the correct count. The SHA-1 is still a checksum of everything
> > before it and no seeking/rewriting is needed on generation.
>
> No. You really want to know up front how many objects a pack contains
> when streaming it. And this is not only for packs written to stdout.
OK, let me ask a dumb question and flog one last additional obvious idea. Does your wanting to know stem from more than wanting to stick to one malloc of all the object info at once?

My suggestion quoted above is actually a change to the .pack format. With all the other ideas for .pack format changes floating around, let me withdraw that and suggest a simpler one: write a "0" in the header, and terminate the pack with a sentinel in object format before the final SHA-1s. The sentinel would be type=OBJ_NONE/length=0, i.e. a null byte. "Not much" would need to be updated to tolerate it and you could count objects while looking for it (if header has 0) during normal processing. (I'm reacting to your word "streaming".)
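To make the sentinel idea concrete, here is a minimal sketch in Python rather than git's C. It is a toy under stated assumptions, not how git reads or writes packs: every object is a plain zlib-deflated blob (no deltas), the whole pack fits in memory, and the names write_stream and count_objects are made up for illustration. Only the 12-byte header layout, the type/size entry encoding and the trailing SHA-1 follow the existing pack format.

```python
import hashlib
import struct
import zlib

OBJ_NONE = 0  # type 0, used here as the proposed end-of-pack sentinel
OBJ_BLOB = 3


def encode_entry_header(obj_type, size):
    """Pack-style entry header: 3 type bits and 4 size bits in the first
    byte, then 7 size bits per continuation byte (MSB = "more follows")."""
    byte = (obj_type << 4) | (size & 0x0F)
    size >>= 4
    out = bytearray()
    while size:
        out.append(byte | 0x80)
        byte = size & 0x7F
        size >>= 7
    out.append(byte)
    return bytes(out)


def write_stream(payloads):
    """Hypothetical writer: count field 0, entries, sentinel, SHA-1.
    Nothing already written is ever revisited."""
    body = bytearray(b"PACK" + struct.pack(">II", 2, 0))  # 0 = count unknown
    for data in payloads:
        body += encode_entry_header(OBJ_BLOB, len(data)) + zlib.compress(data)
    body += encode_entry_header(OBJ_NONE, 0)  # encodes to one null byte
    return bytes(body) + hashlib.sha1(body).digest()


def count_objects(pack):
    """Hypothetical reader: count entries while looking for the sentinel."""
    body, trailer = pack[:-20], pack[-20:]
    if hashlib.sha1(body).digest() != trailer:
        raise ValueError("checksum mismatch")
    magic, _version, declared = struct.unpack(">4sII", body[:12])
    if magic != b"PACK":
        raise ValueError("not a pack")
    pos, count = 12, 0
    while True:
        byte = body[pos]
        pos += 1
        obj_type = (byte >> 4) & 7
        size, shift = byte & 0x0F, 4
        while byte & 0x80:
            byte = body[pos]
            pos += 1
            size |= (byte & 0x7F) << shift
            shift += 7
        if obj_type == OBJ_NONE and size == 0:
            break  # sentinel reached
        inflater = zlib.decompressobj()
        data = inflater.decompress(body[pos:])
        if len(data) != size:
            raise ValueError("size mismatch in entry %d" % count)
        # unused_data is whatever follows the deflated stream just consumed.
        pos = len(body) - len(inflater.unused_data)
        count += 1
    if declared not in (0, count):
        raise ValueError("header count disagrees with entries found")
    return count
```

The reader still has to inflate every entry to find where the next one starts, so the count only becomes known at the end of the pass, which is exactly the cost being discussed.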
> > Finally, when I generate several 2GB split packfiles, I do notice
> > the slight delay for fixup_header_footer(), and I do think it's a bit
> > ugly, but in quantitative terms it's an insignificant part of a long
> > operation that's infrequently performed. Does this need to be
> > optimized at all?
>
> Maybe, maybe not. That depends how much data we think GIT could be used
> to manage in the future. With a 1TB pack file you definitely want to
> optimize that case.
OK. Just FYI, we have a Perforce repository near 200GB and this is not what would concern me right now if we converted all or part of it to git. Of course that would depend on the packing schedule.
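For comparison, the fixup step under discussion can be sketched as below. This is a Python illustration, not git's fixup_header_footer(): it assumes the fixup amounts to patching the count field and then re-reading the file to recompute the trailing SHA-1, and the name write_then_fixup and its arguments are hypothetical.

```python
import hashlib
import io
import struct


def write_then_fixup(out, entries):
    """out: seekable binary file open for reading and writing.
    entries: iterable of already-encoded pack entries (bytes)."""
    out.write(b"PACK" + struct.pack(">II", 2, 0))  # placeholder count
    count = 0
    for entry in entries:
        out.write(entry)
        count += 1

    # Fixup, part 1: seek back and patch the real count into the header.
    out.seek(8)
    out.write(struct.pack(">I", count))

    # Fixup, part 2: re-read the whole file to compute the checksum.
    # This second pass is the part whose cost grows with pack size.
    out.seek(0)
    sha = hashlib.sha1()
    for chunk in iter(lambda: out.read(1 << 16), b""):
        sha.update(chunk)
    out.seek(0, io.SEEK_END)
    out.write(sha.digest())
    return count
```

The extra work is one small seek-and-write plus one sequential read of the file, which is why it is a slight delay at 2GB and a more noticeable one at 1TB.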
> OTOH this could wait for the real pack v4 too.
Makes sense to me. The fewer format changes the better.

BTW, I've caught up on reading the mailing list archives, but I don't recall seeing any overview of the objectives of pack v2/v3/v4. Does that exist anywhere? I didn't see it in Documentation or Documentation/technical. It would probably reduce uninformed questions like the above. I've deduced rationales for what miscellaneous details I have seen, except moving the SHA-1s from .idx to .pack (?).

Thanks,
--
Dana L. How danahow@xxxxxxxxx +1 650 804 5991 cell