Nico's and my packv4 topic is available from my fastimport.git fork on repo.or.cz: gitweb: http://repo.or.cz/w/git/fastimport.git git: git://repo.or.cz/git/fastimport.git branch: sp/pack4 We have thus far reformatted OBJ_TREEs with a new dictionary based compression scheme. In this scheme we pool the filenames and modes that appear within trees into a single table within the packfile. All trees are then converted to use a 22 byte record format: - 2 byte network byte order index into the string pool - 20 byte SHA-1 These trees are then stored *uncompressed* within the packfile, but are also still stored using our standard delta system (only the deltas for these trees are also stored uncompressed). The resulting savings is pretty good; on linux-2.6.git we are saving ~3.8 MiB as a result of this encoding alone: 141649022 pack2-linuxA.git 137625761 pack4-linuxB.git read_sha1_file() has been modified to unpack this new tree format back into the canonical format; something that I think is very unncessary for runtime given how easy it is to iterate the encoded tree, but is still critically important for tools like git-cat-file, git-index-pack and git-verify-pack. Future plans are to iterate the encoded tree directly, but performance is already faster despite needing to reconvert the tree: lh=825020c3866e7312947e17a0caa9dd1a5622bafc git --git-dir=pack2-linux.git rev-list $lh -- include/asm-m68k 3.97 real 3.60 user 0.15 sys 3.98 real 3.60 user 0.15 sys 3.98 real 3.60 user 0.15 sys 3.98 real 3.60 user 0.15 sys 3.98 real 3.60 user 0.15 sys git --git-dir=pack4-linux.git rev-list $lh -- include/asm-m68k 3.52 real 3.17 user 0.13 sys 3.46 real 3.17 user 0.13 sys 3.51 real 3.17 user 0.13 sys 3.52 real 3.18 user 0.13 sys 3.53 real 3.16 user 0.13 sys I'll take 500 milliseconds savings anyday, thanks! :-) Nico and I have only started working on commits, so the above results still utilize the packv2 format for OBJ_COMMIT and do not take into account any of our proposed concepts there. The impetus for packv4 is to format the packfile in such a way that we can work with the data faster at runtime for common operations, like rev-list and its builtin path limiter. We also want to make reachability analysis (critical for packing and fsck) faster. Any reduction in storage size is considered a bonus here, though obviously there is some correlation between size of input data and the time required to process it. ;-) The patch series for this is getting large. Right now we are up to 32 patches in the series. Given where we are and where we want to go I'm predicting this series will come out at close to 100 patches. Of course that's partly because I'm working in fairly small units, slowly iterating the code into the final version we want. I am constantly rebasing the sp/pack4 topic noted above, so the patch count is not really because I'm going back and fixing things in later patches. Its because I'm trying to slowly iterate the runtime side of things in digestable changes, then the packing side, so that the system still works at every single commit in the series. Yes, its a *BIG* set of code changed. Obviously this series has a heavy hand on sha1_file.c, builtin-pack-objects.c, builtin-unpack-objects.c, index-pack.c. But it will also start to hit less obvious places like commit.c and tree-walk.c as we start to support walking the encoded objects directly. Given the huge size of the series, and the amount of effort we are tossing into it, and the fact that I'm trying to make it pu-ready by early next week, we would appreciate it if folks could keep changes to the above mentioned files limited to critical bug fixes only. :) -- Shawn. - To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html