I would say that the earlier parts of v2, which factor out various API pieces from existing code, are essentially complete, so they are not part of this iteration. The bulk-checkin patch from v2 has been tweaked a bit (deflate_to_pack() initializes the "already_hashed_to" pointer to 0, instead of to the current file position "seekback"), and the rest of the series builds on top of it to add a new in-pack encoding that I am tentatively calling "chunked".

The basic idea is to represent a large/huge blob as a concatenation of smaller blobs. An entry in a pack in the "chunked" representation records a list of object names of the component blob objects. The object name given to such a blob is computed exactly the same way as before. In other words, the name of an object does not depend on its representation; we hash "blob <size> NUL" and the whole large blob contents to come up with its name. It is *not* the hash of the component blob object names.

As can be seen in the log message of the "support chunked-object encoding" patch, many pieces are still missing from this series, and filling them in will be a long and tortuous journey. But we need to start somewhere.

I specifically excluded any heuristics to split large objects into chunks in a self-synchronising way, so that a small edit near the beginning of a large blob results in a handful of new component blobs followed by the same component blobs that were used to represent the blob before the edit, and I do not plan to work on that part myself. My impression from listening to Avery's plug for "bup" is that this is a solved problem; it should be reasonably straightforward to lift that logic and plug it into the framework presented here (once the codebase gets solid enough, that is).
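To make the two ideas above concrete, here is a small illustrative sketch (in Python for brevity; the helpers `git_blob_name` and `chunk_boundaries` are hypothetical names, not functions from this series, and the rolling-sum splitter is only a toy stand-in for a real bup-style rolling hash). It splits a buffer at content-defined boundaries, names each component the way Git names blobs, and demonstrates that the name of the whole blob is still the hash of its full contents, independent of how it was split into components:

```python
import hashlib
import random

def git_blob_name(data: bytes) -> str:
    """Name an object the way Git does: sha1("blob <size>" NUL + contents)."""
    return hashlib.sha1(b"blob %d\0" % len(data) + data).hexdigest()

def chunk_boundaries(data: bytes, window: int = 16, mask: int = 0x3F):
    """Toy content-defined splitter: cut where a rolling sum over the last
    `window` bytes matches a bit pattern.  Because boundaries depend only
    on local content, an edit near the start of the buffer produces a few
    new components and then re-synchronises with the old ones -- the
    self-synchronising property the cover letter leaves to a bup-style
    implementation."""
    start, rolling = 0, 0
    for i, b in enumerate(data):
        rolling += b
        if i >= window:
            rolling -= data[i - window]
        if i - start >= window and (rolling & mask) == mask:
            yield data[start:i + 1]
            start = i + 1
    if start < len(data):
        yield data[start:]

random.seed(42)
blob = bytes(random.getrandbits(8) for _ in range(8192))  # stand-in large blob
components = list(chunk_boundaries(blob))

# A "chunked" pack entry would record the names of the component blobs ...
component_names = [git_blob_name(c) for c in components]

# ... but the large blob's own name is computed from the whole contents,
# exactly as before; it is *not* derived from the component names.
assert git_blob_name(b"".join(components)) == git_blob_name(blob)
```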
After this series, the next step for me is likely to teach the streaming interface about "chunked" objects, and then to make pack-objects take notice and reuse the "chunked" representation when sending things out (which means that sending a "chunked" blob would involve sending the component blobs it uses, among other things), but I expect that this will extend well into next year.

Junio C Hamano (6):
  bulk-checkin: replace fast-import based implementation
  varint-in-pack: refactor varint encoding/decoding
  new representation types in the packstream
  bulk-checkin: allow the same data to be multiply hashed
  bulk-checkin: support chunked-object encoding
  chunked-object: fallback checkout codepaths

 Makefile               |    3 +
 builtin/add.c          |    5 +
 builtin/pack-objects.c |   34 ++---
 bulk-checkin.c         |  415 ++++++++++++++++++++++++++++++++++++++++++++++++
 bulk-checkin.h         |   17 ++
 cache.h                |   13 ++-
 config.c               |    9 +
 environment.c          |    2 +
 pack-write.c           |   50 +++++-
 pack.h                 |    2 +
 sha1_file.c            |  150 +++++++++---------
 split-chunk.c          |   28 ++++
 t/t1050-large.sh       |  135 +++++++++++++++-
 zlib.c                 |    9 +-
 14 files changed, 760 insertions(+), 112 deletions(-)
 create mode 100644 bulk-checkin.c
 create mode 100644 bulk-checkin.h
 create mode 100644 split-chunk.c

--
1.7.8.rc4.177.g4d64