Traditionally, git always read the full contents of an object in memory before performing various operations on it, e.g. comparing for diff, writing it to the working tree, etc. A huge blob that you cannot fit in memory was very cumbersome to handle. Recently "diff" learned to avoid reading the contents only to say "Binary files differ" when these large blobs are marked as binary. Also there is a topic cooking to teach "git add" to stream a large file directly to a packfile without keeping the whole thing in core. The "checkout" codepath is to learn the trick next, and this is the series to attempt to do so. These would apply cleanly on top of three other topics still in 'next' or 'pu', namely: - jc/convert that cleans up the conversion; - jc/replacing that cleans up the object replacement; - jc/bigfile that teaches "git add" to handle large files. Patch 1 and 5 are trivial clean-ups and refactoring. These could be separated out of the series and applied much earlier, but nothing other than this series directly benefit from these changes, so they are here in the series. Patch 2, 3, and 4 enhances the sha1_file layer. Patch 6 introduces a new API that takes an object name and gives back a "handle" you can read from (think: FILE *) the contents of the object. The implementation at this step is deliberately kept simple: it just calls read_sha1_file() to read everything in memory. Patch 7 then uses the new API in the "git checkout" codepath, namely, in entry.c::write_entry() function. At this point, any blob that does not require smudge filters including crlf conversion would pass through this new codepath and used the 'incore' case of the streaming API, which means that (1) "hold everything in memory and process" limitation is not lifted yet, and that (2) breakage detected in here would have meant either the simple 'incore' implementation of the streaming API is broken (not likely), or its caller streaming_write_entry() is broken (more likely). Patch 8 teaches the new write-out codepath to detect and make holes in the resulting file. This is primarily meant to help testing---when you add a large test file that weighs 1GB with "git add" (see how it is done in the test t/t1050-large.sh on jc/bigfile topic) and check it out, you do not want to end up with 1GB file fully populated with real blocks in your working tree. Patch 9 teaches the streaming API how to read a non-delta object directly from packfile, without holding the entire result in the memory. This is the representation jc/bigfile topic creates for a huge file, and the primary interest of this topic. Patch 10 and 11 teaches the streaming API how to read a loose object, without holding the entire result in the memory. This is not strictly necessary for the purpose of handling the output from jc/bigfile, but not having to hold everything in core by itself may be a plus. Interested parties may want to measure the performance impact of the last three patches. The series deliberately ignores core.bigfileThreashold and let small and large blobs alike go through the streaming_write_entry() codepath, but it _might_ turn out that we would want to use the new code only for large-ish blobs. Junio C Hamano (11): packed_object_info_detail(): do not return a string sha1_object_info_extended(): expose a bit more info sha1_object_info_extended(): hint about objects in delta-base cache unpack_object_header(): make it public write_entry(): separate two helper functions out streaming: a new API to read from the object store streaming_write_entry(): use streaming API in write_entry() streaming_write_entry(): support files with holes streaming: read non-delta incrementally from a pack sha1_file.c: expose helpers to read loose objects streaming: read loose objects incrementally Makefile | 2 + builtin/verify-pack.c | 4 +- cache.h | 36 +++++- convert.c | 23 +++ entry.c | 111 ++++++++++++--- sha1_file.c | 71 ++++++++-- streaming.c | 376 +++++++++++++++++++++++++++++++++++++++++++++++++ streaming.h | 12 ++ 8 files changed, 600 insertions(+), 35 deletions(-) create mode 100644 streaming.c create mode 100644 streaming.h -- 1.7.5.1.365.g32b65 -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html