I try to use git with large blobs. Putting such blobs into pack files is a slow operation and requires lots of memory. So I take a look at the packing process. As the delta format only supports 32 bit offsets, the uncompressed blob size is limited to 4GB. The delta index has approximately the same size in memory as the uncompressed blob ((blob size)/16*(sizeof(index_entry)). git-pack-objects keep the delta index of all objects in the search window in memory. So doing a delta of 4 GB files is totally unrealistic: (4GB data + ~4GB index)* window size [default: 10]= 80 GB in the worst case In my case, the blobs are some hundred MB big, but git-pack-objects already uses some GB of memory. As the memory requirement of git-pack-objects is currently below the available memory of my system, I need not to address this issue yet. In the future, I'll propably need to create a patch to free big delta indexes in find_delta immediatly, after create_delta returned. This will increase the processing time, but better than not being able to pack objects. I tried to speed up the delta generation by searching for a common prefix, as my blobs are mostly append only. I tested it with about less than 1000 big blobs. The time for finding the deltas decreased from 17 to 14 minutes cpu time. When repacking the git-repostiory itself, I get the following numbers: Unmodified version (gcc-4.1): $ echo | time ./git-pack-objects --non-empty --all --reflog --unpacked=pack-d44dc76d0e873a7c7566bcc4503731b9a5640b30.pack --no-reuse-delta .git/.tmp-28449-pack Generating pack... Done counting 42553 objects. Deltifying 42553 objects... 100% (42553/42553) done Writing 42553 objects... 100% (42553/42553) done d44dc76d0e873a7c7566bcc4503731b9a5640b30 Total 42553 (delta 29605), reused 12346 (delta 0) 63.82user 0.80system 1:06.64elapsed 96%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (0major+29177minor)pagefaults 0swaps Patched version (gcc-4.1): $ echo | time ../git/git-pack-objects --non-empty --all --reflog --unpacked=pack-d44dc76d0e873a7c7566bcc4503731b9a5640b30.pack --no-reuse-delta .git/.tmp-28448-pack Generating pack... Done counting 42553 objects. Deltifying 42553 objects... 100% (42553/42553) done Writing 42553 objects... 100% (42553/42553) done d44dc76d0e873a7c7566bcc4503731b9a5640b30 Total 42553 (delta 29581), reused 12353 (delta 0) 62.13user 0.91system 1:05.07elapsed 96%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (0major+39690minor)pagefaults 0swaps So it can help improve the performance a little bit (62.13+0.91<->63.82+0.80) on normal repositories. The following patch is only for testing purposes and not cleaned up. mfg Martin Kögler diff-delta.c | 62 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 files changed, 62 insertions(+), 0 deletions(-) diff --git a/diff-delta.c b/diff-delta.c index 9f998d0..4f29eb7 100644 --- a/diff-delta.c +++ b/diff-delta.c @@ -250,6 +250,8 @@ create_delta(const struct delta_index *index, int inscnt; const unsigned char *ref_data, *ref_top, *data, *top; unsigned char *out; + const unsigned char* d, *dt, *t, *td; + unsigned int psize; if (!trg_buf || !trg_size) return NULL; @@ -283,6 +285,66 @@ create_delta(const struct delta_index *index, data = trg_buf; top = (const unsigned char *) trg_buf + trg_size; + psize = 0; + d=data; + dt=top-RABIN_WINDOW; + t=ref_data; + td=ref_top-RABIN_WINDOW; + while (d<dt && t<td && !memcmp (d, t, RABIN_WINDOW)) + { + psize+=RABIN_WINDOW; + d+=RABIN_WINDOW; + td+=RABIN_WINDOW; + } + while (psize) + { + unsigned int size=psize; + unsigned int offset=0, moff; + unsigned char *op; + + if(size>0xFFFFFF) + size=0xFFFFFF; + + + data+=size; + moff=offset; + offset+=size; + psize-=size; + + op = out + outpos++; + i = 0x80; + + if (moff & 0xff) { out[outpos++] = moff; i |= 0x01; } + moff >>= 8; + if (moff & 0xff) { out[outpos++] = moff; i |= 0x02; } + moff >>= 8; + if (moff & 0xff) { out[outpos++] = moff; i |= 0x04; } + moff >>= 8; + if (moff & 0xff) { out[outpos++] = moff; i |= 0x08; } + + if (size & 0xff) { out[outpos++] = size; i |= 0x10; } + size >>= 8; + if (size & 0xff) { out[outpos++] = size; i |= 0x20; } + size >>= 8; + if (size & 0xff) { out[outpos++] = size; i |= 0x40; } + + *op = i; + + if (outpos >= outsize - MAX_OP_SIZE) { + void *tmp = out; + outsize = outsize * 3 / 2; + if (max_size && outsize >= max_size) + outsize = max_size + MAX_OP_SIZE + 1; + if (max_size && outpos > max_size) + break; + out = xrealloc(out, outsize); + if (!out) { + free(tmp); + return NULL; + } + } + } + outpos++; val = 0; for (i = 0; i < RABIN_WINDOW && data < top; i++, data++) { -- 1.4.4.4 - To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html