[RFC] Optimize diff-delta.c

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



I try to use git with large blobs. Putting such blobs into pack files
is a slow operation and requires lots of memory. So I take a look at
the packing process.

As the delta format only supports 32 bit offsets, the uncompressed
blob size is limited to 4GB.

The delta index has approximately the same size in memory as the
uncompressed blob ((blob size)/16*(sizeof(index_entry)).
git-pack-objects keep the delta index of all objects in the search
window in memory.

So doing a delta of 4 GB files is totally unrealistic:
 (4GB data + ~4GB index)* window size [default: 10]= 80 GB in the worst case

In my case, the blobs are some hundred MB big, but git-pack-objects
already uses some GB of memory. As the memory requirement of
git-pack-objects is currently below the available memory of my system,
I need not to address this issue yet.

In the future, I'll propably need to create a patch to free big delta
indexes in find_delta immediatly, after create_delta returned. This
will increase the processing time, but better than not being able to 
pack objects.

I tried to speed up the delta generation by searching for a common
prefix, as my blobs are mostly append only. I tested it with about
less than 1000 big blobs. The time for finding the deltas decreased
from 17 to 14 minutes cpu time.

When repacking the git-repostiory itself, I get the following numbers:

Unmodified version (gcc-4.1):
$ echo | time ./git-pack-objects --non-empty --all --reflog --unpacked=pack-d44dc76d0e873a7c7566bcc4503731b9a5640b30.pack --no-reuse-delta  .git/.tmp-28449-pack
Generating pack...
Done counting 42553 objects.
Deltifying 42553 objects...
 100% (42553/42553) done
Writing 42553 objects...
 100% (42553/42553) done
d44dc76d0e873a7c7566bcc4503731b9a5640b30
Total 42553 (delta 29605), reused 12346 (delta 0)
63.82user 0.80system 1:06.64elapsed 96%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+29177minor)pagefaults 0swaps

Patched version (gcc-4.1):
$ echo | time ../git/git-pack-objects --non-empty --all --reflog --unpacked=pack-d44dc76d0e873a7c7566bcc4503731b9a5640b30.pack --no-reuse-delta  .git/.tmp-28448-pack
Generating pack...
Done counting 42553 objects.
Deltifying 42553 objects...
 100% (42553/42553) done
Writing 42553 objects...
 100% (42553/42553) done
d44dc76d0e873a7c7566bcc4503731b9a5640b30
Total 42553 (delta 29581), reused 12353 (delta 0)
62.13user 0.91system 1:05.07elapsed 96%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+39690minor)pagefaults 0swaps

So it can help improve the performance a little bit (62.13+0.91<->63.82+0.80) on normal
repositories.

The following patch is only for testing purposes and not cleaned up.

mfg Martin Kögler

 diff-delta.c |   62 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 62 insertions(+), 0 deletions(-)

diff --git a/diff-delta.c b/diff-delta.c
index 9f998d0..4f29eb7 100644
--- a/diff-delta.c
+++ b/diff-delta.c
@@ -250,6 +250,8 @@ create_delta(const struct delta_index *index,
 	int inscnt;
 	const unsigned char *ref_data, *ref_top, *data, *top;
 	unsigned char *out;
+	const unsigned char* d, *dt, *t, *td;
+	unsigned int psize;
 
 	if (!trg_buf || !trg_size)
 		return NULL;
@@ -283,6 +285,66 @@ create_delta(const struct delta_index *index,
 	data = trg_buf;
 	top = (const unsigned char *) trg_buf + trg_size;
 
+	psize = 0;
+	d=data;
+	dt=top-RABIN_WINDOW;
+	t=ref_data;
+	td=ref_top-RABIN_WINDOW;
+	while (d<dt && t<td && !memcmp (d, t, RABIN_WINDOW))
+	{
+	    psize+=RABIN_WINDOW;
+	    d+=RABIN_WINDOW;
+	    td+=RABIN_WINDOW;
+	}
+	while (psize)
+	{
+	    unsigned int size=psize;
+	    unsigned int offset=0, moff;
+	    unsigned char *op;
+
+	    if(size>0xFFFFFF)
+		size=0xFFFFFF;
+
+
+	    data+=size;
+	    moff=offset;
+	    offset+=size;
+	    psize-=size;
+
+	    op = out + outpos++;
+	    i = 0x80;
+
+	    if (moff & 0xff) { out[outpos++] = moff; i |= 0x01; }
+	    moff >>= 8;
+	    if (moff & 0xff) { out[outpos++] = moff; i |= 0x02; }
+	    moff >>= 8;
+	    if (moff & 0xff) { out[outpos++] = moff; i |= 0x04; }
+	    moff >>= 8;
+	    if (moff & 0xff) { out[outpos++] = moff; i |= 0x08; }
+	    
+	    if (size & 0xff) { out[outpos++] = size; i |= 0x10; }
+	    size >>= 8;
+	    if (size & 0xff) { out[outpos++] = size; i |= 0x20; }
+	    size >>= 8;
+	    if (size & 0xff) { out[outpos++] = size; i |= 0x40; }
+
+	    *op = i;
+
+		if (outpos >= outsize - MAX_OP_SIZE) {
+			void *tmp = out;
+			outsize = outsize * 3 / 2;
+			if (max_size && outsize >= max_size)
+				outsize = max_size + MAX_OP_SIZE + 1;
+			if (max_size && outpos > max_size)
+				break;
+			out = xrealloc(out, outsize);
+			if (!out) {
+				free(tmp);
+				return NULL;
+			}
+		}
+	}
+
 	outpos++;
 	val = 0;
 	for (i = 0; i < RABIN_WINDOW && data < top; i++, data++) {
-- 
1.4.4.4

-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]