git-pack-objects: cache small deltas between big objects --- On Mon, 14 May 2007, Nicolas Pitre wrote: > > I did some tests on differenent machines: > > > > - attached patch > > Total 6452 (delta 4581), reused 1522 (delta 0) > > 11354.38user 5451.60system 4:40:09elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k > > 0inputs+0outputs (0major+1371504762minor)pagefaults 0swaps > > =>75 MB pack size > > This is quite weird. I wonder what might cause such a large difference > in pack size. > > Your first patch is probably faster due to the use of memcmp() which is > certainly highly optimized, more than the comparison loop we have. It > is unfortunate that there is no library function to find the number of > identical bytes between two buffers. Or is there some? As far as I know, no. > But the size difference? That has certainly something to do with your > data set since your patch makes no significant difference on the git.git > nor the Linux kernel repos. Would it be possible for me to have a copy > of your repo for further analysis? I generate my repository by dumping a database. A script dumps a few tables with mysqldump into per table files and commits them. The dump file of each tables uses the one insert per line syntax, so nearly all lines of a file share a >=27 bytes prefix. A diffstat to the previous commit typically shows, that some hundred lines are added at the end of each file. A statistic of dropped hash table entries of the first thousand objects in create_delta_index shows, that up to 41 % are dropped, eg: Dropping: 3404731 of 8540915 (39.86 %) Dropping: 3397330 of 8525886 (39.85 %) Dropping: 3388813 of 8509648 (39.82 %) Dropping: 3381134 of 8494317 (39.80 %) Dropping: 3381128 of 8494294 (39.80 %) Dropping: 3375786 of 8483589 (39.79 %) Dropping: 3369725 of 8472206 (39.77 %) Dropping: 3364377 of 8460707 (39.76 %) Dropping: 3358120 of 8447813 (39.75 %) Dropping: 3351482 of 8435015 (39.73 %) Dropping: 3351481 of 8435007 (39.73 %) Dropping: 3351478 of 8435000 (39.73 %) Dropping: 3346193 of 8424055 (39.72 %) Dropping: 3339952 of 8410503 (39.71 %) Dropping: 3334324 of 8398253 (39.70 %) Dropping: 3370085 of 8384362 (40.19 %) Dropping: 3362979 of 8369151 (40.18 %) Dropping: 3354905 of 8353432 (40.16 %) So the current code will probably not always find the best match. As my last patch can match nearly the whole file after finding a match, the missing entries will not have a big influence. As a side effect, my last patch increased the total running time and minor page faults compared to the orignal version. I can not explain, why this happens. The deltifing phase needs to read and uncompress the same data. The only difference I can image, is, that different objects are selected as delta base for the writing phase, which require more reading time (eg. because they are bigger or have a longer delta chain in their current pack file). As a large amount of CPU time is spent in writing the pack file (reading all blobs again and applying the delta chain, computing the delta_index and recomputing the delta) and the deltas are small, I tried to cache the deltas from try_delta, if the compared blobs are big (and therefore the delta operation is expensive): - my last patch + this patch Total 6452 (delta 4581), reused 1522 (delta 0) 4176.04user 322.10system 1:14:58elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (0major+83869085minor)pagefaults 0swaps => 75MB builtin-pack-objects.c | 35 +++++++++++++++++++++++++---------- 1 files changed, 25 insertions(+), 10 deletions(-) diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c index 966f843..fe19272 100644 --- a/builtin-pack-objects.c +++ b/builtin-pack-objects.c @@ -35,6 +35,7 @@ struct object_entry { struct object_entry *delta_sibling; /* other deltified objects who * uses the same base as me */ + void *delta_data; /* cached delta (uncompressed) */ unsigned long delta_size; /* delta data size (uncompressed) */ enum object_type type; enum object_type in_pack_type; /* could be delta */ @@ -445,17 +446,24 @@ static unsigned long write_object(struct sha1file *f, } if (!to_reuse) { - buf = read_sha1_file(entry->sha1, &type, &size); - if (!buf) - die("unable to read %s", sha1_to_hex(entry->sha1)); - if (size != entry->size) - die("object %s size inconsistency (%lu vs %lu)", - sha1_to_hex(entry->sha1), size, entry->size); - if (entry->delta) { - buf = delta_against(buf, size, entry); + if (entry->delta_data) { + buf = entry->delta_data; size = entry->delta_size; obj_type = (allow_ofs_delta && entry->delta->offset) ? - OBJ_OFS_DELTA : OBJ_REF_DELTA; + OBJ_OFS_DELTA : OBJ_REF_DELTA; + } else { + buf = read_sha1_file(entry->sha1, &type, &size); + if (!buf) + die("unable to read %s", sha1_to_hex(entry->sha1)); + if (size != entry->size) + die("object %s size inconsistency (%lu vs %lu)", + sha1_to_hex(entry->sha1), size, entry->size); + if (entry->delta) { + buf = delta_against(buf, size, entry); + size = entry->delta_size; + obj_type = (allow_ofs_delta && entry->delta->offset) ? + OBJ_OFS_DELTA : OBJ_REF_DELTA; + } } /* * The object header is a byte of 'type' followed by zero or @@ -1359,10 +1367,17 @@ static int try_delta(struct unpacked *trg, struct unpacked *src, if (!delta_buf) return 0; + if (trg_entry->delta_data) + free (trg_entry->delta_data); + trg_entry->delta_data = 0; trg_entry->delta = src_entry; trg_entry->delta_size = delta_size; trg_entry->depth = src_entry->depth + 1; - free(delta_buf); + /* cache delta, if objects are large enough compared to delta size */ + if ((src_size >> 20) + (trg_size >> 21) > (delta_size >> 10)) + trg_entry->delta_data = delta_buf; + else + free(delta_buf); return 1; } -- 1.5.1.4.g01b3 - To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html