On Mon, Aug 13, 2018 at 3:14 AM, Ramsay Jones <ramsay@xxxxxxxxxxxxxxxxxxxx> wrote: > On 12/08/18 06:11, Christian Couder wrote: >> Because the inefficiency primarily arises when an >> object is delitified against another object that does > > s/delitified/deltified/ ? Ok, this will be in the next reroll if any. >> not exist in the same fork, we partition objects into >> sets that appear in the same fork, and define >> "delta islands". When finding delta base, we do not >> allow an object outside the same island to be >> considered as its base. >> --- /dev/null >> +++ b/delta-islands.c >> @@ -0,0 +1,493 @@ >> +#include "cache.h" >> +#include "attr.h" >> +#include "object.h" >> +#include "blob.h" >> +#include "commit.h" >> +#include "tag.h" >> +#include "tree.h" >> +#include "delta.h" >> +#include "pack.h" >> +#include "tree-walk.h" >> +#include "diff.h" >> +#include "revision.h" >> +#include "list-objects.h" >> +#include "progress.h" >> +#include "refs.h" >> +#include "khash.h" > > I was wondering how many copies of the inline functions > introduced by this header we had, so: > > $ nm git | grep ' t ' | cut -d' ' -f3 | sort | uniq -c | sort -rn | grep kh_ > 3 kh_resize_sha1 > 3 kh_put_sha1 > 3 kh_init_sha1 > 3 kh_get_sha1 > 1 kh_resize_str > 1 kh_resize_sha1_pos > 1 kh_put_str > 1 kh_put_sha1_pos > 1 kh_init_str > 1 kh_init_sha1_pos > 1 kh_get_str > 1 kh_get_sha1_pos > 1 kh_destroy_sha1 > $ > > Looking at the individual object files, we see: > > $ nm pack-bitmap-write.o | grep ' t ' | grep kh_ > 00000000000001cc t kh_get_sha1 > 00000000000001b7 t kh_init_sha1 > 000000000000085e t kh_put_sha1 > 0000000000000310 t kh_resize_sha1 > $ > > So, the two instances of the sha1 hash-map are never > destroyed (kh_destroy_sha1 is not present in the object > file). This is interesting (even though it seems related to more code than the current patch series). As those hash maps are in 'struct bitmap_writer' and a static instance is used: static struct bitmap_writer writer; it maybe ok. > $ nm pack-bitmap.o | grep ' t ' | grep kh_ > 00000000000002d9 t kh_destroy_sha1 > 000000000000032b t kh_get_sha1 > 0000000000000daa t kh_get_sha1_pos > 00000000000002c4 t kh_init_sha1 > 0000000000000d95 t kh_init_sha1_pos > 00000000000009bd t kh_put_sha1 > 0000000000001432 t kh_put_sha1_pos > 000000000000046f t kh_resize_sha1 > 0000000000000eee t kh_resize_sha1_pos > $ > > The sha1_pos hash-map is not destroyed here. Yeah, maybe a line like: kh_destroy_pos(b->ext_index.positions); is missing from free_bitmap_index()? Adding that should be in a separate patch from this series though. > $ nm delta-islands.o | grep ' t ' | grep kh_ > 00000000000002be t kh_get_sha1 > 0000000000000e52 t kh_get_str > 00000000000002a9 t kh_init_sha1 > 0000000000000e3d t kh_init_str > 0000000000000950 t kh_put_sha1 > 00000000000014e4 t kh_put_str > 0000000000000402 t kh_resize_sha1 > 0000000000000f96 t kh_resize_str > $ > > And neither the sha1 or str hash-maps are destroyed here. > (That is not necessarily a problem, of course! ;-) ) The instances are declared as static: static khash_sha1 *island_marks; static kh_str_t *remote_islands; so it maybe ok. >> +struct island_bitmap { >> + uint32_t refcount; >> + uint32_t bits[]; > > Use FLEX_ARRAY here? We are slowly moving toward requiring > certain C99 features, but I can't remember a flex array > weather-balloon patch. This was already discussed by Junio and Peff there: https://public-inbox.org/git/20180727130229.GB18599@xxxxxxxxxxxxxxxxxxxxx/ >> +}; >> +int in_same_island(const struct object_id *trg_oid, const struct object_id *src_oid) > > Hmm, what does the trg_ prefix stand for? > >> +{ >> + khiter_t trg_pos, src_pos; >> + >> + /* If we aren't using islands, assume everything goes together. */ >> + if (!island_marks) >> + return 1; >> + >> + /* >> + * If we don't have a bitmap for the target, we can delta it > > ... Ah, OK, trg_ => target. I am ok to replace "trg" with "target" (or maybe "dst"? or something else) and "src" with "source" if you think it would make things clearer. >> +static void add_ref_to_island(const char *island_name, const struct object_id *oid) >> +{ >> + uint64_t sha_core; >> + struct remote_island *rl = NULL; >> + >> + int hash_ret; >> + khiter_t pos = kh_put_str(remote_islands, island_name, &hash_ret); >> + >> + if (hash_ret) { >> + kh_key(remote_islands, pos) = xstrdup(island_name); >> + kh_value(remote_islands, pos) = xcalloc(1, sizeof(struct remote_island)); >> + } >> + >> + rl = kh_value(remote_islands, pos); >> + oid_array_append(&rl->oids, oid); >> + >> + memcpy(&sha_core, oid->hash, sizeof(uint64_t)); >> + rl->hash += sha_core; > > Hmm, so the first 64-bits of the oid of each ref that is part of > this island is added together as a 'hash' for the island. And this > is used to de-duplicate the islands? Any false positives? (does it > matter - it would only affect performance, not correctness, right?) I would think that a false positive from pure chance is very unlikely. We would need to approach billions of delta islands (as 2 to the power 64/2 is in the order of billions) for the probability to be significant. GitHub has less than 50 millions users and it is very unlikely that a significant proportion of these users will fork the same repo. Now if there is a false positive because two forks have exactly the same refs, then it is not a problem if they are considered the same, because they are actually the same. >> +} > Sorry, I spent so long reading this patch, I have run out of > time tonight (and I am busy tomorrow) to read the rest of the > series. Thank you for your interesting review, Christian.