Thanks, Stolee and Peff, for taking a look at it. Here is v2. It is
mostly unchanged from v1, except for expanded commit messages and code
comments. I've also added a documentation clarification that
core.deltaBaseCacheLimit is per-thread; it appears as the first patch
in this series.

From patch 3 (now patch 4):

> > +	int i;
>
> Technically this probably ought to be a size_t as well, but I'm much
> more concerned about the allocation ones, where we might accidentally
> overflow and underallocate a buffer. Overflowing "i" would probably just
> lead to an error or bad result.

I believe this needs to be signed, since we're iterating in reverse
order, so I made it a ssize_t instead (note the extra "s" in front).
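Here is a standalone sketch (not code from the patch) of why the index
must be signed: with an unsigned type, the loop condition "i >= 0" is
always true, because decrementing an unsigned zero wraps around to the
type's maximum value instead of becoming -1.

  #include <stdio.h>
  #include <sys/types.h> /* ssize_t */

  int main(void)
  {
      size_t nr = 3;
      ssize_t i;

      /*
       * If "i" were a size_t, then once it reached 0, "i--" would
       * wrap to SIZE_MAX and this loop would never terminate.
       */
      for (i = nr - 1; i >= 0; i--)
          printf("%zd\n", i);
      return 0;
  }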
From patch 4 (now patch 5):

> > Whenever we make a struct base_data, immediately calculate its delta
> > children. This eliminates confusion as to when the
> > {ref,ofs}_{first,last} fields are initialized.
>
> That _seems_ like a good idea, but I'm a little worried just because I
> don't entirely understand why it was being done lazily before. If you've
> puzzled all that out, it would be nice to make the argument in the
> commit message.

I've added an explanation in the commit message.
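To give the shape of the change in code form, here is a toy sketch
(stand-in types and a made-up child lookup; not the patch's actual
code): the point is that the child range is computed at construction
time, so a struct base_data never exists in a half-initialized state.

  #include <stdlib.h>

  /* Toy stand-ins for the real index-pack types. */
  struct object_entry { int nr; };

  struct base_data {
      struct object_entry *obj;
      int ref_first, ref_last; /* range of this base's delta children */
  };

  /* Toy lookup: pretend the children of base n occupy a fixed range. */
  static void find_ref_delta_children(int nr, int *first, int *last)
  {
      *first = 2 * nr + 1;
      *last = 2 * nr + 2;
  }

  /*
   * Eager initialization: by the time make_base() returns, every field
   * is set, so no caller ever sees a "children not yet computed" state.
   */
  static struct base_data *make_base(struct object_entry *obj)
  {
      struct base_data *base = calloc(1, sizeof(*base));

      if (!base)
          exit(1);
      base->obj = obj;
      find_ref_delta_children(obj->nr, &base->ref_first, &base->ref_last);
      return base;
  }

  int main(void)
  {
      struct object_entry obj = { 0 };
      struct base_data *base = make_base(&obj);

      free(base);
      return 0;
  }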
Jonathan Tan (7):
  Documentation: deltaBaseCacheLimit is per-thread
  index-pack: unify threaded and unthreaded code
  index-pack: remove redundant parameter
  index-pack: remove redundant child field
  index-pack: calculate {ref,ofs}_{first,last} early
  index-pack: make resolve_delta() assume base data
  index-pack: make quantum of work smaller

 Documentation/config/core.txt |   2 +-
 builtin/index-pack.c          | 446 ++++++++++++++++++----------------
 2 files changed, 242 insertions(+), 206 deletions(-)

Range-diff against v1:
-:  ---------- > 1:  0a6777a243 Documentation: deltaBaseCacheLimit is per-thread
1:  7562afbaa9 = 2:  b19d6131e0 index-pack: unify threaded and unthreaded code
2:  a8567333dc ! 3:  f01f069a08 index-pack: remove redundant parameter
    @@ Metadata
     ## Commit message ##
        index-pack: remove redundant parameter

    +    find_{ref,ofs}_delta_{,children} take an enum object_type parameter, but
    +    the object type is already present in the name of the function. Remove
    +    that parameter from these functions.
    +
        Signed-off-by: Jonathan Tan <jonathantanmy@xxxxxxxxxx>

     ## builtin/index-pack.c ##
3:  97eddde2ec ! 4:  3359b66b84 index-pack: remove redundant child field
    @@ Metadata
     ## Commit message ##
        index-pack: remove redundant child field

    -    Instead, recompute ancestry if we ever need to reclaim memory.
    +    This is refactoring 1 of 2 to simplify struct base_data.
    +
    +    In index-pack, each thread maintains a doubly-linked list of the delta
    +    chain that it is currently processing (the "base" and "child" pointers
    +    in struct base_data). When a thread exceeds the delta base cache limit
    +    and needs to reclaim memory, it uses the "child" pointers to traverse
    +    the lineage, reclaiming the memory of the eldest delta bases first.
    +
    +    A subsequent patch will perform memory reclaiming in a different way and
    +    will thus no longer need the "child" pointer. Because the "child"
    +    pointer is redundant even now, remove it so that the aforementioned
    +    subsequent patch will be clearer. In the meantime, reclaim memory in the
    +    reverse order of the "base" pointers.
    +
        Signed-off-by: Jonathan Tan <jonathantanmy@xxxxxxxxxx>

    @@ builtin/index-pack.c: static void free_base_data(struct base_data *c)
    -    if (b->data && b != retain)
    -        free_base_data(b);
    +    struct base_data **ancestry = NULL;
    -+    int nr = 0, alloc = 0;
    -+    int i;
    ++    size_t nr = 0, alloc = 0;
    ++    ssize_t i;
    +
    +    if (data->base_cache_used <= delta_base_cache_limit)
    +        return;
    @@ builtin/index-pack.c: static void free_base_data(struct base_data *c)
    +    for (b = youngest_child->base; b != NULL; b = b->base) {
    +        ALLOC_GROW(ancestry, nr + 1, alloc);
    +        ancestry[nr++] = b;
    -+    }
    +    for (i = nr - 1;
    +         i >= 0 && data->base_cache_used > delta_base_cache_limit;
    +         i--) {
    +        if (ancestry[i]->data)
    +            free_base_data(ancestry[i]);
    -    }
    ++    }
    +    free(ancestry);
     }
4:  5d9687145d ! 5:  7f18480c45 index-pack: calculate {ref,ofs}_{first,last} early
    @@ Metadata
     ## Commit message ##
        index-pack: calculate {ref,ofs}_{first,last} early

    +    This is refactoring 2 of 2 to simplify struct base_data.
    +
        Whenever we make a struct base_data, immediately calculate its delta
        children. This eliminates confusion as to when the
        {ref,ofs}_{first,last} fields are initialized.

    +    Before this patch, the delta children were calculated at the last
    +    possible moment. This allowed the members of struct base_data to be
    +    populated in any order, superficially useful when we have the object
    +    contents before the struct object_entry. But this makes reasoning about
    +    the state of struct base_data more complicated, hence this patch.
    +
        Signed-off-by: Jonathan Tan <jonathantanmy@xxxxxxxxxx>

     ## builtin/index-pack.c ##
5:  ca4997017d = 6:  910b15219f index-pack: make resolve_delta() assume base data
6:  4f7c252a7c ! 7:  2f2e36d3ef index-pack: make quantum of work smaller
    @@ Metadata
     ## Commit message ##
        index-pack: make quantum of work smaller

    +    Currently, when index-pack resolves deltas, it does not split up delta
    +    trees into threads: each delta base root (an object that is not a
    +    REF_DELTA or OFS_DELTA) can go into its own thread, but all deltas on
    +    that root (direct or indirect) are processed in the same thread.
    +
    +    This is a problem when a repository contains a large text file (thus,
    +    delta-able) that is modified many times - delta resolution time during
    +    fetching is dominated by processing the deltas corresponding to that
    +    text file.
    +
    +    This patch contains a solution to that. When cloning using
    +
    +      git -c core.deltabasecachelimit=1g clone \
    +        https://fuchsia.googlesource.com/third_party/vulkan-cts
    +
    +    on my laptop, clone time improved from 3m2s to 2m5s (using 3 threads,
    +    which is the default).
    +
    +    The solution is to have a global work stack. This stack contains delta
    +    bases (objects, whether appearing directly in the packfile or generated
    +    by delta resolution, that themselves have delta children) that need to
    +    be processed; whenever a thread needs work, it peeks at the top of the
    +    stack and processes its next unprocessed child. If a thread finds the
    +    stack empty, it will look for more delta base roots to push on the stack
    +    instead.
    +
    +    The main weakness of having a global work stack is that more time is
    +    spent in the mutex, but profiling has shown that most time is spent in
    +    the resolution of the deltas themselves, so this shouldn't be an issue
    +    in practice. In any case, experimentation (as described in the clone
    +    command above) shows that this patch is a net improvement.
    +
    +    Signed-off-by: Jonathan Tan <jonathantanmy@xxxxxxxxxx>

     ## builtin/index-pack.c ##
    @@ builtin/index-pack.c: struct base_data {
         struct object_entry *obj;
         int ref_first, ref_last;
         int ofs_first, ofs_last;
    ++    /*
    ++     * Threads should increment retain_data if they are about to call
    ++     * patch_delta() using this struct's data as a base, and decrement this
    ++     * when they are done. While retain_data is nonzero, this struct's data
    ++     * will not be freed even if the delta base cache limit is exceeded.
    ++     */
    +    int retain_data;
    ++    /*
    ++     * The number of direct children that have not been fully processed
    ++     * (entered work_head, entered done_head, left done_head). When this
    ++     * number reaches zero, this struct base_data can be freed.
    ++     */
    +    int children_remaining;

         /* Not initialized by make_base(). */
    @@ builtin/index-pack.c: struct base_data {
         unsigned long size;
     };

    ++/*
    ++ * Stack of struct base_data that have unprocessed children.
    ++ * threaded_second_pass() uses this as a source of work (the other being the
    ++ * objects array).
    ++ */
    +LIST_HEAD(work_head);
    ++
    ++/*
    ++ * Stack of struct base_data that have children, all of whom have been
    ++ * processed or are being processed, and at least one child is being processed.
    ++ * These struct base_data must be kept around until the last child is
    ++ * processed.
    ++ */
    +LIST_HEAD(done_head);
    ++
    ++/*
    ++ * All threads share one delta base cache.
    ++ */
    +size_t base_cache_used;
    +size_t base_cache_limit;
    +
    @@ builtin/index-pack.c: static void free_base_data(struct base_data *c)
    -    struct base_data *b;
    -    struct thread_local *data = get_thread_data();
    -    struct base_data **ancestry = NULL;
    --    int nr = 0, alloc = 0;
    --    int i;
    +-    size_t nr = 0, alloc = 0;
    +-    ssize_t i;
    +    struct list_head *pos;

    -    if (data->base_cache_used <= delta_base_cache_limit)
    @@ builtin/index-pack.c: static void free_base_data(struct base_data *c)
     }

     static int is_delta_type(enum object_type type)
    +@@ builtin/index-pack.c: static void sha1_object(const void *data, struct object_entry *obj_entry,
    + }
    +
    + /*
    +- * This function is part of find_unresolved_deltas(). There are two
    +- * walkers going in the opposite ways.
    +- *
    +- * The first one in find_unresolved_deltas() traverses down from
    +- * parent node to children, deflating nodes along the way. However,
    +- * memory for deflated nodes is limited by delta_base_cache_limit, so
    +- * at some point parent node's deflated content may be freed.
    +- *
    +- * The second walker is this function, which goes from current node up
    ++ * Walk from current node up
    + * to top parent if necessary to deflate the node. In normal
    + * situation, its parent node would be already deflated, so it just
    + * needs to apply delta.
    @@ builtin/index-pack.c: static void *get_base_data(struct base_data *c)
         if (!delta_nr) {
             c->data = get_data_from_pack(obj);
    @@ builtin/index-pack.c: static int compare_ref_delta_entry(const void *a, const vo
     }

    -static void resolve_base(struct object_entry *obj)
    -+static void *do_work(void *data)
    -{
    -    struct base_data *base_obj = make_base(obj, NULL);
    -
    -    find_unresolved_deltas(base_obj);
    -}
    -
    --static void *threaded_second_pass(void *data)
    --{
    + static void *threaded_second_pass(void *data)
    + {
    -    set_thread_data(data);
    +    if (data)
    +        set_thread_data(data);
    @@ builtin/index-pack.c: static int compare_ref_delta_entry(const void *a, const vo
    -            work_unlock();
    -            break;
    +        if (list_empty(&work_head)) {
    ++            /*
    ++             * Take an object from the object array.
    ++             */
    +            while (nr_dispatched < nr_objects &&
    +                   is_delta_type(objects[nr_dispatched].type))
    +                nr_dispatched++;
    @@ builtin/index-pack.c: static int compare_ref_delta_entry(const void *a, const vo
    +            }
    +            child_obj = &objects[nr_dispatched++];
    +        } else {
    ++            /*
    ++             * Peek at the top of the stack, and take a child from
    ++             * it.
    ++             */
    +            parent = list_first_entry(&work_head, struct base_data,
    +                                      list);
    +
    @@ builtin/index-pack.c: static int compare_ref_delta_entry(const void *a, const vo
    +
    +            if (parent->ref_first > parent->ref_last &&
    +                parent->ofs_first > parent->ofs_last) {
    ++                /*
    ++                 * This parent has run out of children, so move
    ++                 * it to done_head.
    ++                 */
    +                list_del(&parent->list);
    +                list_add(&parent->list, &done_head);
    +            }
    +
    ++            /*
    ++             * Ensure that the parent has data, since we will need
    ++             * it later.
    ++             *
    ++             * NEEDSWORK: If parent data needs to be reloaded, this
    ++             * prolongs the time that the current thread spends in
    ++             * the mutex. A mitigating factor is that parent data
    ++             * needs to be reloaded only if the delta base cache
    ++             * limit is exceeded, so in the typical case, this does
    ++             * not happen.
    ++             */
    +            get_base_data(parent);
    +            parent->retain_data++;
         }
    @@ builtin/index-pack.c: static int compare_ref_delta_entry(const void *a, const vo
    +        if (parent)
    +            parent->retain_data--;
    +        if (child->data) {
    ++            /*
    ++             * This child has its own children, so add it to
    ++             * work_head.
    ++             */
    +            list_add(&child->list, &work_head);
    +            base_cache_used += child->size;
    +            prune_base_data(NULL);
    +        } else {
    ++            /*
    ++             * This child does not have its own children. It may be
    ++             * the last descendant of its ancestors; free those
    ++             * that we can.
    ++             */
    +            struct base_data *p = parent;
    +
    +            while (p) {
    @@ builtin/index-pack.c: static void resolve_deltas(void)
         if (nr_threads > 1 || getenv("GIT_FORCE_THREADS")) {
             init_thread();
             for (i = 0; i < nr_threads; i++) {
    -            int ret = pthread_create(&thread_data[i].thread, NULL,
    --                                     threaded_second_pass, thread_data + i);
    -+                                     do_work, thread_data + i);
    -            if (ret)
    -                die(_("unable to create thread: %s"),
    -                    strerror(ret));
    -@@ builtin/index-pack.c: static void resolve_deltas(void)
    -        cleanup_thread();
    -        return;
    -    }
    --    threaded_second_pass(&nothread_data);
    -+    do_work(&nothread_data);
    -    }
    -
    - /*
    @@ builtin/index-pack.c: static void fix_unresolved_deltas(struct hashfile *f)
         for (i = 0; i < nr_ref_deltas; i++) {
             struct ref_delta_entry *d = sorted_by_pos[i];
    @@ builtin/index-pack.c: static void fix_unresolved_deltas(struct hashfile *f)
    -        base->data = data;
    -        base->size = size;
    -        find_unresolved_deltas(base);
    ++
    ++        /*
    ++         * Add this as an object to the objects array and call
    ++         * threaded_second_pass() (which will pick up the added
    ++         * object).
    ++         */
    +        append_obj_to_pack(f, d->oid.hash, data, size, type);
    -+        do_work(NULL); /* will pick up new object in objects array (added by append_obj_to_pack) */
    ++        threaded_second_pass(NULL);
    ++        display_progress(progress, nr_resolved_deltas);
         }
         free(sorted_by_pos);
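For readers who want the shape of patch 7's scheme without reading the
whole diff, here is a standalone sketch (toy types, trivial work items,
and a plain pop instead of the patch's peek-and-take-a-child; not the
patch's code). The property it shares with threaded_second_pass() is
that the mutex guards only the taking and publishing of work, while the
expensive part runs unlocked:

  #include <pthread.h>
  #include <stdint.h>
  #include <stdio.h>

  #define NR_ROOTS 100
  #define NR_THREADS 3

  static pthread_mutex_t work_mutex = PTHREAD_MUTEX_INITIALIZER;
  static int work_stack[NR_ROOTS];
  static int stack_top;  /* number of entries currently on the stack */
  static int next_root;  /* next root to dispatch when the stack is empty */

  static void *worker(void *arg)
  {
      long id = (long)(intptr_t)arg;

      for (;;) {
          int item;

          pthread_mutex_lock(&work_mutex);
          /* Stack empty: look for another root to push, as in the patch. */
          if (!stack_top && next_root < NR_ROOTS)
              work_stack[stack_top++] = next_root++;
          if (!stack_top) {
              pthread_mutex_unlock(&work_mutex);
              return NULL; /* no work anywhere, so this thread is done */
          }
          item = work_stack[--stack_top];
          pthread_mutex_unlock(&work_mutex);

          /* The expensive part (delta resolution) happens unlocked. */
          printf("thread %ld resolved item %d\n", id, item);
      }
  }

  int main(void)
  {
      pthread_t threads[NR_THREADS];
      long i;

      for (i = 0; i < NR_THREADS; i++)
          pthread_create(&threads[i], NULL, worker, (void *)(intptr_t)i);
      for (i = 0; i < NR_THREADS; i++)
          pthread_join(threads[i], NULL);
      return 0;
  }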
-- 
2.23.0.866.gb869b98d4c-goog