Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> writes: > On Thu, 9 Aug 2007, moe wrote: >> >> i made some tests on latest master branch >> (1.5.3.rc4.29.g74276) and it seems like git >> hits a wall somewhere above ~50k files. > > Good catch. Definitely not acceptable performance. > > We seem to spend a lot of our time in memcpy: [...] > Doing an ltrace on it shows tons and tons of: > > ... > strlen("35") > strlen("349") > calloc(1, 72) > memcpy(0x73034e, "10/", 3) > memcpy(0x730351, "349", 4) > memmove(0x2ab637f41e80, 0x2ab637f41e78, 781768) > ... > > but I haven't looked at where they come from yet. Ok, preaching to the pope here, but: moving memory is bad. Make sure data can stay where it starts. In particular: don't use realloc. And if you do, grow the size _exponentially_ (like a factor of 1.5). If you grow the size exponentially, at least the movery hits the algorithm with O(n lg n). If data stays put, even in badly scattered linear lists, we have O(n). If you grow realloc _linearly_ (constant size increment), then the algorithm is hit with O(n^2). Technically, basically _all_ operations of git can be done by doing list merges of presorted lists. The index can be kept sorted. All requests into the index can be collected in readdir order into one linear list, this gets sorted once using a merge sort with O(n lg n), then it gets merged with the index O(n+N). As long as the whole index is read in, it can't be done faster. It is not necessary to organize the read data into trees or more complicate structures: a single linear list is sufficient. One can use the hierarchical structure of a directory to shave off some part of the sorting cost, though, and it considerably will lessen memory impact (and copying costs) if a file/directory/tree entry can contain a relative file name and a "pointer to prefix" where the rest of the file path is to be found. Anyway, so much for some theory. Now let's look at bad points in the code, judging from your benchmarks. A grep for realloc is appaling. Let's see what is actually involved here. attr.c: struct git_attr *git_attr(const char *name, int len) { a->attr_nr = attr_nr++; git_attr_hash[pos] = a; check_all_attr = xrealloc(check_all_attr, sizeof(*check_all_attr) * attr_nr); [...] Full O(n^2) behavior of the worst kind (increment 1!). builtin-commit-tree.c: static void add_buffer(char **bufp, unsigned int *sizep, const char *fmt, ...) { size = *sizep; newsize = size + len + 1; alloc = (size + 32767) & ~32767; [size rounded to next 32k (inconsistent! needs to be size+1 rounded up)] buf = *bufp; if (newsize > alloc) { alloc = (newsize + 32767) & ~32767; [newsize rounded to next 32k] buf = xrealloc(buf, alloc); [O(n^2): constant increment. Important? No idea.] *bufp = buf; } *sizep = newsize - 1; memcpy(buf + size, one_line, len); } [...] int register_commit_graft(struct commit_graft *graft, int ignore_dups) { [...] if (commit_graft_alloc <= ++commit_graft_nr) { commit_graft_alloc = alloc_nr(commit_graft_alloc); commit_graft = xrealloc(commit_graft, sizeof(*commit_graft) * commit_graft_alloc); } if (pos < commit_graft_nr) memmove(commit_graft + pos + 1, commit_graft + pos, (commit_graft_nr - pos - 1) * sizeof(*commit_graft)); commit_graft[pos] = graft; return 0; } Eeek. Start with a linear list, not an array. objects.c: void add_object_array_with_mode(struct object *obj, const char *name, struct object_array *array, unsigned mode) { unsigned nr = array->nr; unsigned alloc = array->alloc; struct object_array_entry *objects = array->objects; if (nr >= alloc) { alloc = (alloc + 32) * 2; objects = xrealloc(objects, alloc * sizeof(*objects)); array->alloc = alloc; Constant increment, O(n^2). pathlist.c: static int add_entry(struct path_list *list, const char *path) { int exact_match; int index = get_entry_index(list, path, &exact_match); if (exact_match) return -1 - index; if (list->nr + 1 >= list->alloc) { list->alloc += 32; list->items = xrealloc(list->items, list->alloc * sizeof(struct path_list_item)); } Constant increment, O(n^2). That's just a cursory examination. In my opinion, pretty much every realloc should be replaced by some sort of list structure. That would be the nicest thing. I have sped up some awk scripts that built up argument lists by a factor of 100 by replacing a[index] = a[index] " " thenewstuff with a[index,nr[index]++] = thenewstuff and then never concatenating the strings, but just outputting them in a loop. Anyway, short of that, don't realloc by fixed increments, always use alloc_nr as soon as multiple reallocs are to be expected. And they certainly are in some of the above cited code passages. -- David Kastrup - To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html