On Thu, Sep 6, 2018 at 11:03 AM Junio C Hamano <gitster@xxxxxxxxx> wrote: > > Stefan Beller <sbeller@xxxxxxxxxx> writes: > > > Instead of sorting it after we created an unsorted list, we could insert > > correctly into the list. > > It is unclear what problem you are solving, especially with > subjunctive "could" there. We are creating an unsorted list and > then sorting it and you see it as a problem because it is just as > easy and efficient to do the insertion sort while building up the > list? (don't react and answer without reading all the way to the > end; I think I know what is going on). > > > As the unsorted append is in order of cache entry > > names, this is already sorted if names were equal to paths for submodules. > > That may be a statement of a fact, but it is unclear how that fact > relates to what the code is doing before this patch, or what the > code updated by this patch is doing. > > > As submodule names are often the same as their path, the input is sorted > > pretty well already, so let's just do the sort afterwards. > > It is unclear what (performance?) trade-off this senttence is trying > to make. It sounds as if it is claiming this: > > We can string_list_insert() to maintain sorted-ness of the > list as we find new items, or we can string_list_append() to > build an unsorted list and sort it at the end just once. > > To pick which one is more appropriate, we notice the fact > that we discover new items more or less in the already > sorted order. That makes "append then sort" more > appropriate. > > But is that reasoning sensible? > > I'd imagine that append-then-sort would always be more efficient > than insert-at-the-right-place-as-we-go, and the reason why we > sometimes would need to do the latter is when we need to look up > elements in the middle of building the list (e.g. we may want to > de-dup, which requires us to look up a name from the ones we > collected so far). If we come across a (mostly) sorted list, then the insert-at-the-right-place we'd come across the best case of insertion sort which is O(n), which sounds better than append-then-sort as our sorting is a merge sort, which has O(n log n) even in its best case (and needs to copy stuff into a temp buffer and back). By having the submodules named after its path, I strongly suspect we have a mostly sorted list in nearly all cases except some really interesting corner cases out there. > And in this application, calculate_changed_submodule_paths() > discover paths by calling collect_changed_submodules() which finds a > mapping <submodule name, oid of commits> into a list sorted by > submodule name, and then goes through that list and builds a list of > submodules paths (which could be different from submodule names) by > appending. Only after this list is fully built, get_next_submodule() > gets called, so making the latter use string_list_lookup() that assumes > a sorted list is safe if we built the list by append-then-sort (iow, > sortedness while building the list does not matter). > > Having analysed all that, I find it somewhat iffy that _append() is > used there in the calculate_changed_submodule_paths() function. Note that this is fixed in the later patch "submodule.c: do not copy around submodule list" > It > would cause the resulting changed_submodule_names list to hold the > same name twice (or more), This would be possible if there is a submodule at path A and another submodule (at a different path) named "A", as we'll try hard to collect names, but are also okay with path as we want to keep supporting the historical use case of submodules. > but I do not know if that would pose a > problem to the consumer of the list. Using "accumulate then sort > before calling look-up" would not change it as string_list_sort() > would not dedup, so I do not think this patch would introduce a new > problem, though. Yes, that is true, so we'd want to extend the message above to mention the potential duplicates. Thanks, Stefan