On Thu, 13 Feb 2025 at 22:41, Elijah Newren <newren@xxxxxxxxx> wrote:
>
> On Thu, Feb 13, 2025 at 1:01 AM Meet Soni <meetsoni3017@xxxxxxxxx> wrote:
> >
> > Previously, `get_unmerged()` used `string_list_insert()`, which has an
> > O(n^2) complexity due to shifting elements on each insertion. It also
> > called `string_list_lookup()` before insertion, which performs a binary
> > search in O(log n).
>
> Okay.
>
> > This combination made insertion costly, especially
> > for large index states, as each new entry required both a search and
> > potentially shifting many elements.
>
> Why does the combination make it costly? O(log n) + O(n^2) is still
> O(n^2), so I don't see why it matters to mention the combination.
> Could you clarify?
>
> Also, does it actually make it costly, or do you only suspect that it
> does? O(n^2) worst case sometimes behaves O(n) or O(n log n) in some
> cases. Since your commit message says "made insertion costly" instead
> of "might make insertion costly", I think that would suggest you have
> some performance numbers to back this up on some interesting real
> world repository. Do you? Can you share them?

Sorry, I should've specified: this patch is purely theoretical; I was
aiming for a trial-and-error kind of approach.

> > Replace `string_list_insert()` with `string_list_append()` to achieve
> > O(n) insertion. After all entries are added, sort the list in O(n log n)
> > and remove duplicates in O(n), reducing the overall complexity to
> > O(n log n).
>
> Okay.
>
> > This improves performance significantly for large datasets
>
> That's a big claim; it may be true, but without evidence I don't
> believe it for three reasons: (1) n here is the number of conflicts,
> not the number of files in the repo or the number of lines being
> merged. Thus, n is typically small.
> (2) Other O(n^2) behavior in
> merge-recursive likely drowns this particular codepath out, so any
> gains here just aren't going to be noticed, (3) After looking at the
> code and knowing the specialized structure of the index, I think that
> while string_list_insert() for n items in general is going to be
> O(n^2), it will likely functionally be O(n log n) for this particular
> code path, meaning you haven't actually improved the performance.
>
> > while maintaining correctness.
>
> More on that below.
>
> >
> > Signed-off-by: Meet Soni <meetsoni3017@xxxxxxxxx>
> > ---
> >  merge-recursive.c | 10 +++++-----
> >  1 file changed, 5 insertions(+), 5 deletions(-)
> >
> > diff --git a/merge-recursive.c b/merge-recursive.c
> > index 884ccf99a5..6165993429 100644
> > --- a/merge-recursive.c
> > +++ b/merge-recursive.c
> > @@ -547,15 +547,15 @@ static struct string_list *get_unmerged(struct index_state *istate)
> >  		if (!ce_stage(ce))
> >  			continue;
> >
> > -		item = string_list_lookup(unmerged, ce->name);
> > -		if (!item) {
> > -			item = string_list_insert(unmerged, ce->name);
> > -			item->util = xcalloc(1, sizeof(struct stage_data));
> > -		}
> > +		item = string_list_append(unmerged, ce->name);
> > +		item->util = xcalloc(1, sizeof(struct stage_data));
> > +
> >  		e = item->util;
> >  		e->stages[ce_stage(ce)].mode = ce->ce_mode;
> >  		oidcpy(&e->stages[ce_stage(ce)].oid, &ce->oid);
>
> Did you run any tests? I'm not sure you maintained correctness here.

I didn't run any tests -- I wanted to, but I wasn't sure how to do it
for this change. Since you suggested dropping this patch from the
series, I'll do that. But for similar changes in the future, how should
I go about testing them?

> >  	}
> > +	string_list_sort(unmerged);
> > +	string_list_remove_duplicates(unmerged, 1);
> >
> >  	return unmerged;
> >  }
> > --
> > 2.34.1
>
> (As a side note, due to the specialized structure of the input, I
> suspect this code could be modified to run in O(n), i.e.
> we could skip
> the string_list_lookup and the string_list_sort and the
> string_list_remove_duplicates... But, it'd make the code trickier, so
> it'd need to be carefully commented, the change would need to be
> justified, and it'd need to be carefully tested. Even if we weren't
> planning to delete this entire file, I suspect it's not possible to
> find a case justifying such a change without optimizing several other
> things in merge-recursive first, but optimizing those things probably
> results in a significant rewrite... which we've already done with
> merge-ort.)

Makes sense. Thank you for reviewing.

Meet