I was recently reading the source code of git range-diff. Its steps are:

1. Use git log <range> to generate two collections of patches for ranges a and b.
2. Use a hash table over both sets to find commits whose patches are identical; these are recorded as "exact matches".
3. Build a cost matrix whose width and height are both a.nr + b.nr. The cost of an exactly matched pair is 0, and the cost between two patches that are not exact matches is the diffsize(a, b) of that pair.
4. Use the LAPJV algorithm to find the best matching between the remaining patches of the two sets.
5. Output the results.

My questions are:

1. In step 3, why is the matrix size (a.nr + b.nr) * (a.nr + b.nr) instead of a.nr * b.nr?
2. Why is cost(x, y) in the padding area, i.e. where x ∈ [a.nr, a.nr + b.nr) and y ∈ [0, b.nr), or x ∈ [0, a.nr) and y ∈ [b.nr, b.nr + a.nr), set to "diffsize(a or b) * creation_factor / 100 : COST_MAX"? What is the role of creation_factor? [1]
3. How does LAPJV work here? [2]

[1]: https://github.com/git/git/blob/725f57037d81e24eacfda6e59a19c60c0b4c8062/range-diff.c#L310
[2]: https://github.com/git/git/blob/725f57037d81e24eacfda6e59a19c60c0b4c8062/linear-assignment.c#L15

Thanks.
--
ZheNing Hu
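
P.S. To make questions 1 and 2 concrete, below is a small standalone C
sketch of how I currently read the cost matrix construction. It is only
a simplification of get_correspondences() in range-diff.c: the diffsize
values and patch sizes are made up, the exact-match / COST_MAX handling
is reduced to a comment, and no assignment is actually computed -- it
just shows the (a.nr + b.nr) x (a.nr + b.nr) layout and the
creation_factor scaling as I understand them. Please correct me if this
reading is wrong.

#include <stdio.h>

int main(void)
{
	/* Toy example: 2 commits in range a, 3 commits in range b. */
	int a_nr = 2, b_nr = 3, n = a_nr + b_nr;

	/* Made-up diffsize(a[i], b[j]) values between real patches. */
	int diffsize[2][3] = {
		{  10, 400, 350 },
		{ 500,  20, 300 },
	};
	/* Made-up per-patch diff sizes, used for the padding area. */
	int a_size[2] = { 100, 120 };
	int b_size[3] = { 110, 130, 90 };

	int creation_factor = 60; /* range-diff's default */
	int cost[5][5];           /* (a_nr + b_nr) x (a_nr + b_nr) */
	int i, j;

	for (i = 0; i < n; i++)
		for (j = 0; j < n; j++) {
			if (i < a_nr && j < b_nr)
				/*
				 * Real patch vs. real patch; in git this is
				 * 0 for an exact match and COST_MAX when one
				 * side is already exactly matched elsewhere.
				 */
				cost[i][j] = diffsize[i][j];
			else if (i < a_nr && j >= b_nr)
				/*
				 * a[i] paired with a dummy column: the cost
				 * of leaving a[i] unmatched ("deleted").
				 */
				cost[i][j] = a_size[i] * creation_factor / 100;
			else if (i >= a_nr && j < b_nr)
				/*
				 * b[j] paired with a dummy row: the cost of
				 * leaving b[j] unmatched ("created").
				 */
				cost[i][j] = b_size[j] * creation_factor / 100;
			else
				/* dummy vs. dummy: free */
				cost[i][j] = 0;
		}

	/*
	 * In git, compute_assignment() (the LAPJV solver from
	 * linear-assignment.c) would now pick the minimum-cost assignment
	 * over this square matrix. Here we just print it.
	 */
	for (i = 0; i < n; i++) {
		for (j = 0; j < n; j++)
			printf("%6d", cost[i][j]);
		printf("\n");
	}
	return 0;
}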