On Tue, Jun 6, 2017 at 3:05 PM, Jacob Keller <jacob.keller@xxxxxxxxx> wrote:
> On Tue, Jun 6, 2017 at 2:50 AM, Michael Haggerty <mhagger@xxxxxxxxxxxx> wrote:
>> On Mon, Jun 5, 2017 at 8:23 PM, Stefan Beller <sbeller@xxxxxxxxxx> wrote:
>>>
>>> > [...]
>>> > "git diff" has been taught to optionally paint new lines that are
>>> > the same as deleted lines elsewhere differently from genuinely new
>>> > lines.
>>> >
>>> > Are we happy with these changes?
>>
>>
>> I've been studiously ignoring this patch series due to lack of bandwidth.
>>
>>> [...]
>>> Things to come, but not in this series as they are more advanced:
>>>
>>> Discuss if a block/line needs a minimum requirement.
>>>
>>> When doing reviews with this series, a couple of lines such
>>> as "\t\t}" were marked as moved, which is not wrong as they
>>> really occurred in the text with opposing sign.
>>> But it was annoying as it drew my attention to just closing
>>> braces, which IMO is not the point of code review.
>>>
>>> To solve this issue I had the idea of a "minimum requirement", e.g.
>>> * at least 3 consecutive lines or
>>> * at least one line with at least 3 non-ws characters or
>>> * compute the entropy of a given moved block and if it is too low, do
>>>   not mark it up.
>>
>> Shooting from the hip here...
>>
>> It seems obvious that for a line to be marked as moved, a minimum
>> requirement is that
>>
>> 1. The line appears as both "+" and "-".
>>
>> That doesn't seem strong enough evidence though, and if that is the
>> only criterion, I would expect a lot of boilerplate lines like "\t\t}"
>> to be marked as moved. It seems like a lot of noise could be
>> eliminated by *also* requiring that
>>
>> 2a. The line doesn't appear elsewhere in the file(s) concerned.

Does 'elsewhere' mean in the opposing sign (+,-) only, or anywhere in
the diff (including ' ' context)?

This rule opens up the discussion on multi-copies, which I imagine
happens a lot in configuration files.
So say you have a prod and a staging environment; then you might be
tempted to make patches titled:

  "1. preparation: duplicate common code into prod and staging"
  "2. Make an actual change to staging"

For 1. you still want to see that there is a faithful copy, but we'd
have two postimages containing these lines.

Also what about de-duplication? I just stumbled upon edb0c72428
([PATCH] diff: consolidate test helper script pieces., 2005-05-31) for
unrelated reasons, but the move coloring of the same content multiple
times helped me there to focus on the relevant part.

>>
>> Rule (2a) would probably get rid of most boilerplate lines without
>> having to try to measure entropy.

But it would also get rid of good use cases if we are not very careful.
I intentionally left out (2a) as I am not yet sure how the move
detection for multiple occurrences in post- and preimage should work in
the desired case. The suppression of low-entropy closing braces might
be a side effect of just this. Or it can be treated separately.

>>
>> Maybe you are already using both criteria? I didn't see it in a quick
>> perusal of the code.
>>
>> OTOH, it would be silly to refuse to mark lines like "\t\t}" as moved
>> *only* because they appear elsewhere in the file(s). If you did so,
>> you would have gaps of supposedly non-moved lines in the middle of
>> moved blocks. This suggests marking as moved lines matching (1) and
>> (2a) but also lines matching (1) and the following:
>>
>> 2b. The line is adjacent to another line that is thought to have
>> moved from the same old location to the same new location.

This is what we do, a "block detection" by comparing "line runs"
against the current lines. Based on these line runs we detect one block
and color up adjacent blocks.

>>
>> Rule (2b) would be applied recursively, with the net effect being that
>> any line satisfying (1) and (2a) is allowed to carry along any
>> neighboring lines within the same "+"/"-" block even if they are not
>> unique.
So you are saying each block has to have at least one unique line?
That doesn't go well with (de-)duplication IMHO.

Thanks for your shot from the hip. I'll think about these rules more
to see if I can make sense of them for duplication still.

Thanks,
Stefan
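
For illustration only, rules (1), (2a) and (2b) as discussed above could
be modelled roughly as below. This is a toy Python sketch, not the code
in the patch series: the name `mark_moved`, the `(sign, text)` line
representation, and the reading of "elsewhere" in (2a) as "exactly once
with each sign" (with context lines optionally disqualifying a line too)
are all assumptions made for the example.

```python
from collections import Counter


def mark_moved(diff, context_counts=False):
    """Return the indices of diff lines a toy model would mark as moved.

    diff is a list of (sign, text) pairs with sign in "+", "-", " ".
    Hypothetical helper: it only models rules (1), (2a), (2b) from the
    thread and is not how git itself implements move detection.
    """
    plus = Counter(text for sign, text in diff if sign == "+")
    minus = Counter(text for sign, text in diff if sign == "-")
    # Assumption: optionally treat context lines as "elsewhere" for (2a).
    context = {t for s, t in diff if s == " "} if context_counts else set()
    minus_at = {text: i for i, (sign, text) in enumerate(diff) if sign == "-"}

    # Rules (1) + (2a): the text occurs exactly once as "+" and once as "-".
    seeds = [
        (i, minus_at[text])
        for i, (sign, text) in enumerate(diff)
        if sign == "+" and plus[text] == 1 and minus[text] == 1
        and text not in context
    ]

    moved = set()
    for p, m in seeds:
        moved.update((p, m))
        # Rule (2b): carry along neighbours that line up in both blocks,
        # walking backwards and forwards from the seed pair.
        for step in (-1, 1):
            a, b = p + step, m + step
            while (0 <= a < len(diff) and 0 <= b < len(diff)
                   and diff[a][0] == "+" and diff[b][0] == "-"
                   and diff[a][1] == diff[b][1]):
                moved.update((a, b))
                a += step
                b += step
    return moved


# foo() moves below bar(); baz() is genuinely new, so "{" and "}" are
# not unique among the "+" lines and can only be carried along by (2b).
example = [
    ("-", "int foo(void)"), ("-", "{"), ("-", "\treturn 1;"), ("-", "}"),
    (" ", "int bar(void)"), (" ", "{"), (" ", "\treturn 2;"), (" ", "}"),
    ("+", "int foo(void)"), ("+", "{"), ("+", "\treturn 1;"), ("+", "}"),
    ("+", "int baz(void)"), ("+", "{"), ("+", "\treturn 3;"), ("+", "}"),
]

# One removed block that reappears twice: no line is unique as "+".
duplicated = [
    ("-", "a"), ("-", "b"),
    ("+", "a"), ("+", "b"),
    ("+", "a"), ("+", "b"),
]

print(sorted(mark_moved(example)))     # [0, 1, 2, 3, 8, 9, 10, 11]
print(sorted(mark_moved(duplicated)))  # []
```

In the first example the braces of foo() get marked only because they
sit next to unique lines. In the second, nothing is marked at all, which
is the (de-)duplication concern raised above: with (2a) as a hard
requirement, a block copied to two places has no unique line to start
from.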