Re: Introduction and Wikipedia and Git Blame

"jamesmikedupont@xxxxxxxxxxxxxx" <jamesmikedupont@xxxxxxxxxxxxxx> · Fri, 16 Oct 2009 20:00:17 +0200

On Fri, Oct 16, 2009 at 7:04 PM, Junio C Hamano <gitster@xxxxxxxxx> wrote:
> Johannes Schindelin <Johannes.Schindelin@xxxxxx> writes:
>
>> Then I would make modified "texts" from the blob of the file in the
>> current revision and its parent revision, by inserting newlines after
>> every single byte (probably replacing the original newlines by other
>> values, such as \x01).
>>
>> The reason for this touchup is that the diff machinery in Git only handles
>> line-based diffs.
>>
>> Then you can parse the hunk headers, adjust the offsets accordingly,...
>
> I would agree that text converted to "byte-per-line" format would be the
> easiest way to re-use the diff engine, but if you go one more step, you
> can even reuse the blame engine as well.  You convert the text into
> "byte-in-hex-and-lf" (e.g. "AB C\n" becomes "41\n42\n20\n43\n0a\n") and
> feed it into existing blame and have it produce script-readable output,
> instead of feeding that to your reinvention of blame using diff engine.
>
> You would need to postprocess the computed result (either by diff or
> blame) to lay out the final text output in either case anyway, and making
> the existing blame engine do the work for you would be a better approach,
> I think.

Can you please tell me what the basic algorithm of the blame engine is?
I will have to start reading the code.
How can it tell who authored a given line?

I like the idea of one line per char, where even the newlines would be
encoded that way. If it is a Unicode char, it might be multibyte.

The script would get the blame per byte and then recode that into
something visible.
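
A minimal sketch of that round trip, assuming Python and the
"byte-in-hex-and-lf" layout quoted above. The function names are
hypothetical, and the per-byte labels passed to group_runs stand in for
whatever blame would report for each line; nothing here calls git.

```python
from itertools import groupby


def to_hex_lines(data):
    """Encode each byte as two hex digits followed by a newline."""
    return "".join(f"{b:02x}\n" for b in data)


def from_hex_lines(text):
    """Inverse of to_hex_lines: one hex pair per line back to raw bytes."""
    return bytes(int(line, 16) for line in text.splitlines())


def group_runs(labels):
    """Collapse per-byte labels into (start, end_exclusive, label) runs."""
    runs = []
    pos = 0
    for label, group in groupby(labels):
        length = len(list(group))
        runs.append((pos, pos + length, label))
        pos += length
    return runs


if __name__ == "__main__":
    encoded = to_hex_lines(b"AB C\n")
    print(encoded, end="")          # 41, 42, 20, 43, 0a on separate lines
    assert from_hex_lines(encoded) == b"AB C\n"
    # Hypothetical per-byte authors, as a blame post-processor might see them.
    print(group_runs(["alice", "alice", "bob", "alice", "alice"]))
```

A multibyte character simply becomes several lines (for example two
lines for a two-byte UTF-8 sequence), so the post-processing step has to
regroup bytes into characters before showing anything to the user.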

od, the octal dump utility, comes to mind:
od -t x1 -w1 (GNU od) will output the file one byte per line.

Now what about the ability to just pipe the file through some tool and
then run blame on that? The tool would start each line with the byte
offset, and blame would emit the blame for that offset along with the
text that follows it.

So for example, with two bytes per line:
od -w2 somefile :
///////////////////////////////
Offset       value
======= ======
0052752 065347
0052754 030356
0052756 035741
0052760 136302
0052762 035346

Here we see the lines are 0052762 - 0052760 = 2 bytes apart.

And then if you want wider diffs:
od somefile
////////////////////////////////////////////
Offset       values
======= ====== ====== ====== ====== ====== ====== ====== ======
0074520 051754 162613 057705 155520 047032 043654 175550 062704
0074540 164400 060340 123434 030350 040457 136010 042270 170525
0074560 165053 124677 125776 031370 000006 102076 060060 052434
0074600 176452 140240 074007 130113 100424 020010 130773 103467
0074620 052776 052421 021544 101357 120035 107562 072641 053636

Here we see the lines are 0074540 - 0074520 = 20 (octal), i.e. 16 bytes, apart.

That way the blame tool would not be concerned with the formatting or
content. Users can write filters however they want, and blame would
only expect a byte offset at the start of each line.
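
A minimal sketch of such a filter, assuming Python: it prefixes every
chunk with its byte offset in octal, loosely imitating the od layout
above but printing the bytes as hex pairs. The name offset_lines and the
exact output format are illustrative assumptions, not something blame
understands today.

```python
def offset_lines(data, width=1):
    """Return one text line per `width` bytes, prefixed by the octal offset."""
    if width < 1:
        raise ValueError("width must be at least 1")
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_bytes = " ".join(f"{b:02x}" for b in chunk)
        lines.append(f"{offset:07o} {hex_bytes}")
    return lines


if __name__ == "__main__":
    for line in offset_lines(b"ABCD", width=2):
        print(line)
    # 0000000 41 42
    # 0000002 43 44
```

Consecutive offsets differ by exactly the chosen width, so the width
sets the granularity at which blame would be reported.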

That way, we could write something like this :
grep -b x Test.xml
0:<?xml version="1.0" encoding="UTF-8"?>
39:<gpx
107:  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
Then we would get blames for those byte offsets, very simple.

We could reduce this down to: make blame take a list of byte positions.
grep -b '' Test.gpx (which matches every line) would be the standard
behavior, emitting the blame per line.
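
A minimal sketch of the lookup this implies, assuming Python: given the
file contents and a list of byte positions, find the 1-based line that
contains each position, which is what a line-based blame would then be
asked about. The helper names are hypothetical.

```python
from bisect import bisect_right


def line_starts(data):
    """Byte offsets at which each line of `data` begins."""
    starts = [0]
    for index, byte in enumerate(data):
        # A newline starts a new line only if more data follows it.
        if byte == 0x0A and index + 1 < len(data):
            starts.append(index + 1)
    return starts


def offsets_to_lines(data, offsets):
    """Map byte offsets to the 1-based numbers of the lines containing them."""
    starts = line_starts(data)
    lines = []
    for offset in offsets:
        if not 0 <= offset < len(data):
            raise ValueError(f"offset {offset} is outside the file")
        lines.append(bisect_right(starts, offset))
    return lines


if __name__ == "__main__":
    sample = b'<?xml version="1.0" encoding="UTF-8"?>\n<gpx\n'
    print(line_starts(sample))                    # [0, 39]
    print(offsets_to_lines(sample, [0, 38, 39]))  # [1, 1, 2]
```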

mike