Re: [JGIT] patch-id

"Shawn O. Pearce" <spearce@xxxxxxxxxxx> · Thu, 8 Oct 2009 09:28:05 -0700

Nasser Grainawi <nasser@xxxxxxxxxxxxxx> wrote:
> I'm trying to add a public getPatchId method to the jgit Patch class [...]
>
> It seems Patch does some statistical number gathering, but at no point does
> it store a 'slimmed-down' version of a patch.

It parses the patch to create FileHeader objects, one for each
file mentioned in the script.  Within each FileHeader there is a
HunkHeader object, one for each hunk present in the patch.  Within
each HunkHeader there is an EditList composed of Edit instances;
each Edit instance denotes a contiguous line range within that hunk.

Edit instances come in one of 3 forms:

  INSERT:  a run of + lines with no - lines
  DELETE:  a run of - lines with no + lines
  REPLACE: a mixture of - and + lines

and their type is actually determined by the line numbers attached
to them.  An INSERT has the same starting and ending line number on
the A side, but on the B side the ending line number is at least
one higher than the starting number.  DELETE is the reverse, and
REPLACE has both ending numbers higher than the starting number.

IIRC Edit uses 0 based offsets, so line 3 is actually position 2.
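
To make the line-number rule concrete, here is a minimal sketch in
plain Java.  EditSketch is a hypothetical stand-in, not JGit's actual
Edit class; the field names, the end-exclusive ranges and the extra
EMPTY case are assumptions made for the illustration.

```java
// Hypothetical stand-in for an Edit; not the real JGit class.
class EditSketch {
    enum Type { INSERT, DELETE, REPLACE, EMPTY }

    // Assumed: 0-based, end-exclusive line ranges on the A (old)
    // and B (new) sides, so "line 3" of a file is position 2.
    final int beginA, endA, beginB, endB;

    EditSketch(int beginA, int endA, int beginB, int endB) {
        this.beginA = beginA;
        this.endA = endA;
        this.beginB = beginB;
        this.endB = endB;
    }

    // The type is derived purely from the line numbers.
    Type getType() {
        boolean aEmpty = beginA == endA; // no lines taken from A
        boolean bEmpty = beginB == endB; // no lines added in B
        if (aEmpty && bEmpty) return Type.EMPTY; // assumed degenerate case
        if (aEmpty) return Type.INSERT;  // only + lines
        if (bEmpty) return Type.DELETE;  // only - lines
        return Type.REPLACE;             // both - and + lines
    }

    public static void main(String[] args) {
        // One line inserted before what is line 3 (position 2).
        System.out.println(new EditSketch(2, 2, 2, 3).getType());
        // One line deleted at position 2.
        System.out.println(new EditSketch(2, 3, 2, 2).getType());
        // One line replaced by two lines.
        System.out.println(new EditSketch(2, 3, 2, 4).getType());
    }
}
```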

These HunkHeader and Edit instances are only available on a text
patch; binary patches use a different representation for the
binary delta.  Combined diff patches (--cc format) also lack these
HunkHeader/Edit instances as we don't have a generic n-way patch
parser yet.

> I had the idea to just iterate
> over the FileHeader's and get the byte buffer of each, but I don't think
> those buffers have the parsed data.

The HunkHeader and Edit instances really don't have the actual
line data available to them; they only have the line numbers.
To generate a patch ID you'd need to get the line data too.

Worse, IIRC the patch ID generation in C git favors a 3 line context.

In theory you could modify FileHeader or HunkHeader to produce
a RawText that uses the underlying byte[] returned by getBuffer()
as the backing store, but create a specialized IntList which has the
actual file line numbers mapped to the positions in the patch script.
To do that you'd need to re-walk the patch, like the toEditList()
method in HunkHeader does.

Given that RawText you could feed it through something like
DiffFormatter to create a patch with 3 lines of context, and hash
the relevant bits.

But... that seems like a lot of work.

Also, there is a class in Gerrit Code Review called EditList (not
to be confused with JGit's EditList class!) that really should be
moved back over to JGit.  It has some useful routines for walking
through a patch as a series of iterations.

> Short of that, suggestions for how to go about acquiring/storing a parsed
> representation of the data with maximal existing code re-use would be
> appreciated.

I'm coming up short on suggestions right now.  I'm not seeing an
easy path to this without writing a bit of code.  I think you really
just need to walk the patch... :-\
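
For what it's worth, walking the patch by hand might look roughly
like the sketch below.  This is only an illustration in plain Java,
not JGit code and not the algorithm C git uses: the class and method
names are made up, and the rules (skip "diff " and "index " lines,
skip "@@" hunk headers so line numbers don't matter, drop all
whitespace, SHA-1 the rest) are assumptions loosely in the spirit of
a patch ID.  The result should not be expected to match the output
of git patch-id.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper; not part of JGit.
class PatchWalkSketch {
    // Walks the patch text line by line and hashes a normalized form.
    static String roughPatchId(String patch) {
        MessageDigest md;
        try {
            md = MessageDigest.getInstance("SHA-1");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 not available", e);
        }
        for (String line : patch.split("\n")) {
            // Assumed rule: these lines carry blob IDs or line numbers,
            // which vary between otherwise identical patches.
            if (line.startsWith("diff ") || line.startsWith("index ")
                    || line.startsWith("@@")) {
                continue;
            }
            // Assumed rule: ignore all whitespace within a line.
            String stripped = line.replaceAll("\\s+", "");
            if (stripped.isEmpty()) {
                continue;
            }
            md.update(stripped.getBytes(StandardCharsets.UTF_8));
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        String patch = "diff --git a/f b/f\n"
                + "index 1111111..2222222 100644\n"
                + "--- a/f\n"
                + "+++ b/f\n"
                + "@@ -1,3 +1,3 @@\n"
                + " ctx\n"
                + "-old line\n"
                + "+new line\n";
        System.out.println(roughPatchId(patch));
    }
}
```

Note that this hashes whatever context lines happen to be in the
script, so two patches for the same change made with different
amounts of context would still get different IDs.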

-- 
Shawn.