[JGIT PATCH 2/2 v2] Add getScriptText functions to obtain the plain-text version of a patch

"Shawn O. Pearce" <spearce@xxxxxxxxxxx> · Wed, 17 Dec 2008 12:13:26 -0800

The conversion from byte[] to String is performed one file at a time,
in case the patch converts the file's character encoding.  For
simplicity we still assume UTF-8 as the default encoding for any
content, but eventually we should support using the .gitattributes
encoding property when performing this conversion.

Signed-off-by: Shawn O. Pearce <spearce@xxxxxxxxxxx>
---
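
  A minimal sketch of the per-file conversion described in the commit
  message, not the code in this patch: hunk lines are routed to an
  "old" and a "new" byte stream by their first byte, and each stream
  is then decoded with its own character set.  The class and method
  names (PerSideDecodeSketch, decodeSides) are hypothetical, and the
  input is assumed to be already split into lines.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not part of JGit.
final class PerSideDecodeSketch {
    /**
     * Split two-way hunk lines into old and new file content and decode
     * each side with its own charset. Index 0 is the old side, 1 the new.
     */
    static String[] decodeSides(byte[][] lines, Charset oldCs, Charset newCs) {
        ByteArrayOutputStream oldSide = new ByteArrayOutputStream();
        ByteArrayOutputStream newSide = new ByteArrayOutputStream();
        for (byte[] line : lines) {
            if (line.length == 0)
                continue;
            switch (line[0]) {
            case ' ': // context line: present in both files
                oldSide.write(line, 0, line.length);
                newSide.write(line, 0, line.length);
                break;
            case '-': // removed line: old file only
                oldSide.write(line, 0, line.length);
                break;
            case '+': // added line: new file only
                newSide.write(line, 0, line.length);
                break;
            default: // anything else is ignored in this sketch
                break;
            }
        }
        return new String[] {
                new String(oldSide.toByteArray(), oldCs),
                new String(newSide.toByteArray(), newCs) };
    }

    public static void main(String[] args) {
        byte[][] lines = {
                " a\n".getBytes(StandardCharsets.US_ASCII),
                "-\u00c5ngstr\u00f6m\n".getBytes(StandardCharsets.ISO_8859_1),
                "+\u00c5ngstr\u00f6m\n".getBytes(StandardCharsets.UTF_8) };
        String[] sides = decodeSides(lines,
                StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        // Both sides decode to the same word despite different byte encodings.
        System.out.print(sides[0]);
        System.out.print(sides[1]);
    }
}
```

  Decoding each side as a whole, rather than line by line, gives the
  decoder the full context of that file's bytes.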
  Robin Rosenberg <robin.rosenberg.lists@xxxxxxxxxx> wrote:
  > > For usefulness we must be able to pass the encoding from outside,
  > > e.g. the encoding Eclipse uses, which often is not UTF-8.
  >
  > It's even worse. You should probably do the encoding guess on the whole
  > patch, or per file and not per line, to make success possible at all. Reading
  > and writing as ISO-8859-1 will always work as that is just padding every
  > byte with NUL on reading and dropping it on writing. I.e. if you convert
  > to char at all...
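
  The ISO-8859-1 observation quoted above can be checked with a small
  standalone program.  This is only an illustration using the standard
  java.nio.charset classes; the class name Latin1RoundTrip and its
  helpers are hypothetical and not part of JGit.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration only; not part of JGit.
final class Latin1RoundTrip {
    /** True if decoding and re-encoding with cs reproduces the input bytes. */
    static boolean roundTrips(byte[] raw, Charset cs) {
        String text = new String(raw, cs);
        return Arrays.equals(raw, text.getBytes(cs));
    }

    /** Every possible byte value, 0x00 through 0xff, once each. */
    static byte[] allByteValues() {
        byte[] all = new byte[256];
        for (int i = 0; i < all.length; i++)
            all[i] = (byte) i;
        return all;
    }

    public static void main(String[] args) {
        byte[] all = allByteValues();
        // ISO-8859-1 maps each byte to the char with the same value.
        System.out.println(roundTrips(all, StandardCharsets.ISO_8859_1)); // true
        // UTF-8 replaces malformed sequences, so the bytes do not survive.
        System.out.println(roundTrips(all, StandardCharsets.UTF_8)); // false
    }
}
```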

  So this patch does the "whole file" thing.  But there is a
  fast-path in getScriptText to try and bypass the multiple copies
  we have to make in order to shovel the entire file into the
  CharsetDecoder just to read the patch.  It isn't common to see
  a character set conversion patch, so the fast case of decoding
  the whole patch text at once should happen most of the time.
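
  Roughly, the fast path amounts to the following.  This is a sketch
  under simplifying assumptions, not the code in the patch: it handles
  only two character sets, and unlike decodeNoFallback (which, per its
  Javadoc, also tries the platform default) it tests a single charset.
  The names FastPathSketch, decodeStrict and tryFastPath are
  hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not part of JGit.
final class FastPathSketch {
    /** Strict decode: throws on bad input instead of substituting U+FFFD. */
    static String decodeStrict(Charset cs, byte[] buf, int start, int end)
            throws CharacterCodingException {
        return cs.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(buf, start, end - start))
                .toString();
    }

    /**
     * Decode the whole patch text at once when both sides share a charset.
     * Returns null when the caller must use the slower per-file path.
     */
    static String tryFastPath(Charset oldCs, Charset newCs, byte[] buf) {
        if (!oldCs.equals(newCs))
            return null; // charset conversion patch: sides must be split
        try {
            return decodeStrict(oldCs, buf, 0, buf.length);
        } catch (CharacterCodingException e) {
            return null; // bytes are not valid in the guessed charset
        }
    }

    public static void main(String[] args) {
        byte[] latin1 = "-\u00c5ngstr\u00f6m\n"
                .getBytes(StandardCharsets.ISO_8859_1);
        // Guessing UTF-8 for ISO-8859-1 bytes fails the strict decode.
        System.out.println(tryFastPath(StandardCharsets.UTF_8,
                StandardCharsets.UTF_8, latin1) == null);
        // The correct guess decodes the entire buffer in one pass.
        System.out.print(tryFastPath(StandardCharsets.ISO_8859_1,
                StandardCharsets.ISO_8859_1, latin1));
    }
}
```

  Reporting malformed input is what makes the fast path safe to try
  first: a wrong guess is detected and the slower path takes over
  instead of silently producing replacement characters.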

 .../jgit/patch/testGetText_BothISO88591.patch      |   21 +++
 .../spearce/jgit/patch/testGetText_Convert.patch   |   21 +++
 .../spearce/jgit/patch/testGetText_DiffCc.patch    |   13 ++
 .../spearce/jgit/patch/testGetText_NoBinary.patch  |    4 +
 .../tst/org/spearce/jgit/patch/GetTextTest.java    |  142 ++++++++++++++++++++
 .../org/spearce/jgit/patch/CombinedFileHeader.java |   27 ++++
 .../org/spearce/jgit/patch/CombinedHunkHeader.java |  127 +++++++++++++++++
 .../src/org/spearce/jgit/patch/FileHeader.java     |  116 ++++++++++++++++
 .../src/org/spearce/jgit/patch/HunkHeader.java     |   86 ++++++++++++
 .../src/org/spearce/jgit/util/RawParseUtils.java   |   57 ++++++++-
 10 files changed, 611 insertions(+), 3 deletions(-)
 create mode 100644 org.spearce.jgit.test/tst-rsrc/org/spearce/jgit/patch/testGetText_BothISO88591.patch
 create mode 100644 org.spearce.jgit.test/tst-rsrc/org/spearce/jgit/patch/testGetText_Convert.patch
 create mode 100644 org.spearce.jgit.test/tst-rsrc/org/spearce/jgit/patch/testGetText_DiffCc.patch
 create mode 100644 org.spearce.jgit.test/tst-rsrc/org/spearce/jgit/patch/testGetText_NoBinary.patch
 create mode 100644 org.spearce.jgit.test/tst/org/spearce/jgit/patch/GetTextTest.java

diff --git a/org.spearce.jgit.test/tst-rsrc/org/spearce/jgit/patch/testGetText_BothISO88591.patch b/org.spearce.jgit.test/tst-rsrc/org/spearce/jgit/patch/testGetText_BothISO88591.patch
new file mode 100644
index 0000000..8224fcc
--- /dev/null
+++ b/org.spearce.jgit.test/tst-rsrc/org/spearce/jgit/patch/testGetText_BothISO88591.patch
@@ -0,0 +1,21 @@
+diff --git a/X b/X
+index 014ef30..8c80a36 100644
+--- a/X
++++ b/X
+@@ -1,7 +1,7 @@
+ a
+ b
+ c
+-�ngstr�m
++line 4 �ngstr�m
+ d
+ e
+ f
+@@ -13,6 +13,6 @@ k
+ l
+ m
+ n
+-�ngstr�m
++�ngstr�m; line 16
+ o
+ p
diff --git a/org.spearce.jgit.test/tst-rsrc/org/spearce/jgit/patch/testGetText_Convert.patch b/org.spearce.jgit.test/tst-rsrc/org/spearce/jgit/patch/testGetText_Convert.patch
new file mode 100644
index 0000000..a43fef5
--- /dev/null
+++ b/org.spearce.jgit.test/tst-rsrc/org/spearce/jgit/patch/testGetText_Convert.patch
@@ -0,0 +1,21 @@
+diff --git a/X b/X
+index 014ef30..209db0d 100644
+--- a/X
++++ b/X
+@@ -1,7 +1,7 @@
+ a
+ b
+ c
+-�ngstr�m
++Ångström
+ d
+ e
+ f
+@@ -13,6 +13,6 @@ k
+ l
+ m
+ n
+-�ngstr�m
++Ångström
+ o
+ p
diff --git a/org.spearce.jgit.test/tst-rsrc/org/spearce/jgit/patch/testGetText_DiffCc.patch b/org.spearce.jgit.test/tst-rsrc/org/spearce/jgit/patch/testGetText_DiffCc.patch
new file mode 100644
index 0000000..3f74a52
--- /dev/null
+++ b/org.spearce.jgit.test/tst-rsrc/org/spearce/jgit/patch/testGetText_DiffCc.patch
@@ -0,0 +1,13 @@
+diff --cc X
+index bdfc9f4,209db0d..474bd69
+--- a/X
++++ b/X
+@@@ -1,7 -1,7 +1,7 @@@
+  a
+--b
+  c
+ +test �ngstr�m
++ Ångström
+  d
+  e
+  f
diff --git a/org.spearce.jgit.test/tst-rsrc/org/spearce/jgit/patch/testGetText_NoBinary.patch b/org.spearce.jgit.test/tst-rsrc/org/spearce/jgit/patch/testGetText_NoBinary.patch
new file mode 100644
index 0000000..e4968dc
--- /dev/null
+++ b/org.spearce.jgit.test/tst-rsrc/org/spearce/jgit/patch/testGetText_NoBinary.patch
@@ -0,0 +1,4 @@
+diff --git a/org.spearce.egit.ui/icons/toolbar/fetchd.png b/org.spearce.egit.ui/icons/toolbar/fetchd.png
+new file mode 100644
+index 0000000..4433c54
+Binary files /dev/null and b/org.spearce.egit.ui/icons/toolbar/fetchd.png differ
diff --git a/org.spearce.jgit.test/tst/org/spearce/jgit/patch/GetTextTest.java b/org.spearce.jgit.test/tst/org/spearce/jgit/patch/GetTextTest.java
new file mode 100644
index 0000000..04810be
--- /dev/null
+++ b/org.spearce.jgit.test/tst/org/spearce/jgit/patch/GetTextTest.java
@@ -0,0 +1,142 @@
+/*
+ * Copyright (C) 2008, Google Inc.
+ *
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ * - Redistributions of source code must retain the above copyright
+ *   notice, this list of conditions and the following disclaimer.
+ *
+ * - Redistributions in binary form must reproduce the above
+ *   copyright notice, this list of conditions and the following
+ *   disclaimer in the documentation and/or other materials provided
+ *   with the distribution.
+ *
+ * - Neither the name of the Git Development Community nor the
+ *   names of its contributors may be used to endorse or promote
+ *   products derived from this software without specific prior
+ *   written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND
+ * CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES,
+ * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
+ * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+ * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
+ * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
+ * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+ * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
+ * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
+ * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
+ * ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+package org.spearce.jgit.patch;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.InputStreamReader;
+import java.nio.charset.Charset;
+
+import junit.framework.TestCase;
+
+public class GetTextTest extends TestCase {
+	public void testGetText_BothISO88591() throws IOException {
+		final Charset cs = Charset.forName("ISO-8859-1");
+		final Patch p = parseTestPatchFile();
+		assertTrue(p.getErrors().isEmpty());
+		assertEquals(1, p.getFiles().size());
+		final FileHeader fh = p.getFiles().get(0);
+		assertEquals(2, fh.getHunks().size());
+		assertEquals(readTestPatchFile(cs), fh.getScriptText(cs, cs));
+	}
+
+	public void testGetText_NoBinary() throws IOException {
+		final Charset cs = Charset.forName("ISO-8859-1");
+		final Patch p = parseTestPatchFile();
+		assertTrue(p.getErrors().isEmpty());
+		assertEquals(1, p.getFiles().size());
+		final FileHeader fh = p.getFiles().get(0);
+		assertEquals(0, fh.getHunks().size());
+		assertEquals(readTestPatchFile(cs), fh.getScriptText(cs, cs));
+	}
+
+	public void testGetText_Convert() throws IOException {
+		final Charset csOld = Charset.forName("ISO-8859-1");
+		final Charset csNew = Charset.forName("UTF-8");
+		final Patch p = parseTestPatchFile();
+		assertTrue(p.getErrors().isEmpty());
+		assertEquals(1, p.getFiles().size());
+		final FileHeader fh = p.getFiles().get(0);
+		assertEquals(2, fh.getHunks().size());
+
+		// Read the original file as ISO-8859-1 and fix up the one place
+		// where we changed the character encoding. That makes the exp
+		// string match what we really expect to get back.
+		//
+		String exp = readTestPatchFile(csOld);
+		exp = exp.replace("\303\205ngstr\303\266m", "\u00c5ngstr\u00f6m");
+
+		assertEquals(exp, fh.getScriptText(csOld, csNew));
+	}
+
+	public void testGetText_DiffCc() throws IOException {
+		final Charset csOld = Charset.forName("ISO-8859-1");
+		final Charset csNew = Charset.forName("UTF-8");
+		final Patch p = parseTestPatchFile();
+		assertTrue(p.getErrors().isEmpty());
+		assertEquals(1, p.getFiles().size());
+		final CombinedFileHeader fh = (CombinedFileHeader) p.getFiles().get(0);
+		assertEquals(1, fh.getHunks().size());
+
+		// Read the original file as ISO-8859-1 and fix up the one place
+		// where we changed the character encoding. That makes the exp
+		// string match what we really expect to get back.
+		//
+		String exp = readTestPatchFile(csOld);
+		exp = exp.replace("\303\205ngstr\303\266m", "\u00c5ngstr\u00f6m");
+
+		assertEquals(exp, fh
+				.getScriptText(new Charset[] { csNew, csOld, csNew }));
+	}
+
+	private Patch parseTestPatchFile() throws IOException {
+		final String patchFile = getName() + ".patch";
+		final InputStream in = getClass().getResourceAsStream(patchFile);
+		if (in == null) {
+			fail("No " + patchFile + " test vector");
+			return null; // Never happens
+		}
+		try {
+			final Patch p = new Patch();
+			p.parse(in);
+			return p;
+		} finally {
+			in.close();
+		}
+	}
+
+	private String readTestPatchFile(final Charset cs) throws IOException {
+		final String patchFile = getName() + ".patch";
+		final InputStream in = getClass().getResourceAsStream(patchFile);
+		if (in == null) {
+			fail("No " + patchFile + " test vector");
+			return null; // Never happens
+		}
+		try {
+			final InputStreamReader r = new InputStreamReader(in, cs);
+			char[] tmp = new char[2048];
+			final StringBuilder s = new StringBuilder();
+			int n;
+			while ((n = r.read(tmp)) > 0)
+				s.append(tmp, 0, n);
+			return s.toString();
+		} finally {
+			in.close();
+		}
+	}
+}
diff --git a/org.spearce.jgit/src/org/spearce/jgit/patch/CombinedFileHeader.java b/org.spearce.jgit/src/org/spearce/jgit/patch/CombinedFileHeader.java
index 3ccc418..a27e0f8 100644
--- a/org.spearce.jgit/src/org/spearce/jgit/patch/CombinedFileHeader.java
+++ b/org.spearce.jgit/src/org/spearce/jgit/patch/CombinedFileHeader.java
@@ -41,7 +41,9 @@
 import static org.spearce.jgit.util.RawParseUtils.match;
 import static org.spearce.jgit.util.RawParseUtils.nextLF;
 
+import java.nio.charset.Charset;
 import java.util.ArrayList;
+import java.util.Arrays;
 import java.util.List;
 
 import org.spearce.jgit.lib.AbbreviatedObjectId;
@@ -111,6 +113,31 @@ public AbbreviatedObjectId getOldId(final int nthParent) {
 		return oldIds[nthParent];
 	}
 
+	@Override
+	public String getScriptText(final Charset ocs, final Charset ncs) {
+		final Charset[] cs = new Charset[getParentCount() + 1];
+		Arrays.fill(cs, ocs);
+		cs[getParentCount()] = ncs;
+		return getScriptText(cs);
+	}
+
+	/**
+	 * Convert the patch script for this file into a string.
+	 * 
+	 * @param charsetGuess
+	 *            optional array to suggest the character set to use when
+	 *            decoding each file's lines. If supplied the array must have a
+	 *            length of <code>{@link #getParentCount()} + 1</code>
+	 *            representing the old revision character sets and the new
+	 *            revision character set.
+	 * @return the patch script, as a Unicode string.
+	 */
+	@Override
+	public String getScriptText(final Charset[] charsetGuess) {
+		return super.getScriptText(charsetGuess);
+	}
+
+	@Override
 	int parseGitHeaders(int ptr, final int end) {
 		while (ptr < end) {
 			final int eol = nextLF(buf, ptr);
diff --git a/org.spearce.jgit/src/org/spearce/jgit/patch/CombinedHunkHeader.java b/org.spearce.jgit/src/org/spearce/jgit/patch/CombinedHunkHeader.java
index 3e5c465..83ea681 100644
--- a/org.spearce.jgit/src/org/spearce/jgit/patch/CombinedHunkHeader.java
+++ b/org.spearce.jgit/src/org/spearce/jgit/patch/CombinedHunkHeader.java
@@ -40,6 +40,9 @@
 import static org.spearce.jgit.util.RawParseUtils.nextLF;
 import static org.spearce.jgit.util.RawParseUtils.parseBase10;
 
+import java.io.IOException;
+import java.io.OutputStream;
+
 import org.spearce.jgit.lib.AbbreviatedObjectId;
 import org.spearce.jgit.util.MutableInteger;
 
@@ -188,4 +191,128 @@ int parseBody(final Patch script, final int end) {
 
 		return c;
 	}
+
+	@Override
+	void extractFileLines(final OutputStream[] out) throws IOException {
+		final byte[] buf = file.buf;
+		int ptr = startOffset;
+		int eol = nextLF(buf, ptr);
+		if (endOffset <= eol)
+			return;
+
+		// Treat the hunk header as though it were from the ancestor,
+		// as it may have a function header appearing after it which
+		// was copied out of the ancestor file.
+		//
+		out[0].write(buf, ptr, eol - ptr);
+
+		SCAN: for (ptr = eol; ptr < endOffset; ptr = eol) {
+			eol = nextLF(buf, ptr);
+
+			if (eol - ptr < old.length + 1) {
+				// Line isn't long enough to mention the state of each
+				// ancestor. It must be the end of the hunk.
+				break SCAN;
+			}
+
+			switch (buf[ptr]) {
+			case ' ':
+			case '-':
+			case '+':
+				break;
+
+			default:
+				// Line can't possibly be part of this hunk; the first
+				// ancestor information isn't recognizable.
+				//
+				break SCAN;
+			}
+
+			int delcnt = 0;
+			for (int ancestor = 0; ancestor < old.length; ancestor++) {
+				switch (buf[ptr + ancestor]) {
+				case '-':
+					delcnt++;
+					out[ancestor].write(buf, ptr, eol - ptr);
+					continue;
+
+				case ' ':
+					out[ancestor].write(buf, ptr, eol - ptr);
+					continue;
+
+				case '+':
+					continue;
+
+				default:
+					break SCAN;
+				}
+			}
+			if (delcnt < old.length) {
+				// This line appears in the new file if it wasn't deleted
+				// relative to all ancestors.
+				//
+				out[old.length].write(buf, ptr, eol - ptr);
+			}
+		}
+	}
+
+	void extractFileLines(final StringBuilder sb, final String[] text,
+			final int[] offsets) {
+		final byte[] buf = file.buf;
+		int ptr = startOffset;
+		int eol = nextLF(buf, ptr);
+		if (endOffset <= eol)
+			return;
+		copyLine(sb, text, offsets, 0);
+		SCAN: for (ptr = eol; ptr < endOffset; ptr = eol) {
+			eol = nextLF(buf, ptr);
+
+			if (eol - ptr < old.length + 1) {
+				// Line isn't long enough to mention the state of each
+				// ancestor. It must be the end of the hunk.
+				break SCAN;
+			}
+
+			switch (buf[ptr]) {
+			case ' ':
+			case '-':
+			case '+':
+				break;
+
+			default:
+				// Line can't possibly be part of this hunk; the first
+				// ancestor information isn't recognizable.
+				//
+				break SCAN;
+			}
+
+			boolean copied = false;
+			for (int ancestor = 0; ancestor < old.length; ancestor++) {
+				switch (buf[ptr + ancestor]) {
+				case ' ':
+				case '-':
+					if (copied)
+						skipLine(text, offsets, ancestor);
+					else {
+						copyLine(sb, text, offsets, ancestor);
+						copied = true;
+					}
+					continue;
+
+				case '+':
+					continue;
+
+				default:
+					break SCAN;
+				}
+			}
+			if (!copied) {
+				// If none of the ancestors caused the copy then this line
+				// must be new across the board, so it only appears in the
+				// text of the new file.
+				//
+				copyLine(sb, text, offsets, old.length);
+			}
+		}
+	}
 }
diff --git a/org.spearce.jgit/src/org/spearce/jgit/patch/FileHeader.java b/org.spearce.jgit/src/org/spearce/jgit/patch/FileHeader.java
index c91f80e..66c785f 100644
--- a/org.spearce.jgit/src/org/spearce/jgit/patch/FileHeader.java
+++ b/org.spearce.jgit/src/org/spearce/jgit/patch/FileHeader.java
@@ -39,10 +39,15 @@
 
 import static org.spearce.jgit.lib.Constants.encodeASCII;
 import static org.spearce.jgit.util.RawParseUtils.decode;
+import static org.spearce.jgit.util.RawParseUtils.decodeNoFallback;
+import static org.spearce.jgit.util.RawParseUtils.extractBinaryString;
 import static org.spearce.jgit.util.RawParseUtils.match;
 import static org.spearce.jgit.util.RawParseUtils.nextLF;
 import static org.spearce.jgit.util.RawParseUtils.parseBase10;
 
+import java.io.IOException;
+import java.nio.charset.CharacterCodingException;
+import java.nio.charset.Charset;
 import java.util.ArrayList;
 import java.util.Collections;
 import java.util.List;
@@ -51,6 +56,8 @@
 import org.spearce.jgit.lib.Constants;
 import org.spearce.jgit.lib.FileMode;
 import org.spearce.jgit.util.QuotedString;
+import org.spearce.jgit.util.RawParseUtils;
+import org.spearce.jgit.util.TemporaryBuffer;
 
 /** Patch header describing an action for a single file path. */
 public class FileHeader {
@@ -189,6 +196,115 @@ public int getEndOffset() {
 	}
 
 	/**
+	 * Convert the patch script for this file into a string.
+	 * <p>
+	 * The default character encoding ({@link Constants#CHARSET}) is assumed for
+	 * both the old and new files.
+	 * 
+	 * @return the patch script, as a Unicode string.
+	 */
+	public String getScriptText() {
+		return getScriptText(null, null);
+	}
+
+	/**
+	 * Convert the patch script for this file into a string.
+	 * 
+	 * @param oldCharset
+	 *            hint character set to decode the old lines with.
+	 * @param newCharset
+	 *            hint character set to decode the new lines with.
+	 * @return the patch script, as a Unicode string.
+	 */
+	public String getScriptText(Charset oldCharset, Charset newCharset) {
+		return getScriptText(new Charset[] { oldCharset, newCharset });
+	}
+
+	protected String getScriptText(Charset[] charsetGuess) {
+		if (getHunks().isEmpty()) {
+			// If we have no hunks then we can safely assume the entire
+			// patch is a binary style patch, or a meta-data only style
+			// patch. Either way the encoding of the headers should be
+			// strictly 7-bit US-ASCII and the body is either 7-bit ASCII
+			// (due to the base 85 encoding used for a BinaryHunk) or is
+			// arbitrary noise we have chosen to ignore and not understand
+			// (e.g. the message "Binary files ... differ").
+			//
+			return extractBinaryString(buf, startOffset, endOffset);
+		}
+
+		if (charsetGuess != null && charsetGuess.length != getParentCount() + 1)
+			throw new IllegalArgumentException("Expected "
+					+ (getParentCount() + 1) + " character encoding guesses");
+
+		if (trySimpleConversion(charsetGuess)) {
+			Charset cs = charsetGuess != null ? charsetGuess[0] : null;
+			if (cs == null)
+				cs = Constants.CHARSET;
+			try {
+				return decodeNoFallback(cs, buf, startOffset, endOffset);
+			} catch (CharacterCodingException cee) {
+				// Try the much slower, more-memory intensive version which
+				// can handle a character set conversion patch.
+			}
+		}
+
+		final StringBuilder r = new StringBuilder(endOffset - startOffset);
+
+		// Always treat the headers as US-ASCII; Git file names are encoded
+		// in a C style escape if any character has the high-bit set.
+		//
+		final int hdrEnd = getHunks().get(0).getStartOffset();
+		for (int ptr = startOffset; ptr < hdrEnd;) {
+			final int eol = Math.min(hdrEnd, nextLF(buf, ptr));
+			r.append(extractBinaryString(buf, ptr, eol));
+			ptr = eol;
+		}
+
+		final String[] files = extractFileLines(charsetGuess);
+		final int[] offsets = new int[files.length];
+		for (final HunkHeader h : getHunks())
+			h.extractFileLines(r, files, offsets);
+		return r.toString();
+	}
+
+	private static boolean trySimpleConversion(final Charset[] charsetGuess) {
+		if (charsetGuess == null)
+			return true;
+		for (int i = 1; i < charsetGuess.length; i++) {
+			if (charsetGuess[i] != charsetGuess[0])
+				return false;
+		}
+		return true;
+	}
+
+	private String[] extractFileLines(final Charset[] csGuess) {
+		final TemporaryBuffer[] tmp = new TemporaryBuffer[getParentCount() + 1];
+		try {
+			for (int i = 0; i < tmp.length; i++)
+				tmp[i] = new TemporaryBuffer();
+			for (final HunkHeader h : getHunks())
+				h.extractFileLines(tmp);
+
+			final String[] r = new String[tmp.length];
+			for (int i = 0; i < tmp.length; i++) {
+				Charset cs = csGuess != null ? csGuess[i] : null;
+				if (cs == null)
+					cs = Constants.CHARSET;
+				r[i] = RawParseUtils.decode(cs, tmp[i].toByteArray());
+			}
+			return r;
+		} catch (IOException ioe) {
+			throw new RuntimeException("Cannot convert script to text", ioe);
+		} finally {
+			for (final TemporaryBuffer b : tmp) {
+				if (b != null)
+					b.destroy();
+			}
+		}
+	}
+
+	/**
 	 * Get the old name associated with this file.
 	 * <p>
 	 * The meaning of the old name can differ depending on the semantic meaning
diff --git a/org.spearce.jgit/src/org/spearce/jgit/patch/HunkHeader.java b/org.spearce.jgit/src/org/spearce/jgit/patch/HunkHeader.java
index 12c670d..fc30311 100644
--- a/org.spearce.jgit/src/org/spearce/jgit/patch/HunkHeader.java
+++ b/org.spearce.jgit/src/org/spearce/jgit/patch/HunkHeader.java
@@ -41,6 +41,9 @@
 import static org.spearce.jgit.util.RawParseUtils.nextLF;
 import static org.spearce.jgit.util.RawParseUtils.parseBase10;
 
+import java.io.IOException;
+import java.io.OutputStream;
+
 import org.spearce.jgit.lib.AbbreviatedObjectId;
 import org.spearce.jgit.util.MutableInteger;
 
@@ -240,4 +243,87 @@ int parseBody(final Patch script, final int end) {
 
 		return c;
 	}
+
+	void extractFileLines(final OutputStream[] out) throws IOException {
+		final byte[] buf = file.buf;
+		int ptr = startOffset;
+		int eol = nextLF(buf, ptr);
+		if (endOffset <= eol)
+			return;
+
+		// Treat the hunk header as though it were from the ancestor,
+		// as it may have a function header appearing after it which
+		// was copied out of the ancestor file.
+		//
+		out[0].write(buf, ptr, eol - ptr);
+
+		SCAN: for (ptr = eol; ptr < endOffset; ptr = eol) {
+			eol = nextLF(buf, ptr);
+			switch (buf[ptr]) {
+			case ' ':
+			case '\n':
+			case '\\':
+				out[0].write(buf, ptr, eol - ptr);
+				out[1].write(buf, ptr, eol - ptr);
+				break;
+			case '-':
+				out[0].write(buf, ptr, eol - ptr);
+				break;
+			case '+':
+				out[1].write(buf, ptr, eol - ptr);
+				break;
+			default:
+				break SCAN;
+			}
+		}
+	}
+
+	void extractFileLines(final StringBuilder sb, final String[] text,
+			final int[] offsets) {
+		final byte[] buf = file.buf;
+		int ptr = startOffset;
+		int eol = nextLF(buf, ptr);
+		if (endOffset <= eol)
+			return;
+		copyLine(sb, text, offsets, 0);
+		SCAN: for (ptr = eol; ptr < endOffset; ptr = eol) {
+			eol = nextLF(buf, ptr);
+			switch (buf[ptr]) {
+			case ' ':
+			case '\n':
+			case '\\':
+				copyLine(sb, text, offsets, 0);
+				skipLine(text, offsets, 1);
+				break;
+			case '-':
+				copyLine(sb, text, offsets, 0);
+				break;
+			case '+':
+				copyLine(sb, text, offsets, 1);
+				break;
+			default:
+				break SCAN;
+			}
+		}
+	}
+
+	protected void copyLine(final StringBuilder sb, final String[] text,
+			final int[] offsets, final int fileIdx) {
+		final String s = text[fileIdx];
+		final int start = offsets[fileIdx];
+		int end = s.indexOf('\n', start);
+		if (end < 0)
+			end = s.length();
+		else
+			end++;
+		sb.append(s, start, end);
+		offsets[fileIdx] = end;
+	}
+
+	protected void skipLine(final String[] text, final int[] offsets,
+			final int fileIdx) {
+		final String s = text[fileIdx];
+		final int end = s.indexOf('\n', offsets[fileIdx]);
+		offsets[fileIdx] = end < 0 ? s.length() : end + 1;
+	}
 }
diff --git a/org.spearce.jgit/src/org/spearce/jgit/util/RawParseUtils.java b/org.spearce.jgit/src/org/spearce/jgit/util/RawParseUtils.java
index 55a3001..ff89e9e 100644
--- a/org.spearce.jgit/src/org/spearce/jgit/util/RawParseUtils.java
+++ b/org.spearce.jgit/src/org/spearce/jgit/util/RawParseUtils.java
@@ -472,6 +472,40 @@ public static String decode(final Charset cs, final byte[] buffer) {
 	 */
 	public static String decode(final Charset cs, final byte[] buffer,
 			final int start, final int end) {
+		try {
+			return decodeNoFallback(cs, buffer, start, end);
+		} catch (CharacterCodingException e) {
+			// Fall back to an ISO-8859-1 style encoding. At least all of
+			// the bytes will be present in the output.
+			//
+			return extractBinaryString(buffer, start, end);
+		}
+	}
+
+	/**
+	 * Decode a region of the buffer under the specified character set if
+	 * possible.
+	 * 
+	 * If the byte stream cannot be decoded that way, the platform default is
+	 * tried and if that too fails, an exception is thrown.
+	 * 
+	 * @param cs
+	 *            character set to use when decoding the buffer.
+	 * @param buffer
+	 *            buffer to pull raw bytes from.
+	 * @param start
+	 *            first position within the buffer to take data from.
+	 * @param end
+	 *            one position past the last location within the buffer to take
+	 *            data from.
+	 * @return a string representation of the range <code>[start,end)</code>,
+	 *         after decoding the region through the specified character set.
+	 * @throws CharacterCodingException
+	 *             the input is not in any of the tested character sets.
+	 */
+	public static String decodeNoFallback(final Charset cs,
+			final byte[] buffer, final int start, final int end)
+			throws CharacterCodingException {
 		final ByteBuffer b = ByteBuffer.wrap(buffer, start, end - start);
 		b.mark();
 
@@ -508,9 +542,26 @@ public static String decode(final Charset cs, final byte[] buffer,
 			}
 		}
 
-		// Fall back to an ISO-8859-1 style encoding. At least all of
-		// the bytes will be present in the output.
-		//
+		throw new CharacterCodingException();
+	}
+
+	/**
+	 * Decode a region of the buffer under the ISO-8859-1 encoding.
+	 * 
+	 * Each byte is treated as a single character in the 8859-1 character
+	 * encoding, performing a raw binary->char conversion.
+	 * 
+	 * @param buffer
+	 *            buffer to pull raw bytes from.
+	 * @param start
+	 *            first position within the buffer to take data from.
+	 * @param end
+	 *            one position past the last location within the buffer to take
+	 *            data from.
+	 * @return a string representation of the range <code>[start,end)</code>.
+	 */
+	public static String extractBinaryString(final byte[] buffer,
+			final int start, final int end) {
 		final StringBuilder r = new StringBuilder(end - start);
 		for (int i = start; i < end; i++)
 			r.append((char) (buffer[i] & 0xff));
-- 
1.6.1.rc3.302.gb14d9

-- 
Shawn.