Re: [JGIT PATCH 4/4] Intelligent parsing of ambiguously encoded meta data.

"Shawn O. Pearce" <spearce@xxxxxxxxxxx> · Sun, 12 Oct 2008 19:27:08 -0700

Robin Rosenberg <robin.rosenberg@xxxxxxxxxx> wrote:
> We cannot trust meta data to be encoded in any particular way, so we try
> different encodings. First we try UTF-8, which is the only sane encoding
> for non-local data, even when used in regions where eight bit legacy
> encodings are common. The chance of mistakenly parsing non-UTF-8 data
> as valid UTF-8 is varies from extremely low (western encodings) to low
> for most other encodings. If the data does not look like UTF-8, we try the
> suggested encoding. If that fails we try the user locale and finally, if
> that fails we try ISO-8859-1, which cannot fail.

Hmm.  I'm concerned about the infinite loop you have here.
If ISO-8859-1 fails we'd be stuck here until the end of time.
Plus its a bit ugly to read.

I wonder if this is any better.  It passes your tests and is 2
lines shorter.

diff --git a/org.spearce.jgit/src/org/spearce/jgit/util/RawParseUtils.java b/org.spearce.jgit/src/org/spearce/jgit/util/RawParseUtils.java
index a31734b..6c0e339 100644
--- a/org.spearce.jgit/src/org/spearce/jgit/util/RawParseUtils.java
+++ b/org.spearce.jgit/src/org/spearce/jgit/util/RawParseUtils.java
@@ -42,7 +42,10 @@
 import static org.spearce.jgit.lib.ObjectChecker.encoding;
 
 import java.nio.ByteBuffer;
+import java.nio.charset.CharacterCodingException;
 import java.nio.charset.Charset;
+import java.nio.charset.CharsetDecoder;
+import java.nio.charset.CodingErrorAction;
 import java.util.Arrays;
 
 import org.spearce.jgit.lib.Constants;
@@ -376,7 +379,10 @@ public static PersonIdent parsePersonIdent(final byte[] raw, final int nameB) {
 	}
 
 	/**
-	 * Decode a region of the buffer under the specified character set.
+	 * Decode a region of the buffer under the specified character set if possible.
+	 *
+	 * If the byte stream cannot be decoded that way, the platform default is tried
+	 * and if that too fails, the fail-safe ISO-8859-1 encoding is tried.
 	 * 
 	 * @param cs
 	 *            character set to use when decoding the buffer.
@@ -393,7 +399,56 @@ public static PersonIdent parsePersonIdent(final byte[] raw, final int nameB) {
 	public static String decode(final Charset cs, final byte[] buffer,
 			final int start, final int end) {
 		final ByteBuffer b = ByteBuffer.wrap(buffer, start, end - start);
-		return cs.decode(b).toString();
+		b.mark();
+
+		// Try our built-in favorite. The assumption here is that
+		// decoding will fail if the data is not actually encoded
+		// using that encoder.
+		//
+		try {
+			return decode(b, Constants.CHARSET);
+		} catch (CharacterCodingException e) {
+			b.reset();
+		}
+
+		if (!cs.equals(Constants.CHARSET)) {
+			// Try the suggested encoding, it might be right since it was
+			// provided by the caller.
+			//
+			try {
+				return decode(b, cs);
+			} catch (CharacterCodingException e) {
+				b.reset();
+			}
+		}
+
+		// Try the default character set. A small group of people
+		// might actually use the same (or very similar) locale.
+		//
+		final Charset defcs = Charset.defaultCharset();
+		if (!defcs.equals(cs) && !defcs.equals(Constants.CHARSET)) {
+			try {
+				return decode(b, defcs);
+			} catch (CharacterCodingException e) {
+				b.reset();
+			}
+		}
+
+		// Fall back to an ISO-8859-1 style encoding. At least all of
+		// the bytes will be present in the output.
+		//
+		final StringBuilder r = new StringBuilder(end - start);
+		for (int i = start; i < end; i++)
+			r.append((char) (buffer[i] & 0xff));
+		return r.toString();
+	}
+
+	private static String decode(final ByteBuffer b, final Charset charset)
+			throws CharacterCodingException {
+		final CharsetDecoder d = charset.newDecoder();
+		d.onMalformedInput(CodingErrorAction.REPORT);
+		d.onUnmappableCharacter(CodingErrorAction.REPORT);
+		return d.decode(b).toString();
 	}
 
 	/**


-- 
Shawn.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html