Robin Rosenberg <robin.rosenberg@xxxxxxxxxx> wrote: > onsdag 10 december 2008 23:05:49 skrev Shawn O. Pearce: > > Git patch files can contain file names which are quoted using the > > C language quoting rules. In order to correctly create or parse > > Should we maybe call this Git-style since we really do not care > about C (which version btw?). Yea, I think you are right. I'll change the name. > Making two interfaces is better. We may share the implementation initially, > but parsing file names in Git patches and parsing C strings are different > operations. Well, if we ever supported other C-style string names we could just swap the implementation reference. I'll try to make it clearer this format, although like a C-style string, is really meant for Git path names within patches. > > + public void testQuote_OctalAll() { > > + assertQuote("\1", "\\001"); > > + assertQuote("~", "\\176"); > > + assertQuote("\u00ff", "\\303\\277"); // \u00ff in UTF-8 > > + } > > What do we do with non-UTF8 names? I think we should > follow the logic we use when parsing commits and paths > in other places. Then we're totally f'd. Git has no specific encoding on file names. If we get a standard Java Unicode string and get asked to quote it characters with code points above 127 need to be escaped as an octal escape code according to the Git style. Further the Git style only permits octal escapes that result in a value <= 255, aka an unsigned char. The name needs to be encoded into an 8-bit encoding, and UTF-8 is the only encoding that will represent every valid Unicode character. Elsewhere we sort of take the attitude that when writing data *out* we produce UTF-8, even if we read in ISO-whatever. Here I'm doing the same thing. > > + public void testDequote_UnknownEscapeQ() { > > + assertDequote("\\q", "\\q"); > > + } > > Would Git generate this style in a name? No, but a mangled patch might have it. Rather than throw an exception I try to parse the string as faithfully as I can, so we can continue on and mine the stream for more information. We may be able to work around the breakage (there's a lot of redundant data about path names in the git patch format). As far as I can tell C Git handles this style of escape the same way: $ cat F diff --git "a/\q" "b/\q" --- "a/\q" +++ "b/\q" $ git apply --stat F "\\q\"" | 0 1 files changed, 0 insertions(+), 0 deletions(-) Actually, I think there's a bug, its reading the closing quote as part of the file name. But anyway, my point is that C git handles an unknown escape sequence as though it were a literal. > > + quote['\b'] = 'b'; > \e = esc Oddly enough, C Git doesn't output \e. It uses \033. It doesn't recognize \e either, so we can't produce it even if we wanted to honor it on input. I can add it to our input table (but keep it out of our output table), but then we'd treat \e as \033 while C Git would treat \e as a literal. Bad. So no, I won't add \e here. > > + return decode(Constants.CHARSET, r, 0, rPtr); > > Importing methods really obscures things. Please qualify with class name > of RawparseUtils here instead. Hmmph. I wonder how you'll feel then about the patch parser code. I import 4 methods from RawParseUtils because they are so commonly invoked that I got tired of reading RawParseUtils.foo. There's only two calls here so I changed it to be qualified. -- Shawn. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html