Re: bug report: symbolic-ref --short command echos the wrong text while use Chinese language

Jeff King <peff@xxxxxxxx> · Tue, 14 Feb 2023 11:01:31 -0500

On Tue, Feb 14, 2023 at 01:55:32AM -0500, Eric Sunshine wrote:

> On Tue, Feb 14, 2023 at 1:45 AM Junio C Hamano <gitster@xxxxxxxxx> wrote:
> > Eric Sunshine <sunshine@xxxxxxxxxxxxxx> writes:
> > > Using (presumably) valid LANG codes results in the buggy truncated
> > > output, but "LANG=C" produces the correct result:
> > >
> > >   $ for i in C en_US fr_FR de_DE ru_RU zh_CN; do printf "$i: " &&
> > > LANG=$i.UTF-8 git symbolic-ref --short HEAD; done
> > >   C: 测试-加-增加-加-增加
> > >   en_US: 测试-?
> > >   fr_FR: 测试-?
> > >   de_DE: 测试-?
> > >   ru_RU: 测试-?
> > >   zh_CN: 测试-?
> >
> > So the system cares more than just "is this a valid UTF-8 sequence?"
> > but somehow knows that the given sequence is a valid Chinese and not
> > valid English?  ---oh, no, zh_CN is rejected, but your earlier zh-CN
> > somehow was accepted?
> >
> > Now, it is beyond my ability to guess what macOS is internally doing
> > wrong X-<.
> 
> I  don't think the earlier incorrect "zh-CN" (in which I used "-"
> rather than "_") was accepted. Rather, the system simply didn't
> recognize it, thus presumably fell back to "C" locale. The same
> "correct" output results from any bogus LANG value:
> 
>   $ LANG=bogus git symbolic-ref --short HEAD
>   测试-加-增加-加-增加

Oof. So it is some weird locale thing that scanf is doing. I don't even
want to think about what the details could be. ;)

Since scanf is such a bad and error-prone interface in the first place
(and I'd actually like to put it on the banned list), what about just
parsing manually here? We are already implicitly assuming that each
rev-parse rule has a single "%.*s" in it. Armed with that knowledge,
it's not too hard to match using skip_prefix() and strip_suffix(). Or
with a little bit more custom code, we can avoid the step to pre-process
the rule strings completely. Something like:

 refs.c | 80 ++++++++++++++++++-----------------
 1 file changed, 42 insertions(+), 38 deletions(-)

diff --git a/refs.c b/refs.c
index e31dbcda59..8ec7426c05 100644
--- a/refs.c
+++ b/refs.c
@@ -1310,55 +1310,59 @@ int update_ref(const char *msg, const char *refname,
 			       old_oid, flags, onerr);
 }
 
+/*
+ * Check that the string refname matches a rule of the form
+ * "{prefix}%.*s{suffix}". So "foo/bar/baz" would match the rule
+ * "foo/%.*s/baz", and return the string "bar".
+ */
+static char *match_parse_rule(const char *refname, const char *rule)
+{
+	size_t len;
+
+	/*
+	 * Check that rule matches refname up to the first percent
+	 * in the rule. This is basically skip_prefix(), but
+	 * ending at percent in the prefix, rather than end-of-string.
+	 */
+	do {
+		if (!*rule)
+			BUG("rev-parse rule did not have percent");
+		if (*rule == '%')
+			break;
+	} while (*refname++ == *rule++);
+
+	/*
+	 * Check that we matched all the way to the "%" placeholder,
+	 * and skip past it within the rule string. If so, "refname" at
+	 * this point is the beginning of a potential match.
+	 */
+	if (!skip_prefix(rule, "%.*s", &rule))
+		return NULL;
+
+	/*
+	 * And now check that our suffix (if any) matches.
+	 */
+	if (!strip_suffix(refname, rule, &len))
+		return NULL;
+
+	return xmemdupz(refname, len);
+}
+
 char *refs_shorten_unambiguous_ref(struct ref_store *refs,
 				   const char *refname, int strict)
 {
 	int i;
-	static char **scanf_fmts;
-	static int nr_rules;
 	char *short_name;
 	struct strbuf resolved_buf = STRBUF_INIT;
 
-	if (!nr_rules) {
-		/*
-		 * Pre-generate scanf formats from ref_rev_parse_rules[].
-		 * Generate a format suitable for scanf from a
-		 * ref_rev_parse_rules rule by interpolating "%s" at the
-		 * location of the "%.*s".
-		 */
-		size_t total_len = 0;
-		size_t offset = 0;
-
-		/* the rule list is NULL terminated, count them first */
-		for (nr_rules = 0; ref_rev_parse_rules[nr_rules]; nr_rules++)
-			/* -2 for strlen("%.*s") - strlen("%s"); +1 for NUL */
-			total_len += strlen(ref_rev_parse_rules[nr_rules]) - 2 + 1;
-
-		scanf_fmts = xmalloc(st_add(st_mult(sizeof(char *), nr_rules), total_len));
-
-		offset = 0;
-		for (i = 0; i < nr_rules; i++) {
-			assert(offset < total_len);
-			scanf_fmts[i] = (char *)&scanf_fmts[nr_rules] + offset;
-			offset += xsnprintf(scanf_fmts[i], total_len - offset,
-					    ref_rev_parse_rules[i], 2, "%s") + 1;
-		}
-	}
-
-	/* bail out if there are no rules */
-	if (!nr_rules)
-		return xstrdup(refname);
-
-	/* buffer for scanf result, at most refname must fit */
-	short_name = xstrdup(refname);
-
 	/* skip first rule, it will always match */
-	for (i = nr_rules - 1; i > 0 ; --i) {
+	for (i = NUM_REV_PARSE_RULES - 1; i > 0 ; --i) {
 		int j;
 		int rules_to_fail = i;
 		int short_name_len;
 
-		if (1 != sscanf(refname, scanf_fmts[i], short_name))
+		short_name = match_parse_rule(refname, ref_rev_parse_rules[i]);
+		if (!short_name)
 			continue;
 
 		short_name_len = strlen(short_name);
@@ -1368,7 +1372,7 @@ char *refs_shorten_unambiguous_ref(struct ref_store *refs,
 		 * must fail to resolve to a valid non-ambiguous ref
 		 */
 		if (strict)
-			rules_to_fail = nr_rules;
+			rules_to_fail = NUM_REV_PARSE_RULES;
 
 		/*
 		 * check if the short name resolves to a valid ref,