Re: [PATCH] log --format: document %w

René Scharfe <rene.scharfe@xxxxxxxxxxxxxx> · Mon, 23 Nov 2009 00:46:34 +0100

Junio C Hamano schrieb:
> René Scharfe <rene.scharfe@xxxxxxxxxxxxxx> writes:
> 
>> I'm not especially proud of the triple negative in that note.  How to say it
>> better, yet concise?
>> +- '%w([<w>[,<i1>[,<i2>]]])': switch line wrapping, like the -w option of
>> +  linkgit:git-shortlog[1].  NOTE: Color placeholders (`%C*`) are not
>> +  recognized as having no width, so they should not be put into wrapped
>> +  sections.
> 
> "The code miscounts the width of '%C*' color placeholders"?
> 
> Perhaps somebody in the codepath leading to pick_one_utf8_char() in utf8.c
> can be made aware of them?
> 
> utf8_width() is called from many places (has one caller outside utf8.c as
> well).  It is given a pointer to a pointer that points at the current
> position in a string, and is responsible for picking up one logical letter
> advancing the given pointer to skip over that letter, and returning the
> display width of that one letter.  The function wants the string to be
> encoded in utf-8 and signals by putting NULL in the pointer when it
> detects the input string is not.
> 
> Picking up one logical letter is done by pick_one_utf8_char(), which is a
> nicely written generic "We are at the character boundary of a potentially
> multi-byte utf-8 string; pick the first character" implementation, and we
> wouldn't want to contaminate that with escape sequence logic---we might
> want to reuse it in other codepaths where we have no reason to expect any
> escape sequences.
> 
> So perhaps we can introduce is_esc_sequence(s, r, w) that
> 
>  - returns true if we are at the beginning of an esc-sequence;
>  - skips the sequence just like utf8_width() does with s and r; and
>  - counts the width of the sequence and returns it in *w
> 
> The implementation of the is_esc_sequence() could be to only detect the
> color sequence (if the sequence has things like cursor-position control
> then we are already lost, as calling "utf8_width()" on such a string does
> not make much sense anyway) and report zero-width.
> 
> I dunno.
> 
> diff --git a/utf8.c b/utf8.c
> index 5c18f0c..d45e75f 100644
> --- a/utf8.c
> +++ b/utf8.c
> @@ -241,7 +241,12 @@ invalid:
>   */
>  int utf8_width(const char **start, size_t *remainder_p)
>  {
> -	ucs_char_t ch = pick_one_utf8_char(start, remainder_p);
> +	ucs_char_t ch;
> +	int w;
> +
> +	if (is_esc_sequence(start, remainder_p, &w))
> +		return w;
> +	ch = pick_one_utf8_char(start, remainder_p);
>  	if (!*start)
>  		return 0;
>  	return git_wcwidth(ch);

I think utf8_width() is too generic for that; we shouldn't teach it terminal
control details.  Something like this?  It keeps it all local to
strbuf_add_wrapped_text(); ignoring display mode escape codes in there can be
justified with its purpose.

 utf8.c |   23 ++++++++++++++++++++++-
 1 files changed, 22 insertions(+), 1 deletions(-)

diff --git a/utf8.c b/utf8.c
index 5c18f0c..fcc0aeb 100644
--- a/utf8.c
+++ b/utf8.c
@@ -298,6 +298,21 @@ static void print_spaces(struct strbuf *buf, int count)
 	strbuf_write(buf, s, count);
 }
 
+/* XXX: this handles display mode sequences, only.  Do we need more? */
+static size_t esc_sequence_len(const char *s)
+{
+	const char *p = s;
+	if (*p++ != '\033')
+		return 0;
+	if (*p++ != '[')
+		return 0;
+	while (isdigit(*p) || *p == ';')
+		p++;
+	if (*p++ != 'm')
+		return 0;
+	return p - s;
+}
+
 /*
  * Wrap the text, if necessary. The variable indent is the indent for the
  * first line, indent2 is the indent for all other lines.
@@ -329,7 +344,13 @@ int strbuf_add_wrapped_text(struct strbuf *buf,
 	}
 
 	for (;;) {
-		char c = *text;
+		char c;
+		size_t skip;
+
+		while ((skip = esc_sequence_len(text)))
+			text += skip;
+
+		c = *text;
 		if (!c || isspace(c)) {
 			if (w < width || !space) {
 				const char *start = bol;
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html