On Tue, Jul 29, 2014 at 12:56:24PM -0700, Junio C Hamano wrote: > Nguyễn Thái Ngọc Duy <pclouds@xxxxxxxxx> writes: > > > When utf8_width(&src) is called with *src == NULL (because the > > source string ends with an ansi sequence), > > I am not sure what you mean by "because" here. Do you mean somebody > (who?) decides to call the utf8_width() with NULL pointer stored in > *src because of "ansi sequence"? > > What do you mean by "ansi sequence"? I'll assume that you mean > those terminal control that all use bytes with hi-bit clear. OK let's try with an example, strbuf_utf8_replace(&sb, 0, 0, "") where "sb" contains "\033[m". The expected result is exactly the same string with length of 3. After this block while ((n = display_mode_esc_sequence_len(src))) { memcpy(dst, src, n); src += n; dst += n; } src will be right after 'm', *src is NUL (iow. src == end). After n = utf8_width((const char**)&src, NULL); n is zero but 'src' is advanced by another character, so src - old_src == 1. This NUL character is counted as part of the string, so after the _setlen, we have the same string with length of _4_ (instead of 3). > At the very beginning of utf8_width(), *start can be cleared to > point at a NULL pointer by pick_one_utf8_char() if the pointer that > comes into utf8_width() originally points at an invalid UTF-8 > string, but as far as I can see, ESC (or any bytes that would be > used in those terminal control sequences like colors and cursor > control) will simply be returned as a single byte, without going > into error path that clears *start = NULL. > > Puzzled... > > > it returns 0 and steps 'src' by one. > > Here "it" refers to utf8_width()? Who steps 'src' by one? utf8_width() steps 'src'. > > Ahh, did you mean *src == NUL, i.e. "already at the end of the > string"? Yes.. I guess you have a better commit message prepared for me now :) > I think utf8_width() called with an empty string should not move the > pointer past that end-of-string NUL in the first place. It makes me > wonder if it would be a better fix to make it not to do that (and > return 0), but if we declare it is the caller's fault, perhaps we > may want to add > > if (!**start) > die("BUG: empty string to utf8_width()???"); > > at the very beginning of utf8_width(), even before it calls > pick-one-utf8-char. > > Still puzzled... I don't know. I mean, in a UTF-8 byte sequence, NUL is a valid character (part of the ASCII subset), so advancing '*start' by one makes sense. Whether we have any call sites crazy enough to do that on purpose is a different matter. > > This stepping makes strbuf_utf8_replace add NUL to the > > destination string at the end of the loop. Check and break the loop > > early. > > > > Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@xxxxxxxxx> > > --- > > utf8.c | 2 ++ > > 1 file changed, 2 insertions(+) > > Tests? Ugh.. this was triggered by one of the alignment specifier in log --pretty. Probably easier to export strbuf_utf8_replace() as a test-command and verify the output? -- Duy -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html