On Thu, May 8, 2008 at 3:13 AM, Junio C Hamano <gitster@xxxxxxxxx> wrote: > I haven't tested this at all (this is a lunchtime hack) and have a mild > suspicion that it may have corner case miscounting (e.g. I blindly > subtracts 3 from len when dealing with a line that represents a single > token from the internal diff output --- do I always have 3 there even when > the original file ends with an incomplete line? I didn't check), but > other than that I think this is a lot easier to read and follow. > > --- > > diff.c | 216 +++++++++++++++++++++++++++++++-------------------------------- > 1 files changed, 106 insertions(+), 110 deletions(-) > > diff --git a/diff.c b/diff.c > index e35384b..344aaa6 100644 > --- a/diff.c > +++ b/diff.c > @@ -351,87 +351,119 @@ static int fill_mmfile(mmfile_t *mf, struct diff_filespec *one) > return 0; > } > > -struct diff_words_buffer { > - mmfile_t text; > - long alloc; > - long current; /* output pointer */ > - int suppressed_newline; > +typedef unsigned long (*sane_truncate_fn)(char *line, unsigned long len); > + > +struct emit_callback { > + struct xdiff_emit_state xm; > + int nparents, color_diff; > + unsigned ws_rule; > + sane_truncate_fn truncate; > + const char **label_path; > + struct diff_words_data *diff_words; > + int *found_changesp; > + FILE *file; > }; > > -static void diff_words_append(char *line, unsigned long len, > - struct diff_words_buffer *buffer) > +static size_t diff_words_tokenize(struct emit_callback *ecbdata, > + char *line, unsigned long len) > { > - if (buffer->text.size + len > buffer->alloc) { > - buffer->alloc = (buffer->text.size + len) * 3 / 2; > - buffer->text.ptr = xrealloc(buffer->text.ptr, buffer->alloc); > + /* > + * This function currently is deliberately done very stupid, > + * but passing ecbdata here means that you can potentially > + * implement different tokenization rules depending on > + * the content (e.g. "gitattributes(5)"). > + */ > + int is_space; > + char *line0 = line; > + > + if (!len) > + return 0; > + > + is_space = isspace(*line); > + while (len && (isspace(*line) == is_space)) { > + line++; > + len--; > } > + return line - line0; > +} > + > +static void diff_words_append(struct emit_callback *ecbdata, > + char *line, unsigned long len, > + struct strbuf *text) > +{ > + /* Skip leading +/- first. */ > line++; > len--; > - memcpy(buffer->text.ptr + buffer->text.size, line, len); > - buffer->text.size += len; > + > + /* > + * Tokenize and stuff the words in. > + */ > + while (len) { > + size_t token_len = diff_words_tokenize(ecbdata, line, len); > + > + if (line[0] != '\n') { > + /* > + * A nonempty token has ' ' stuffed in front, > + * so that we can recover the original > + * end-of-line easily. Stupid, but works. > + */ > + strbuf_add(text, " ", 1); > + strbuf_add(text, line, token_len); > + strbuf_add(text, "\n", 1); > + len -= token_len; > + line += token_len; I still don't understand why a ' ' is prepended. See my comment for the following part > + if (line[1] == ' ') { > + /* A token */ > + line += 2; > + len -= 3; /* drop the trailing LF */ > + } else { > + /* A real LF */ > + line++; > + len--; > } I think we can recognize a real LF by that the diff line should be a single '\n', i.e. line[1] == '\n'. So what's wrong by s/line[1] == ' '/line[1] != '\n'/ ? -- Ping Yin -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html