Junio C Hamano <gitster@xxxxxxxxx> 于2021年5月28日周五 上午11:04写道: > > "ZheNing Hu via GitGitGadget" <gitgitgadget@xxxxxxxxx> writes: > > > The raw data of blob, tree objects may contain '\0', but most of > > the logic in `ref-filter` depands on the output of the atom being > > a structured string (end with '\0'). > > Text, yes, string is also OK. But structured? Probably not. > > ... being text (specifically, no embedded NULs in it). > OK. > > Beyond, `--format=%(raw)` should not combine with `--python`, `--shell`, > > `--tcl`, `--perl` because if our binary raw data is passed to a variable > > in the host language, the host languages may cause escape errors. > > OK. I think at least --perl and possibly --python should be able to > express NULs in the "string" type we use from the host language, but > I am perfectly fine with the decision to leave it to later updates. > Yes, for the time being, unified processing --<lang> will be easier. > > +Note that `--format=%(raw)` should not combine with `--python`, `--shell`, `--tcl`, > > "should not combine" -> "cannot be used" would make it read more > naturally (ditto for the phrase used in the proposed log message). > OK, so the wording is better. > > > > struct atom_value { > > const char *s; > > + size_t s_size; > > OK, so everything used to be a C-string that cannot hold NULs in it, > but now it is a counted <ptr, len> string. Good. > This suddenly reminded me of strbuf... I don't know if it is worth replacing all s, s_size with strbuf. > > @@ -588,6 +607,10 @@ static int parse_ref_filter_atom(const struct ref_format *format, > > return strbuf_addf_ret(err, -1, _("malformed field name: %.*s"), > > (int)(ep-atom), atom); > > > > + if (format->quote_style && starts_with(sp, "raw")) > > + return strbuf_addf_ret(err, -1, _("--format=%.*s should not combine with" > > + "--python, --shell, --tcl, --perl"), (int)(ep-atom), atom); > > Don't we want to allow "raw:size" that would be a plain text? > I am not sure if this check belongs here in the first place. > Shouldn't it be done in raw_atom_parser() instead? > You are right: "raw:size" should be keep, but I can't this check to raw_atom_parser(), becase "if I move it to raw_atom_parser(), when we use: `git ref-filter --format=%raw --sort=raw --python` Git will continue to run instead of die because parse_sorting_atom() will use a dummy ref_format and don't remember --language details, next time format_ref_array_item() will reuse the used_atom entry of sorting atom in parse_ref_filter_atom(), this will skip the check in raw_atom_parser()." > Another idea is to teach a more generic rule to quote_formatting() > to detect NULs in v->s[0..v->s_size] at runtime and barf, i.e. a > plain-text blob object can be used with "--shell --format=%(raw)" > just fine. > The cost of such a check is not small. Maybe can add an option such as "--only-text" to do it. > > @@ -652,11 +675,14 @@ static int parse_ref_filter_atom(const struct ref_format *format, > > return at; > > } > > > > -static void quote_formatting(struct strbuf *s, const char *str, int quote_style) > > +static void quote_formatting(struct strbuf *s, const char *str, size_t len, int quote_style) > > { > > switch (quote_style) { > > case QUOTE_NONE: > > - strbuf_addstr(s, str); > > + if (len != ATOM_VALUE_S_SIZE_INIT) > > + strbuf_add(s, str, len); > > + else > > + strbuf_addstr(s, str); > > It probably is a good idea to invent a C preprocessor macro for a > named constant to be used when a structure is initialized, but it > would be easier to read if the rule is "len field is negative when > the value is a C-string", e.g. > > if (len < 0) > I do not recognize such an approach because we are deal with "size_t s_size", if (len < 0) will never be established. I use -1 is because it's equal to 18446744073709551615 and it's impossible to have such a large file in Git. > > @@ -785,14 +814,16 @@ static int if_atom_handler(struct atom_value *atomv, struct ref_formatting_state > > return 0; > > } > > > > -static int is_empty(const char *s) > > +static int is_empty(struct strbuf *buf) > > { > > - while (*s != '\0') { > > - if (!isspace(*s)) > > - return 0; > > + const char *s = buf->buf; > > + size_t cur_len = 0; > > + > > + while ((cur_len != buf->len) && (isspace(*s) || *s == '\0')) { > > s++; > > + cur_len++; > > } > > - return 1; > > + return cur_len == buf->len; > > } > > Assuming that we do want to treat NULs the same way as whitespaces, > the updated code works as intended, which is good. But I have no > reason to support that design decision. I do not have a strong > reason to support a design decision that goes the opposite way to > treat a NUL just like we treat an 'X', but at least I can understand > it (i.e. "because we have no reason to special case NUL any more > than 'X' when trying to see if a buffer is 'empty', we don't"). > This code on the other hand must be supported with "because we need > to special case NUL for such and such reasons for the purpose of > determining if a buffer is 'empty', we treat them the same way as > whitespaces". > Something like "\0abc", from the perspective of the string, it is empty; from the perspective of the memory, it is not empty; I don't know any absolutely good solutions here. > Can the change in this commit violate the invariant that > if_then_else->str cannot be NULL, which seems to have been the case > forever as we see an unchecked strcmp() done in the original? > > If so, perhaps you can check the condition upfront, where you > compute str_len above, e.g. > > if (!if_then_else->str) { > if (if_then_else->cmp_status == COMPARE_EQUAL || > if_then_else->cmp_status == COMPARE_UNEQUAL) > BUG(...); > } else > str_len = strlen(...); > > If not, then I do not see the point of adding this (and later) check > with BUG to this code. > > Or is the invariant that .str must not be NULL could have been > violated without this patch (i.e. the original was buggy in running > strcmp() on .str without checking)? If so, please make it a separate > preliminary change to add such an assert. > The BUG() here actually acts as an "assert()". ".str must not be NULL" is right, it point to "xxx" in "%(if:equals=xxx)", so it seems that these BUG() are somewhat redundant, I will remove them. > > /* See grab_values */ > > -static void grab_sub_body_contents(struct atom_value *val, int deref, void *buf) > > +static void grab_raw_data(struct atom_value *val, int deref, void *buf, unsigned long buf_size, struct object *obj) > > { > > int i; > > const char *subpos = NULL, *bodypos = NULL, *sigpos = NULL; > > @@ -1307,10 +1349,22 @@ static void grab_sub_body_contents(struct atom_value *val, int deref, void *buf) > > continue; > > if (deref) > > name++; > > - if (strcmp(name, "body") && > > - !starts_with(name, "subject") && > > - !starts_with(name, "trailers") && > > - !starts_with(name, "contents")) > > + > > + if (starts_with(name, "raw")) { > > + if (atom->u.raw_data.option == RAW_BARE) { > > + v->s = xmemdupz(buf, buf_size); > > + v->s_size = buf_size; > > + } else if (atom->u.raw_data.option == RAW_LENGTH) > > + v->s = xstrfmt("%"PRIuMAX, (uintmax_t)buf_size); > > + continue; > > + } > > I can understand that "raw[:<options>]" handling has been inserted > above the existing "from here on, we only deal with log message > components" check. But > > > + if ((obj->type != OBJ_TAG && > > + obj->type != OBJ_COMMIT) || > > I do not see why these new conditions are added. The change is not > justified in the proposed log message, the original did not need > these conditions, and this does not concern the primary point of the > change, which is to start supporting %(raw[:<option>]) placeholder. > > If it is needed as a bugfix (e.g. it may be that you consider "if a > blob has contents that looks very similar to 'git cat-file commit > HEAD', %(body) and friends parse these out, even though it is not a > commit" is a bug and the change to add these extra tests is meant as > a fix), that should be done as a preliminary change before adding > the support for a new atom. > Almost what I means: Make a strong guarantee that blob and tree will never pass the check so that we can don't worry about incorrect parsing in find_subpos(). The reason I put it in this patch is that only commit and tag objects will execute `grab_sub_body_contents()` before, but in the current patch it has changed. > > + (strcmp(name, "body") && > > + !starts_with(name, "subject") && > > + !starts_with(name, "trailers") && > > + !starts_with(name, "contents"))) > > continue; > > if (!subpos) > > find_subpos(buf, > > @@ -1374,25 +1428,30 @@ static void fill_missing_values(struct atom_value *val) > > * pointed at by the ref itself; otherwise it is the object the > > * ref (which is a tag) refers to. > > */ > > -static void grab_values(struct atom_value *val, int deref, struct object *obj, void *buf) > > +static void grab_values(struct atom_value *val, int deref, struct object *obj, struct expand_data *data) > > { > > + void *buf = data->content; > > + unsigned long buf_size = data->size; > > + > > switch (obj->type) { > > case OBJ_TAG: > > grab_tag_values(val, deref, obj); > > - grab_sub_body_contents(val, deref, buf); > > + grab_raw_data(val, deref, buf, buf_size, obj); > > It is very strange that a helper that is named to grab raw data can > still process pieces out of a structured data. The original name is > still a far better match to what the function does, even after this > patch teaches it to also honor %(raw) placeholder. It is still > about grabbing various "sub"-pieces out of "body contents", and the > sub-piece the %(raw) grabs just happens to be "the whole thing". > Well, I can't think of a better name, My original idea was grab_raw_data() can grab itself, header, contents, It is more general than grab_sub_body_contents(), raw data is not a part of "subject" or "body" of "contents"... > > +static int memcasecmp(const void *vs1, const void *vs2, size_t n) > > +{ > > + size_t i; > > + const char *s1 = (const char *)vs1; > > + const char *s2 = (const char *)vs2; > > + > > + for (i = 0; i < n; i++) { > > + unsigned char u1 = s1[i]; > > + unsigned char u2 = s2[i]; > > + int U1 = toupper (u1); > > + int U2 = toupper (u2); > > Does toupper('\0') even have a defined meaning? > > > + int diff = (UCHAR_MAX <= INT_MAX ? U1 - U2 > > + : U1 < U2 ? -1 : U2 < U1); > > Looks crazy to worry about uchar wider than int. Such a system is > not even standard compliant, is it? > Forget about this inelegant help function. As I said in my reply to Felipe, this is copied from gunlib... > > @@ -2304,6 +2382,7 @@ static int cmp_ref_sorting(struct ref_sorting *s, struct ref_array_item *a, stru > > int cmp_detached_head = 0; > > cmp_type cmp_type = used_atom[s->atom].type; > > struct strbuf err = STRBUF_INIT; > > + size_t slen = 0; > > > > if (get_ref_atom_value(a, s->atom, &va, &err)) > > die("%s", err.buf); > > @@ -2317,10 +2396,32 @@ static int cmp_ref_sorting(struct ref_sorting *s, struct ref_array_item *a, stru > > } else if (s->sort_flags & REF_SORTING_VERSION) { > > cmp = versioncmp(va->s, vb->s); > > } else if (cmp_type == FIELD_STR) { > > - int (*cmp_fn)(const char *, const char *); > > - cmp_fn = s->sort_flags & REF_SORTING_ICASE > > - ? strcasecmp : strcmp; > > - cmp = cmp_fn(va->s, vb->s); > > > + if (va->s_size == ATOM_VALUE_S_SIZE_INIT && > > + vb->s_size == ATOM_VALUE_S_SIZE_INIT) { > > + int (*cmp_fn)(const char *, const char *); > > + cmp_fn = s->sort_flags & REF_SORTING_ICASE > > + ? strcasecmp : strcmp; > > + cmp = cmp_fn(va->s, vb->s); > > + } else { > > + int (*cmp_fn)(const void *, const void *, size_t); > > + cmp_fn = s->sort_flags & REF_SORTING_ICASE > > + ? memcasecmp : memcmp; > > Why not introduce two local temporary variables a_size and b_size > and initialize them upfront like so: > > a_size = va->s_size < 0 ? strlen(va->s) : va->s_size; > b_size = vb->s_size < 0 ? strlen(vb->s) : vb->s_size; > > Wouldn't that allow you to do without the complex "if both are > counted, do this, if A is counted but B is not, do that, ..." > cascade? > Sorry, such code would be really ugly for reading. > I can sort-of see the point of special casing "both are traditional > C strings" case (i.e. the "if" side of the "else" we are discussing > here) and using strcasecmp/strcmp instead of memcasecmp/memcmp, but > I do not see much point in having the if/elseif/else cascade inside > this "else" clause. > I will try to modify its logic. > Thanks. Your reply is very accurate. Thanks. -- ZheNing Hu