On Thu, Mar 29, 2018 at 05:19:00PM +0200, Johannes Schindelin wrote: > Let's generalize this observation to this conservative strategy: if we > are removing the last entry from a section, and there are no comments > inside that section nor surrounding it, then remove the entire section. > Otherwise behave as before: leave the now-empty section (including those > comments, even the one about the now-deleted entry). Yep, as I said earlier, this makes a ton of sense to me. > +/* > + * This function determines whether the offset is in a line that starts with a > + * comment character. > + * > + * Note: it does *not* report when a regular line (section header, config > + * setting) *ends* in a comment. > + */ > +static int is_in_comment_line(const char *contents, size_t offset) > +{ > + int comment = 0; > + > + while (offset > 0) > + switch (contents[--offset]) { > + case ';': > + case '#': > + comment = 1; > + break; > + case '\n': > + break; > + case ' ': > + case '\t': > + continue; > + default: > + comment = 0; > + } > + > + return comment; > +} This doesn't pay any attention to quoting, so I wondered if it would get fooled by a line like: key = "this content has a # comment in it" or even: [section "this section has a # comment in it"] but those don't count because the line doesn't _start_ with the comment character. Could we design one that does? This isn't valid: [section] key = multiline \ # with comment But I think this is: [section] key = "multiline \ # with comment" So let's see if we can fool it: -- >8 -- cat >file <<-\EOF [one] key = "multiline \ # with comment" [two] key = true EOF # should produce "multiline # with comment" ./git config --file=file one.key # this should ideally remove the section ./git config --file=file --unset two.key cat file -- 8< -- That seems to work as expected. I'm not 100% sure why, though, since I thought we'd hit the "seen_section && !is_in_comment_line" bit of the look_before loop. Running it through gdb, I'm not convinced that is_in_comment_line is working correctly, though. Shouldn't it stop when it sees the newline, and return "comment"? There's a "break" there, but it doesn't break out of the loop due to the switch statement. So we'll _always_ walk back to the beginning of file. So I suspect your test passes because it does: # this is the start of the file [section] key = true but: [anotherSection] key = true # a comment not at the start [section] key = true does the wrong thing, and removes [section]. If we fix that bug like this: diff --git a/config.c b/config.c index b04c40f76b..3b2c7e9387 100644 --- a/config.c +++ b/config.c @@ -2461,7 +2461,7 @@ static int is_in_comment_line(const char *contents, size_t offset) comment = 1; break; case '\n': - break; + return comment; case ' ': case '\t': continue; then it keeps "[section]" correctly. But now if we go back to our funny multiline example, it does the wrong thing (it keeps [two], even though that's not _really_ a comment). To be honest, I could live with that as an open bug. It's a pretty ridiculous situation, and the worst case is that we err on the side of caution and don't remove the section. And I think it would be hard to fix. We could look for the continuation backslash when we find the newline, but that gets fooled by: # a comment \ # with a pointless backslash You can't just notice the quote and say "oh, I'm in a quoted section" because that gets fooled by: # a pointless "quote To know whether that quote is valid or not, you have to find the other quote. But doing that backwards is hard (if not impossible). > +static void maybe_remove_section(const char *contents, size_t size, > + const char *section_name, > + size_t section_name_len, > + size_t *begin, int *i_ptr, int *new_line) > +{ > + size_t begin2, end2; > + int seen_section = 0, dummy, i = *i_ptr; > + > + /* > + * First, make sure that this is the last key in the section, and that > + * there are no comments that are possibly about the current section. > + */ > +next_entry: > + for (end2 = store.offset[i]; end2 < size; end2++) { > + switch (contents[end2]) { > + case ' ': > + case '\t': > + case '\n': > + continue; > + case '\r': > + if (++end2 < size && contents[end2] == '\n') > + continue; > + break; > + case '[': > + /* If the section name is repeated, continue */ > + if (end2 + 1 + section_name_len < size && > + contents[end2 + section_name_len] == ']' && > + !memcmp(contents + end2 + 1, section_name, > + section_name_len)) { > + end2 += section_name_len; > + continue; > + } > + goto look_before; > + case ';': > + case '#': > + /* There is a comment, cannot remove this section */ > + return; > + default: > + /* There are other keys in that section */ > + break; > + } OK, this all makes sense. We're scanning forward to find the next '[', without finding any keys or comments. We don't have to worry about quoting because we'd quit as soon as we see a key anyway. I like the special-case for finding our same section name, since that would help clean up cruft from existing versions of Git. It looks like there may be an off-by-one, though. Should it be checking: contents[end2 + 1 + section_name_len] == ']' to skip over the opening '['? In a simple example: [foo] bar = true [foo] we don't seem to remove the second section header. It works with the patch below: diff --git a/config.c b/config.c index b04c40f76b..48dcb52840 100644 --- a/config.c +++ b/config.c @@ -2508,10 +2508,10 @@ static void maybe_remove_section(const char *contents, size_t size, case '[': /* If the section name is repeated, continue */ if (end2 + 1 + section_name_len < size && - contents[end2 + section_name_len] == ']' && + contents[end2 + 1 + section_name_len] == ']' && !memcmp(contents + end2 + 1, section_name, section_name_len)) { - end2 += section_name_len; + end2 += section_name_len + 1; continue; } goto look_before; Unfortunately I think this whole thing breaks down with subsections. If we try this: [foo "subsection"] bar = true [foo "subsection"] then the section_name variable contains "foo.subsection", which we can't textually match. And we end up failing to remove either section (the latter one because of this loop, and the former because of the same problem in the look_before loop). > + /* > + * Uh oh... we found something else in this section. But do > + * we want to remove this, too? > + */ > + if (++i >= store.seen) > + return; > + > + begin2 = find_beginning_of_line(contents, size, store.offset[i], > + &dummy); > + if (begin2 > end2) > + return; > + > + /* Looks like we want to remove the next one, too... */ > + goto next_entry; > + } OK, makes sense. > +look_before: > + /* > + * Now, ensure that this is the first key, and that there are no > + * comments before the entry nor before the section header. > + */ > + for (begin2 = *begin; begin2 > 0; ) > + switch (contents[begin2 - 1]) { > + case ' ': > + case '\t': > + begin2--; > + continue; > + case '\n': > + if (--begin2 > 0 && contents[begin2 - 1] == '\r') > + begin2--; > + continue; > + case ']': > + if (begin2 > section_name_len + 1 && > + contents[begin2 - section_name_len - 2] == '[' && > + !memcmp(contents + begin2 - section_name_len - 1, > + section_name, section_name_len)) { > + begin2 -= section_name_len + 2; > + seen_section = 1; > + continue; > + } OK, this is the backwards mirror image of the earlier part. Which makes sense. And this handles the reverse case for the doubled section name: [foo] [foo] bar = true because we'd hit this section-name check twice, and just set "seen_section = 1" both times. So that works (modulo the subsection parsing thing). As far as quoting goes, now we're coming from the back of each line now. And I don't think we strictly require double-quotes around string values. So imagine this: [one] foo = this has [brackets] bar = this does not When deleting one.bar, we'd erroneously think that closing bracket is the prior section header. I _think_ it behaves correctly, though, because we then say "well, delete everything back to that bracket character". Which happens to be the correct thing to do anyway. But let's get more devious. What about this: [one] foo = fake section [one] bar = whatever If I unset foo.bar with your patch, I end up with the truncated: [one] foo = fake sectio Yikes. This is obviously a ridiculous example, but the failure case is pretty nasty. Again, the tricky thing here is that we're parsing backwards. We don't know what's syntactically relevant and what isn't. > + > + /* > + * It looks like a section header, but it could be a > + * comment instead... > + */ > + if (is_in_comment_line(contents, begin2)) > + return; This would get fooled if we allowed line continuation in subsection names, like: [one "subsection\ # with newline"] key = true but it looks like our parser doesn't allow that (aside from it being slightly insane, of course). Good. > + /* > + * We encountered the previous section header: This > + * really was the only entry, so remove the entire > + * section. > + */ > + if (contents[begin2] != '\n') { > + begin2--; > + *new_line = 1; > + } > + > + store.offset[i] = end2; > + *begin = begin2; > + *i_ptr = i; > + return; OK, makes sense. > + default: > + /* > + * Any other character means it is either a comment or > + * a config setting; if it is a comment, we do not want > + * to remove this section. If it is a config setting, > + * we only want to remove this section if this is > + * already the next section. > + */ > + if (seen_section && > + !is_in_comment_line(contents, begin2)) { > + if (contents[begin2] != '\n') { > + begin2--; > + *new_line = 1; > + } > + > + store.offset[i] = end2; > + *begin = begin2; > + *i_ptr = i; > + } > + return; > + } Here's where we get fooled by is_in_comment_line() that I showed at the beginning. We don't have to worry about other quoting, because any key (quoted or not) would cause us to abort, since it's in the section. > + /* This section extends to the beginning of the file. */ > + store.offset[i] = end2; > + *begin = begin2; > + *i_ptr = i; > +} Right, makes sense. Ok, phew. That was a tough read. So here's what I see: 1. Minor bug in is_in_comment_line(), patch above. 2. Minor bug in matching section names, patch above. 3. Matching subsection names doesn't work. I think this should be fixable with a helper function which can match '[one "two"]' when given "one.two". 4. Backwards parsing causes is_in_comment_line to trigger more than it should. I can live with that because the trigger is arcane, and the error behavior is pretty harmless. 5. Backwards parsing can find a bogus section. Also arcane, but the error behavior is pretty scary. (4) and (5) are the ones that I don't see a way to fix, given the current way in which we do the config-writing (i.e., running it through the regular read-parser and then trying to "patch up" the found locations). I think that's also what's contributing to the code being hard to read, since you end up doing quite a lot of manual re-parsing. I think the sane way to do this would be to parse the whole thing into a tree (that includes things like comments and whitespace), and then we could much more easily manipulate that tree, without dealing with the parsing (forwards _and_ backwards). But that's a pretty big change from the current code. It also potentially means duplicating the parsing logic, unless we teach the regular reader to do the tree-parse, and then pick out the config from that. That's likely much slower than the existing parser (since we'd allocate a bunch of tree nodes instead of just dumping strings to the callbacks). But these days we cache the parsed config anyway, so I'm not sure if a slight slowdown would actually matter that much. I guess the holy grail would be a parser which reports _all_ syntactic events (section names, keys, comments, whitespace, etc) as a stream without storing anything. And then the normal reader could just discard the non-key events, and the writer here could build the tree from those events. -Peff