Re: [PATCH 1/3] config.txt: describe whitespace characters further and more accurately

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 2024-03-14 02:18, Junio C Hamano wrote:
Dragan Simic <dsimic@xxxxxxxxxxx> writes:
-The syntax is fairly flexible and permissive; whitespaces are mostly
-ignored. The '#' and ';' characters begin comments to the end of line,
-blank lines are ignored.
+The syntax is fairly flexible and permissive.  Whitespace characters,
+which in this context are the space character (SP) and the horizontal
+tabulation (HT), are mostly ignored. The '#' and ';' characters begin
+comments to the end of line.  Blank lines are ignored.

OK, except for "whitespace characters"---do we need to say
"whitespace characters", after we already listed HT and SP are the
ones, instead of just "whitespaces"?

I also spent some time thinking about that. To me, the plural form, i.e. "whitespaces", simply doesn't sound very good, because "whitespace" feels
to me more like a mass noun, and I really haven't seen it used in plural
form in other projects.

 A line that defines a value can be continued to the next line by
+ending it with a `\`; the backslash and the end-of-line are stripped.
+Leading whitespace characters after 'name =', the remainder of the
 line after the first comment character '#' or ';', and trailing
+whitespace characters of the line are discarded unless they are enclosed +in double quotes. This discarding of the trailing whitespace characters +also applies after the remainder of the line after the comment character
+is discarded.

"also" makes it sound as if we do it twice, once to remove trailing
whitespaces after the remainder of the line after '#", and then trim
the trailing whitespaces after we removed the comment.

Good point, I also felt it the same way, but went with such wording simply
because I thought it should be more understandable to the users, despite
being technically a bit incorrect.

I wonder if we can make it clearer by following the step-by-step
nature of the earlier part of the paragraph through.  We already say
the folded line processing is done first, so break things down in
conceptual phases/steps, perhaps like

 * The backslash at the end-of-line is removed, together with the
   end-of-line, to form a single long line.

 * Anything that come after the first unquoted comment character,
   either '#' or ';', are discarded.

 * The leading and trailing whitespaces around the value part
   (i.e. what follows 'name =') are discarded.

 * Remaining unquoted whitespaces inside the value part are munged.

Hmm, I'm not really sure that such a description would be more clear
to the users, despite being technically more correct.  I'll think a bit
more about it.

+Any number of internal whitespace characters found within
+the value are converted to the same number of space (SP) characters.

The last one sounds like a bug to me, by the way.

At least the very original 17712991 (Add ".git/config" file parser,
2005-10-10) squashed a run of whitespace characters into a single
SP, which makes sense as a "clean-up".

But ebdaae37 (config: Keep inner whitespace verbatim, 2009-07-30),
while claiming to "Keep" inner whitespaces, broke it by replacing
any isspace() bytes that are not SP with SP, contradicting its
stated purpose.

Thank you for the investigation.  The ebdaae37 commit certainly
introduced a bug to the value parsing, which presumably has remained
undected because the included test passes.

The way I see it, fixing the bug may actually be a breaking change,
because some user configurations may actually rely on the current
(mis)behavior.  This makes me somewhat afraid that fixing this bug,
which I already thought about, may actually do more harm than good.
However, fixing this bug seems to be only right thing to do, which
I'll explain further below.

As the latest change by the author of that change is from more than
10 years ago, I do not expect that he is still interested in this
part of the codebase, but thanks to a very clearly written log
message, we can read what the motivation behind that change was, and
seeing that what the code does contradicts with the stated
motivation we can safely declare that this is an ancient bug.

Agreed, the evidence is clear.

Fixing that bug can of course be left outside the series.  For those
who are looking for microproject ideas who discovered this message
by searching for the #leftoverbits keyword, one possible fix would
be to revert ebdaae37, make sure a value with any whitespace in it
gets quoted, and document clearly that an unquoted run of
whitespaces is squashed into a single SP.  Another way that is
milder is to finish what ebdaae37 wanted to do and retain the
whitespaces "verbatim".

I already though about fixing the bug so the value parser actually does
what git-config(1) currently says, but as I already noted above, I'm
afraid a bit that fixing this bug may actually do more harm than good.

Though, further investigation shows that setting a configuration value,
by invoking git-config(1), converts value-internal tabs into "\t" escape
sequences, which the value-parsing logic doesn't "squash" into spaces.
That's why the test included in the ebdaae37 commit passes. On the other
hand, value-internal literal tab characters, found in a configuration
file, do get "squashed" by the value-parsing logic, so I'd say that the
only right thing to do is to fix this bug by making the value-internal
whitespace characters preserved verbatim.

I'd be happy to include the bugfix into this series, if my above-mentioned
fears prove to be unnecessary.




[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux