Re: [PATCH 00/11] pragma once: treewide conversion

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Fri, 5 Mar 2021 13:23:38 -0800

On Fri, Mar 5, 2021 at 1:19 AM David Laight <David.Laight@xxxxxxxxxx> wrote:
>
> The point is that you can skip the unwanted parts of
> #if without having to parse the file at all.
> You just need to detect the line breaks.

That's not actually true AT ALL.

You still need to at the very least parse the preprocessor tokens,
looking for things like #if, #else, and #endif. Because those can -
and do - nest inside the whole thing, so you're not even looking for
the final #endif, you have to be aware that there might be new #if
statements, which means that you now have to increment the nesting
count for the #endif.
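
As a rough sketch of that nesting bookkeeping (the helper names
directive_name(), is_word() and find_matching_endif() are made up for
illustration, and the sketch deliberately ignores comments, strings
and #else/#elif at depth 1):

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Return a pointer to the directive name if the line starts with '#'
 * (after optional blanks), or NULL for an ordinary line. */
static const char *directive_name(const char *line)
{
    while (*line == ' ' || *line == '\t')
        line++;
    if (*line != '#')
        return NULL;
    line++;
    while (*line == ' ' || *line == '\t')
        line++;
    return line;
}

/* Does the directive name at 'p' equal 'word' exactly? */
static int is_word(const char *p, const char *word)
{
    size_t n = strlen(word);

    return strncmp(p, word, n) == 0 &&
           !isalnum((unsigned char)p[n]) && p[n] != '_';
}

/*
 * 'lines' holds the lines that follow a "#if 0". Return the index of
 * the line with the matching #endif, or -1 if there is none.
 */
int find_matching_endif(const char *const *lines, int nlines)
{
    int depth = 1;  /* we are already inside one #if */

    for (int i = 0; i < nlines; i++) {
        const char *d = directive_name(lines[i]);

        if (!d)
            continue;
        if (is_word(d, "if") || is_word(d, "ifdef") || is_word(d, "ifndef"))
            depth++;            /* nested conditional opens */
        else if (is_word(d, "endif") && --depth == 0)
            return i;           /* this one closes the outer #if */
    }
    return -1;
}
```

Even this toy version has to recognize directive names on every line;
counting line breaks alone cannot tell a nested #endif from the one
that closes the region.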

And to do even just *THAT*, you need to do all the comment parsing,
and all the string parsing, because an #endif means something entirely
different if there was a "/*" or a string on a previous line that
hasn't been terminated yet (yes, unterminated strings are bad
practice, but ..).
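
A minimal sketch of why the comment state matters, assuming a
simplified model: only block-comment state carries across lines, string
and character literals are treated as ending at the line break, and a
'#' only counts if its line started outside a comment. The names
skip_state, scan_line() and find_endif_comment_aware() are hypothetical:

```c
#include <string.h>

/* Scanner state that has to survive from one line to the next. */
struct skip_state {
    int in_comment;  /* inside an unterminated block comment */
    int depth;       /* #if nesting depth; 1 right after the "#if 0" */
};

/* Walk one line and update st->in_comment. Returns 1 if the line
 * started outside a comment, so a leading '#' may be a directive. */
static int scan_line(struct skip_state *st, const char *s)
{
    int started_clean = !st->in_comment;

    while (*s) {
        if (st->in_comment) {
            if (s[0] == '*' && s[1] == '/') {
                st->in_comment = 0;
                s += 2;
            } else {
                s++;
            }
        } else if (s[0] == '/' && s[1] == '*') {
            st->in_comment = 1;
            s += 2;
        } else if (s[0] == '/' && s[1] == '/') {
            break;              /* rest of the line is a line comment */
        } else if (*s == '"' || *s == '\'') {
            char q = *s++;      /* skip the literal so a quoted comment
                                 * opener does not start a comment */
            while (*s && *s != q) {
                if (*s == '\\' && s[1])
                    s++;
                s++;
            }
            if (*s)
                s++;
        } else {
            s++;
        }
    }
    return started_clean;
}

/* Index of the #endif matching an enclosing "#if 0", or -1. */
int find_endif_comment_aware(const char *const *lines, int nlines)
{
    struct skip_state st = { 0, 1 };

    for (int i = 0; i < nlines; i++) {
        const char *p = lines[i];

        if (!scan_line(&st, p))
            continue;           /* line began inside a comment */
        p += strspn(p, " \t");
        if (*p != '#')
            continue;
        p++;
        p += strspn(p, " \t");
        if (strncmp(p, "if", 2) == 0)       /* #if, #ifdef, #ifndef */
            st.depth++;
        else if (strncmp(p, "endif", 5) == 0 && --st.depth == 0)
            return i;
    }
    return -1;
}
```

Fed a region whose second line is "#endif" sitting inside an open block
comment, this skips that line and keeps looking, which is exactly what
a scanner that only detects line breaks cannot do.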

And regardless of even _those_ issues, you still should do all the
other syntactic tokenization stuff (ie all the quoting and the
character handling): 'a' is a valid C token, but if you see the string
"it's" outside of a comment, that's a syntax error even if it's inside
a disabled region. IOW, this is an incorrect file:

   #if 0
   it's a bug to do this, and the compiler should scream
   #endif

because it's simply not a valid token sequence. The fact that it's
inside a "#if 0" region doesn't change that fact at all.  So you do
need to do all the tokenization logic.
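
To make the "it's" case concrete, here is a small sketch of the quote
handling involved. The function name has_unterminated_literal() is
invented for illustration, it does not handle comments, and how a given
compiler reports such a line is up to that compiler; the sketch only
detects the condition:

```c
/* Return 1 if the line contains a ' or " literal that is never closed
 * before the end of the line, 0 otherwise. Comments are not handled. */
int has_unterminated_literal(const char *s)
{
    while (*s) {
        if (*s == '"' || *s == '\'') {
            char q = *s++;          /* remember which quote opened it */

            while (*s && *s != q) {
                if (*s == '\\' && s[1])
                    s++;            /* skip the escaped character */
                s++;
            }
            if (!*s)
                return 1;           /* hit end of line while still open */
            s++;                    /* step over the closing quote */
        } else {
            s++;
        }
    }
    return 0;
}
```

The line from the example above trips this check because of the lone
apostrophe, while the same apostrophe inside a double-quoted string
does not, so the check cannot be done without tracking quoting state.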

The same goes for all the wide string stuff, the trigraph horrors, etc etc.

End result: you still need to do basically all of the basic lexing,
and while you can then usually quickly throw the result mostly away
(and you could possibly use a simplified lexer _because_ you throw it
away), you don't actually win much. Doing a specialized lexer
just for the disabled regions is probably simply a bad idea: the fact
that you still need to do all the #if nesting etc checking means that
you still need to do a modicum of tokenization etc.

Really: the whole "trivial" front-end parsing phase of C - and
particularly C++ - is a huge huge deal. It's going to show in the
profiles of the compiler quite decisively, unless you have a compiler
that then spends absolutely insane time on optimization and tries to
do things that basically no sane compiler does (because developers
wouldn't put up with the time sink).

So yes, I've even used things like super-optimizers that chew on small
pieces of code for _days_ because they have insane exponential costs
etc. I've never done it seriously, because it really isn't realistic,
but it can be a fun exercise to try.

Outside of those kinds of super-optimizers, lexing and parsing is a
big big deal.

(And again - this is very much language-specific.  The C/C++ model of
header files is very very flexible, and has a lot of conveniences, but
it's also a big part of why the front end is such a big deal. Other
language models have other trade-offs).

             Linus