On Fri, Mar 5, 2021 at 1:19 AM David Laight <David.Laight@xxxxxxxxxx> wrote:
>
> The point is that you can skip the unwanted parts of
> #if without having to parse the file at all.
> You just need to detect the line breaks.

That's not actually true AT ALL.

You still need to at the very least parse the preprocessor tokens,
looking for things like #if, #else, and #endif. Those can - and do -
nest inside the whole thing, so you're not just looking for the final
#endif: you have to be aware that there might be new #if statements
that mean you now have to increment the nesting count for the
matching #endif.

And to do even just *THAT*, you need to do all the comment parsing
and all the string parsing, because a #endif means something entirely
different if there was a "/*" or a string on a previous line that
hasn't been terminated yet (yes, unterminated strings are bad
practice, but...).

And regardless of even _those_ issues, you still should do all the
other syntactic tokenization (ie all the quoting and the character
handling: 'a' is a valid C token, but if you see the string "it's"
outside of a comment, that's a syntax error even if it's inside a
disabled region).

IOW, this is an incorrect file:

    #if 0
    it's a bug to do this, and the compiler should scream
    #endif

because it's simply not a valid token sequence. The fact that it's
inside a "#if 0" region doesn't change that fact at all. So you do
need to do all the tokenization logic. The same goes for all the wide
string stuff, the trigraph horrors, etc etc.

End result: you still need to do basically all of the basic lexing,
and while you can then usually quickly throw the result mostly away
(and you could possibly use a simplified lexer _because_ you throw it
away), you actually didn't really win much.
Doing a specialized lexer just for the disabled regions is probably
simply a bad idea: the fact that you still need to do all the #if
nesting checks means that you still need to do a modicum of
tokenization anyway.

Really: the whole "trivial" front-end parsing phase of C - and
particularly C++ - is a huge huge deal. It's going to show in the
profiles of the compiler quite decisively, unless you have a compiler
that spends absolutely insane time on optimization and tries to do
things that basically no sane compiler does (because developers
wouldn't put up with the time sink).

So yes, I've even used things like super-optimizers that chew on
small pieces of code for _days_ because they have insane exponential
costs etc. I've never done it seriously, because it really isn't
realistic, but it can be a fun exercise to try. Outside of those
kinds of super-optimizers, lexing and parsing is a big big deal.

(And again - this is very much language-specific. The C/C++ model of
header files is very very flexible, and has a lot of conveniences,
but it's also a big part of why the front end is such a big deal.
Other language models have other trade-offs.)

        Linus