Re: Compilation of lengthy C++ Files


Dear everyone,

Clang now compiles the lengthy code of 600k LOC into a functioning
object file that can be linked and yields the expected result.
Clang had merely refused to perform a certain pointer type-cast implicitly.

Given that CLANG compiles the object, what can be inferred?

Further questions:
- Are there any pragmas that can help the compiler identify code portions
with linear compile complexity?
- Which compiler-flag combinations should I try to increase the chances of
successfully compiling lengthy files?
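For what it is worth, here is the flag sketch I would try first. These are real GCC options, but whether they help depends on the code; `huge.cpp` is just a placeholder for the generated file:

```shell
# Disable optimization and debug-info variable tracking, which often
# dominate compile time on huge machine-generated functions.
g++ -O0 -fno-var-tracking -c huge.cpp

# Make GCC's internal garbage collector run more eagerly
# (trades compile speed for lower peak memory).
g++ -O0 --param ggc-min-expand=10 --param ggc-min-heapsize=32768 -c huge.cpp

# Report where the compile time actually goes, per pass.
g++ -O0 -ftime-report -c huge.cpp
```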


@Jonathan, I am happy to clarify:

> >
> > 1) Will any generated application be limitable into a certain volume of
> > source-code? Certainly not, so why even argue on one particular instance
> of
> > volume?
>
> I have no idea what this means.
>

Can every program on earth be written in at most X lines of code? If so,
what is X?
Only if such an X exists and is known can one discuss whether source code
of length Y should exist.


> > 2) May it necessitate significant amounts of work to enable a
> > code-generator to produce shorter code? Certainly.
>
> Well at the moment your code doesn't even compile. So you can keep
> generating billion line functions that don't work, or you could try
> refactoring it. It might not be easy, but if it compiles and works,
> surely that's better than billion line functions that don't work?
>

But surely, executing code of length N is an easier problem (linear in
N) than deciding whether code of length N can be reduced to length
M (exponentially hard in N).
Also, being able to execute code of length N is far more useful than
knowing, for one niche class of problems, by how much N can be reduced --
let alone whether the resulting M is sufficiently smaller.


> > 3) Suppose a code-generator exploits structure of a given user-provided
> > problem instance for generation of shorter code. Will this mean the set
> of
> > feasible problems treatable by the code-generator shrinks? Likely.
>
> I don't see why that should be true.
>

In order to exploit a structure that the code-generator would track in the
user-provided instance, that detection logic would have to be
implemented, verified, executed, and function reliably.
Whenever the structure does not apply, the benefit cannot be reaped: the
program length does not shrink, and all the effort was in vain.

In the context of information theory, a general rule of thumb applies:
attempting to solve problems more "smartly" yields a solver that is more
efficient but less robust.
Efficiency, robustness, and genericity always form a Pareto front. Think
of higher-order numerical methods, which always add overhead and only pay
off when the problem instances are sufficiently smooth.
Or think of advanced alpha-beta-pruning schemes that are only beneficial
for highly structured niche problems, such as chess.


>
> > 4) Is 99% of the code trivial and should be compilable in read-time? I
> > believe so, given my naive information.
>
> A billion lines of trivial operations still consumes ridiculous
> amounts of resources to compile. Your naive view doesn't seem
> relevant.
>

I am a prisoner of my own perspective, so I must respect that you say I
cannot know.
I do know that I could evaluate my particular code with pencil and paper
in time linear in its length -- my human machine-speed is just nine
orders of magnitude too slow.


> > 5) Should months of work be invested into generating shorter code in
> order
> > for that to be able to compile in 5 minutes when really this code is
> > compiled and used only once in a lifetime? I am unsure about this.
> >
> > The stinging issue of argument may be on aspect 4), which makes it seem
> as
> > though the other 99% should be wrappable into functions.
> > The, if we call it so, mis-expectation is that then clean interfaces to
> > these functions are generatable or may even not exist. To give an
> example,
> > suppose a skein of code-strings in which each invoked 1% non-trivial code
> > is changing one string so that it induces chaos into the naming of all
> > other temporary variable names, causing mismatch between contiguity
> between
> > data at code-gen time and generated-code time. That is the central issue
> > that I --admittedly-- brute-force by committing into lengthy code.
>
> I don't understand this either, sorry.
>

Argument 5 was about the proportionality between working time and compile
time. I would rather spend 5 minutes on coding and 5 months on compiling
than 1 week on coding and 5 minutes on compiling, unless the code must be
compiled and run multiple times.

The elaboration on point 4 gives a vivid description of a code pattern
that results in chaos, supporting the hypothesis that reducing a program
of length N is in O(exp(N)), whereas running it is in O(N).


> Nothing you have said actually explains why you can't refactor the
> code into utility functions that avoid doing everything in one huge
> function.
>

Because no two pieces of code are exactly identical. The relative
distance between two variables involved in an identical formula changes
with every line.
Example:
Line 1000: tmp[100]=foo(tmp[101],tmp[102]);
Line 2000: tmp[200]=foo(tmp[201],tmp[203]); // dang it, not tmp[202] but tmp[203]
It is like a Penrose tiling: it all seems identical, but the details
break it. You simply do not find two identical local areas -- nowhere. And
if you do, you had to search for them by brute force, and that search
becomes useless whenever the particular pattern you look for simply does
not exist in that particular user's instance.


>
> >
> > >> Jonathan: Have you tried using clang instead?
> >
> > I tried ICC (won't install successfully on my Windows 10 PC) and then
> CLANG
> > 17.0.0 .
> >
> > CLANG requires a provided STL, so I used the ones from Microsoft Visual
> > Studio 2022, which I referenced via the -isystem flag.
> > In order to compile it, I had to make tons of changes that g++ would not
> > have complained about; like putting a "d" at the end of a floating-point
> > value.
>
> Eh?!
>
> You must be doing something wrong. Clang does not require that.
>

As I now found, ICC might not work because my CPU is an AMD.
I was told LLVM does not come with its own STL; I understand you say it
does.
While g++ accepts Tfloat a=0.1d;, the same appeared untrue for clang.

Kind regards,
Kai


