If you define all of your variables as BINARY-C-LONG, the generated code is closer to your C code.
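For reference, a minimal sketch of such declarations (assuming GnuCOBOL's
USAGE BINARY-C-LONG and reusing the item names from the tspeed programs
quoted below; these lines are illustrative, not the source that produced
the dump that follows):

---
01 i  usage binary-c-long value 0.
01 j  usage binary-c-long value 0.
01 k  usage binary-c-long value 0.
01 k0 usage binary-c-long value 0.
01 k1 usage binary-c-long value 1.
---

With all items declared that way the generated code looks like this (in the
dump below, b_8, b_9, b_10, b_11 and b_12 appear to correspond to i, j, k,
k0 and k1):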
l_2:;
  /* Line: 21 : MOVE : tstspd.bat */
  (*(cob_s64_ptr)(b_9)) = 0;
  (*(cob_s64_ptr)(b_10)) = 0;
  /* Line: 22 : PERFORM : tstspd.bat */
  (*(cob_s64_ptr)(b_8)) = 0;
  for (;;)
    {
      if ( ((*(cob_s64_ptr)(b_8)) >= 10000000) )
        break;
      /* Line: 23 : IF : tstspd.bat */
      if (((int)cob_cmp_s64 (b_10, (*(cob_s64_ptr)(b_11))) == 0))
        {
          /* Line: 24 : ADD : tstspd.bat */
          cob_add (&f_9, &f_8, 0);
          /* Line: 25 : MOVE : tstspd.bat */
          memcpy (b_10, b_12, 8);
        }
      else
        {
          /* ELSE */
          /* Line: 27 : SUBTRACT : tstspd.bat */
          cob_sub (&f_9, &f_8, 0);
          /* Line: 28 : MOVE : tstspd.bat */
          memcpy (b_10, b_11, 8);
        }
      (*(cob_s64_ptr)(b_8)) = ((*(cob_s64_ptr)(b_8)) + 1);
    }
On Wed, Dec 30, 2020 at 10:25 AM Christian Lademann (ZLS) <lademann@xxxxxx> wrote:
Hello Simon,
thank you for these tips.
I've looked at the generated C code. The "killer" seems to be
cob_move(), which is obviously necessary to move data between different
variable types and structures. In this case it's the moves "move 1 to k"
and "move 0 to k". Since source and destination have differing
structures (pic 9 comp <--> constant 0/1) cob_move() does the job.
But a simple modification of the program changes the timing dramatically:
---(tspeed3.cob)---
identification division.
program-id. tspeed3.
data division.
working-storage section.
01 i pic 9(7) comp-5 value 0.
01 j pic s9(14) comp-5 value 0.
01 k pic 9 comp-5 value 0.
01 k0 pic 9 comp-5 value 0.
01 k1 pic 9 comp-5 value 1.
PROCEDURE DIVISION.
move 0 to j, k
perform varying i from 0 by 1 until i >= 10000000
if k = k0
add i to j
move k1 to k
else
subtract i from j
move k0 to k
end-if
end-perform
display j upon console
stop run.
---
$ time ./tspeed3
-00000000000005000000
real 0m0.076s
user 0m0.070s
sys 0m0.003s
---
$ time ./tspeed2
-00000000000005000000
real 0m0.382s
user 0m0.380s
sys 0m0.000s
---
Now the compiler generates a direct assignment instead of a cob_move().
The same could apply to the comparison ("if k = 0" <--> "if k = k0") as
well: in both cases a "cob_cmp_u8()" is generated, but in the latter case
a direct comparison would be possible, just like the direct assignment
is possible (and way faster).
Wouldn't it be possible to optimize these cases at compile time?
-fnotrunc gives the same boost here as well, but it might be "too invasive"
for legacy code.
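As a sketch of why it can be "too invasive" (an illustrative program, not
taken from the thread; the item n and the program name ttrunc are made up):
with the default rules, a store into a binary item is still truncated to its
PICTURE digits, and -fnotrunc removes exactly that behaviour, so results can
change whenever a value overflows the PICTURE:

---(ttrunc.cob)---
identification division.
program-id. ttrunc.
data division.
working-storage section.
01 n pic s9(4) comp-5 value 9999.
procedure division.
add 1 to n
*> with default binary truncation the stored result is still limited to
*> 4 decimal digits; with -fnotrunc the full native value 10000 is kept
display n upon console
stop run.
---

Compiled with and without -fnotrunc the stored value can differ, which is
the kind of semantic change legacy code may depend on.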
Thank you very much.
Best regards,
Christian
On 30.12.20 at 15:27 Simon Sobisch wrote:
> On 30.12.2020 at 11:48 Christian Lademann (ZLS) wrote:
>> Hello gnucobol team
>>
>> first of all let me point out how happy I am that there is an open
>> source COBOL implementation that is impressively capable and compatible
>> and growing from release to release. I've been experimenting with
>> open-cobol and gnucobol for quite some time now.
>
> Thanks. We're happy to be able to provide GnuCOBOL :-)
>
>> While writing some test programs for comparing computational speed, I
>> got the impression that there might be some bottlenecks.
>
> Of course there are differences as the rules for arithmetic differ
> between C and COBOL.
>
>> For example:
>>
>> ---(tspeed2.cob)---
>>
>> identification division.
>> program-id. tspeed2.
>> data division.
>> working-storage section.
>>
>> 01 i pic 9(7) comp-5 value 0.
>> 01 j pic s9(14) comp-5 value 0.
>> 01 k pic 9 comp-5 value 0.
>>
>> procedure division.
>>
>> move 0 to j, k
>> perform varying i from 0 by 1 until i >= 10000000
>> if k = 0
>> add i to j
>> move 1 to k
>> else
>> subtract i from j
>> move 0 to k
>> end-if
>> end-perform
>>
>> display j upon console
>> stop run.
>> ---
>>
>> $ cobc -x -O2 tspeed2.cob
>> $ time ./tspeed2
>> -000000005000000
>>
>> real 0m0.373s
>> user 0m0.367s
>> sys 0m0.003s
>>
>>
>> ---(c-tspeed2.c)---
>>
>> #include <stdio.h>
>>
>> int main(void) {
>> long i, j;
>> int k;
>>
>> j = 0;
>> k = 0;
>>
>> for(i = 0; i < 10000000; i++) {
>> if(k)
>> j -= i;
>> else
>> j += i;
>>
>> k = !k;
>> }
>>
>> printf("j=%ld\n", j);
>> }
>>
>> ---
>>
>> $ make c-tspeed2
>> $ time ./c-tspeed2
>> j=-5000000
>>
>> real 0m0.027s
>> user 0m0.027s
>> sys 0m0.000s
>>
>> The C variant of this benchmark is more than 10 times faster!
>>
>> I know that this is just a very limited-scope look at performance and
>> maybe the comparison is unfair.
>
> That's actually a reasonable test; the only "unfair" part is that you
> have a very similar test, but COBOL "works differently", so you're testing
> different things ;-)
>
>> But I might be missing something.
>>
>> Are there other performance tweaks, besides "-O2"?
>
> To get the fastest COBOL: ensure that GnuCOBOL itself is heavily optimized:
>
> * install the most current C compiler available
>
> * build GMP (or MPIR) from source; this will normally produce a library
> that is bound to at least the current CPU features (after changing the
> CPU you'd want to redo that part)
>
> * get the most current GnuCOBOL version and build it from source with
> CFLAGS="-O2 -march=native", pointing to the installed libgmp/libmpir
>
> That will normally provide you with a better overall speed, in most
> cases more than cobc -O2 will provide.
>
> cobc -O2 something.cob always leads to longer compilation times, but the
> generated modules are often not much faster than with -O; -Os often does
> make a difference in the size of the module; -O0 is sometimes necessary
> if you compile big sources and cobc has to use an outdated compiler (or
> new MSVC...).
>
> The biggest difference you'll normally see for heavy arithmetic is when
> you explicitly disable ANSI truncation rules (also true for other COBOL
> compilers).
>
> cobc -x -fnotrunc tspeed2.cob
>
> will give you a much faster result (around 2 times the runtime of the C
> program, and it actually _does_ nearly the same thing as your C program).
>
> Note: in any case COBOL modules always have the additional workload of
> initializing the GnuCOBOL runtime - which on this machine here takes
> nearly the same time as the complete C program... - so the _actual_
> workload of the COBOL module with -fnotrunc is nearly identical to the C
> program :-)
>
> To calculate the time needed for initializing the runtime: compile and
> profile an "empty" COBOL program.
>
> As soon as a COBOL program does "common" I/O, that additional startup time
> does not really count, but for using COBOL as a scripting language (which
> is possible with cobc -xj) it would be reasonable to provide a switch to
> prevent most of the runtime initialization (at least the parts that
> check the process environment).
>
>> Thank you very much.
>>
>> Best regards,
>> Christian.
>
> You're welcome,
> Simon
>
--
* Christian Lademann, ZLS Software GmbH mailto:lademann@xxxxxx
*
* ZLS Software GmbH
* Frankfurter Straße 59 Telefon +49-6195-9902-0 mailto:zls@xxxxxx
* D-65779 Kelkheim Telefax +49-6195-900600 http://www.zls.de
*
* Geschäftsführung John A. Shuter
* Handelsregister Amtsgericht Königstein, HRB 3105
Cheers
Ron Norman