viability to turn on -mavx / -mavx2 by default

Christian Krause <kizkizzbangbang@xxxxxxxxxxxxxx> · Fri, 13 Feb 2015 10:54:58 +0100

Hello,

I have an "Intel(R) Core(TM) i5-4250U CPU @ 1.30GHz" (according to
/proc/cpuinfo), a Haswell CPU whose flags include all of the following:

    sse sse2 ssse3 sse4_1 sse4_2 avx avx2

I usually compile with these optimization flags:

    CFLAGS="-march=native -mfpmath=sse -O2"
    CXXFLAGS="-march=native -mfpmath=sse -O2"

Now to my questions:

1.  Since I have avx and avx2, is it viable to add -mavx / -mavx2 to the 
above-mentioned flags to improve overall optimization? Does 
-march=native enable these automatically?

2.  Would adding these flags improve performance in all / most cases? 
Are there cases where adding -mavx / -mavx2 backfires, e.g. by hurting 
performance?

3.  Are these flags worth anything even if the code does not 
explicitly use any AVX/AVX2 instructions?

4.  The GCC 4.9.2 man page says:

    > GCC depresses SSEx instructions when -mavx is used. Instead,
    > it generates new AVX instructions or AVX equivalence for all
    > SSEx instructions when needed.

    What happens with -mfpmath=sse when -mavx / -mavx2 is enabled? Or 
does this mean only the -msse* flags are depressed?

    Since -msse2avx is turned on by -mavx automatically, are the 
instructions generated for -mfpmath=sse also encoded with a VEX prefix? 
Would it not make sense to add a -mfpmath=avx option?

5.  If I have both avx and avx2, will -mavx2 switch on -mavx 
automatically? (This is not covered by the man page.)

To answer some of my questions myself, I ran (see output attached):

    gcc -march=native -mfpmath=sse -O2 -Q --help=target -v

The output shows that -march=native turns on -mavx and -mavx2 by default 
on my Haswell CPU. However, -msse2avx is still disabled. Shouldn't this 
be enabled as well, since -mavx is enabled? Is this a bug?

Best Regards

Attached output:

Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-unknown-linux-gnu/4.9.2/lto-wrapper
Target: x86_64-unknown-linux-gnu
Configured with: /build/gcc-multilib/src/gcc-4.9-20150204/configure --prefix=/usr --libdir=/usr/lib --libexecdir=/usr/lib --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=https://bugs.archlinux.org/ --enable-languages=c,c++,ada,fortran,go,lto,objc,obj-c++ --enable-shared --enable-threads=posix --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-clocale=gnu --disable-libstdcxx-pch --disable-libssp --enable-gnu-unique-object --enable-linker-build-id --enable-cloog-backend=isl --enable-lto --enable-plugin --enable-install-libiberty --with-linker-hash-style=gnu --enable-multilib --disable-werror --enable-checking=release
Thread model: posix
gcc version 4.9.2 20150204 (prerelease) (GCC) 
COLLECT_GCC_OPTIONS='-march=native' '-mfpmath=sse' '-O2' '-Q' '--help=target' '-v'
 /usr/lib/gcc/x86_64-unknown-linux-gnu/4.9.2/cc1 -v help-dummy -march=haswell -mmmx -mno-3dnow -msse -msse2 -msse3 -mssse3 -mno-sse4a -mcx16 -msahf -mmovbe -maes -mno-sha -mpclmul -mpopcnt -mabm -mno-lwp -mfma -mno-fma4 -mno-xop -mbmi -mbmi2 -mno-tbm -mavx -mavx2 -msse4.2 -msse4.1 -mlzcnt -mno-rtm -mno-hle -mrdrnd -mf16c -mfsgsbase -mno-rdseed -mno-prfchw -mno-adx -mfxsr -mxsave -mxsaveopt -mno-avx512f -mno-avx512er -mno-avx512cd -mno-avx512pf -mno-prefetchwt1 --param l1-cache-size=32 --param l1-cache-line-size=64 --param l2-cache-size=3072 -mtune=haswell -dumpbase help-dummy -mfpmath=sse -auxbase help-dummy -O2 -version --help=target -o /tmp/ccfdt7Ak.s
The following options are target specific:
  -m128bit-long-double        		[disabled]
  -m16                        		[disabled]
  -m32                        		[disabled]
  -m3dnow                     		[disabled]
  -m3dnowa                    		[disabled]
  -m64                        		[enabled]
  -m80387                     		[enabled]
  -m8bit-idiv                 		[disabled]
  -m96bit-long-double         		[enabled]
  -mabi=                      		sysv
  -mabm                       		[enabled]
  -maccumulate-outgoing-args  		[disabled]
  -maddress-mode=             		short
  -madx                       		[disabled]
  -maes                       		[enabled]
  -malign-double              		[disabled]
  -malign-functions=          		0
  -malign-jumps=              		0
  -malign-loops=              		0
  -malign-stringops           		[enabled]
  -mandroid                   		[disabled]
  -march=                     		haswell
  -masm=                      		att
  -mavx                       		[enabled]
  -mavx2                      		[enabled]
  -mavx256-split-unaligned-load 	[disabled]
  -mavx256-split-unaligned-store 	[disabled]
  -mavx512cd                  		[disabled]
  -mavx512er                  		[disabled]
  -mavx512f                   		[disabled]
  -mavx512pf                  		[disabled]
  -mbionic                    		[disabled]
  -mbmi                       		[enabled]
  -mbmi2                      		[enabled]
  -mbranch-cost=              		0
  -mcld                       		[disabled]
  -mcmodel=                   		32
  -mcpu=                      		
  -mcrc32                     		[disabled]
  -mcx16                      		[enabled]
  -mdispatch-scheduler        		[disabled]
  -mdump-tune-features        		[disabled]
  -mf16c                      		[enabled]
  -mfancy-math-387            		[enabled]
  -mfentry                    		[enabled]
  -mfma                       		[enabled]
  -mfma4                      		[disabled]
  -mforce-drap                		[disabled]
  -mfp-ret-in-387             		[enabled]
  -mfpmath=                   		sse
  -mfsgsbase                  		[enabled]
  -mfused-madd                		
  -mfxsr                      		[enabled]
  -mglibc                     		[enabled]
  -mhard-float                		[enabled]
  -mhle                       		[disabled]
  -mieee-fp                   		[enabled]
  -mincoming-stack-boundary=  		0
  -minline-all-stringops      		[disabled]
  -minline-stringops-dynamically 	[disabled]
  -mintel-syntax              		
  -mlarge-data-threshold=     		0x10000
  -mlong-double-128           		[disabled]
  -mlong-double-64            		[disabled]
  -mlong-double-80            		[enabled]
  -mlwp                       		[disabled]
  -mlzcnt                     		[enabled]
  -mmemcpy-strategy=          		
  -mmemset-strategy=          		
  -mmmx                       		[enabled]
  -mmovbe                     		[enabled]
  -mms-bitfields              		[disabled]
  -mno-align-stringops        		[disabled]
  -mno-default                		[disabled]
  -mno-fancy-math-387         		[disabled]
  -mno-push-args              		[disabled]
  -mno-red-zone               		[disabled]
  -mno-sse4                   		[disabled]
  -momit-leaf-frame-pointer   		[disabled]
  -mpc32                      		[disabled]
  -mpc64                      		[disabled]
  -mpc80                      		[disabled]
  -mpclmul                    		[enabled]
  -mpopcnt                    		[enabled]
  -mprefer-avx128             		[disabled]
  -mpreferred-stack-boundary= 		0
  -mprefetchwt1               		[disabled]
  -mprfchw                    		[disabled]
  -mpush-args                 		[enabled]
  -mrdrnd                     		[enabled]
  -mrdseed                    		[disabled]
  -mrecip                     		[disabled]
  -mrecip=                    		
  -mred-zone                  		[enabled]
  -mregparm=                  		0
  -mrtd                       		[disabled]
  -mrtm                       		[disabled]
  -msahf                      		[enabled]
  -msha                       		[disabled]
GNU C (GCC) version 4.9.2 20150204 (prerelease) (x86_64-unknown-linux-gnu)
	compiled by GNU C version 4.9.2 20150204 (prerelease), GMP version 6.0.0, MPFR version 3.1.2-p11, MPC version 1.0.2
GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
  -msoft-float                		[disabled]
  -msse                       		[enabled]
  -msse2                      		[enabled]
  -msse2avx                   		[disabled]
  -msse3                      		[enabled]
  -msse4                      		[enabled]
  -msse4.1                    		[enabled]
  -msse4.2                    		[enabled]
  -msse4a                     		[disabled]
  -msse5                      		
  -msseregparm                		[disabled]
  -mssse3                     		[enabled]
  -mstack-arg-probe           		[disabled]
  -mstack-protector-guard=    		tls
  -mstackrealign              		[enabled]
  -mstringop-strategy=        		[default]
  -mtbm                       		[disabled]
  -mtls-dialect=              		gnu
  -mtls-direct-seg-refs       		[enabled]
  -mtune-ctrl=                		
  -mtune=                     		haswell
  -muclibc                    		[disabled]
  -mveclibabi=                		[default]
  -mvect8-ret-in-mem          		[disabled]
  -mvzeroupper                		[disabled]
  -mx32                       		[disabled]
  -mxop                       		[disabled]
  -mxsave                     		[enabled]
  -mxsaveopt                  		[enabled]

  Known assembler dialects (for use with the -masm-dialect= option):
    att intel

  Known ABIs (for use with the -mabi= option):
    ms sysv

  Known code models (for use with the -mcmodel= option):
    32 kernel large medium small

  Valid arguments to -mfpmath=:
    387 387+sse 387,sse both sse sse+387 sse,387

  Known vectorization library ABIs (for use with the -mveclibabi= option):
    acml svml

  Known address mode (for use with the -maddress-mode= option):
    long short

  Known stack protector guard (for use with the -mstack-protector-guard= option):
    global tls

  Valid arguments to -mstringop-strategy=:
    byte_loop libcall loop rep_4byte rep_8byte rep_byte unrolled_loop
    vector_loop

  Known TLS dialects (for use with the -mtls-dialect= option):
    gnu gnu2

COLLECT_GCC_OPTIONS='-march=native' '-mfpmath=sse' '-O2' '-Q' '--help=target' '-v'
 as -v --64 -o /tmp/cc5v6wIa.o /tmp/ccfdt7Ak.s
GNU assembler version 2.25.0 (x86_64-unknown-linux-gnu) using BFD version (GNU Binutils) 2.25.0