As part of our benchmarking exercise on A5808-32, we started using the PGI compilers. After a number of experiments, the following compiler switches seem to give the best performance. Unless stated otherwise, the flags must be provided during both the compilation and the linking phases.

-fast: The usual starting point; it is a shorthand that enables a common set of optimizations. On 64-bit platforms, -fast implies -fastsse.

-fastsse: Enable optimizations that use SSE (SIMD) instructions.

-Mipa: Enable interprocedural analysis and optimization (IPA). Use as -Mipa=fast,inline for IPA with automatic procedure inlining. This enables a two-pass compilation and linking.

-Mpfi & -Mpfo: Enable profile-guided optimization. -Mpfi builds an instrumented binary that collects profile data when it runs. -Mpfo then uses the collected data to guide optimization when the code is rebuilt.
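As an illustration of where profile data helps, the C sketch below has a branch whose outcome depends entirely on the input, which the compiler cannot predict at compile time. The file names pgo_demo.c and main.c and the function sum_above are hypothetical; the comment shows the instrument, run, rebuild sequence, assuming main.c is a driver that exercises the function on representative input.

```c
/* pgo_demo.c (hypothetical). Profile-guided build; flags go on compile and link:
 *   1. pgcc -fast -Mpfi -o app pgo_demo.c main.c   (instrumented build)
 *   2. ./app                                        (run on representative input)
 *   3. pgcc -fast -Mpfo -o app pgo_demo.c main.c   (rebuild using the profile)
 */
#include <stddef.h>

/* Adds the elements above `threshold` and subtracts 1 for every other
 * element. How often the branch is taken depends only on the data, so the
 * profile from step 2 is what lets the compiler learn which path is hot. */
long sum_above(const int *v, size_t n, int threshold)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (v[i] > threshold)
            total += v[i];
        else
            total -= 1;
    }
    return total;
}
```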

-Mvect=sse: Enable vectorization of code using SSE instructions.
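A minimal C sketch of a loop that is a good candidate for -Mvect=sse (or -fastsse): unit-stride accesses, no dependence between iterations, and C99 restrict pointers promising that x and y do not overlap. The file and function names are illustrative only, and depending on your pgcc version you may need to enable C99 mode for restrict.

```c
/* saxpy.c (hypothetical): compile with e.g.  pgcc -fast -Mvect=sse -c saxpy.c */
#include <stddef.h>

/* y[i] = a * x[i] + y[i]. Each iteration touches only element i, and
 * `restrict` rules out aliasing between x and y, so the compiler is free
 * to process several elements per SSE instruction. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```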

-O<level>: Set the optimization level. 4 is the highest level and applies the most aggressive techniques.

-tp=<target type>: Optimize code for the target processor. Top choices: barcelona, barcelona-64, amd64, amd64e, core2, core2-64.

-Munroll: Enable loop unrolling.

-Mconcur: Automatically parallelize loops.
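-Mconcur can only parallelize loops whose iterations the compiler can prove are independent. The hypothetical matrix-vector product below is such a case: each outer iteration writes a different y[i], and restrict rules out overlap between the output and the inputs. Whether the compiler actually parallelizes it, and whether that pays off, depends on n and on the compiler version; main.c in the comment stands for your own driver.

```c
/* matvec.c (hypothetical): compile and link with e.g.
 *   pgcc -fast -Mconcur -o app matvec.c main.c
 */
#include <stddef.h>

/* y = A * x for a row-major n x n matrix A. The outer loop has no
 * loop-carried dependence, so it is a candidate for autoparallelization. */
void matvec(size_t n, const double *restrict a,
            const double *restrict x, double *restrict y)
{
    for (size_t i = 0; i < n; i++) {
        double s = 0.0;
        for (size_t j = 0; j < n; j++)
            s += a[i * n + j] * x[j];
        y[i] = s;
    }
}
```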

-Minline: Inline functions automatically. You can also give the names of specific functions to inline.
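Inlining pays off most for small functions called inside hot loops. In the sketch below (names hypothetical), inlining sq() lets the compiler drop the per-element call and optimize the loop as a whole. The -Minline=name:sq spelling in the comment is our reading of the PGI name: sub-option for naming a function; check pgcc -help for what your version accepts.

```c
/* sumsq.c (hypothetical): compile with e.g.
 *   pgcc -fast -Minline -c sumsq.c            (automatic inlining)
 *   pgcc -fast -Minline=name:sq -c sumsq.c    (name the function; check syntax)
 */
#include <stddef.h>

/* Tiny helper: the call overhead is large compared to the work it does. */
static double sq(double v)
{
    return v * v;
}

/* Once sq() is inlined, the optimizer sees a plain loop with no calls,
 * which it can unroll and schedule as a single unit. */
double sum_squares(const double *v, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += sq(v[i]);
    return total;
}
```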

-mp: Enable recognition of OpenMP directives.
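With -mp, the compiler acts on OpenMP directives such as the parallel-for reduction in this sketch (file and function names hypothetical, main.c stands for your driver). The loop index is a signed long because OpenMP versions before 3.0 require a signed integer loop variable. The _OPENMP guard is optional: it only keeps builds without -mp free of unknown-pragma warnings, and in that case the loop simply runs serially.

```c
/* dot.c (hypothetical): compile and link with e.g.
 *   pgcc -fast -mp -o app dot.c main.c
 * and choose the thread count at run time via OMP_NUM_THREADS. */

/* Dot product. Under OpenMP each thread accumulates a private partial
 * sum, and the reduction clause combines them at the end of the loop. */
double dot(const double *x, const double *y, long n)
{
    double sum = 0.0;
#ifdef _OPENMP
#pragma omp parallel for reduction(+:sum)
#endif
    for (long i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}
```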

-Mloop32: Align innermost loops on a 32-byte boundary on Barcelona processors. Small loops run faster with this flag on Barcelona.

