Archive for November, 2007

Memory Latency on AMD Opteron 2354

Tuesday, November 6th, 2007

In the continuing posts regarding our benchmarking exercise, we now share the memory latencies on AMD Opteron 2354.

The setup is essentially the same as described in the previous posts, so I will not repeat the details here.

LMBench 3.0 Alpha 8 was used to measure the memory latencies.

Here are the numbers:

L1 cache: 1.366 ns

L2 cache: 5.99 ns

Main Memory: 89.1 ns

Random Memory: 184.0 ns

The latencies look good so far. The main memory latency is a little higher than that of the AMD Opteron 22xx series. However, the Opteron 23xx series has an additional shared L3 cache of 2 MB. From other reviews on the web, it looks like this additional L3 cache adds to the latency.

It's the first cut … More numbers to come soon.
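To put the figures in perspective, latencies can be converted from nanoseconds to core clock cycles. The sketch below assumes a 2.2 GHz core clock for the Opteron 2354 (the clock speed is not stated in this post) and takes all four figures as nanoseconds; the names `CORE_CLOCK_GHZ` and `ns_to_cycles` are illustrative only.

```python
# Convert the measured latencies from nanoseconds to core clock cycles.
# ASSUMPTION: 2.2 GHz core clock for the Opteron 2354 (not stated in the post).
CORE_CLOCK_GHZ = 2.2

# Figures from the post, all taken as nanoseconds.
latencies_ns = {
    "L1 cache": 1.366,
    "L2 cache": 5.99,
    "Main memory": 89.1,
    "Random memory": 184.0,
}


def ns_to_cycles(ns, clock_ghz=CORE_CLOCK_GHZ):
    """A clock of X GHz ticks X times per nanosecond, so cycles = ns * GHz."""
    return ns * clock_ghz


cycles = {level: ns_to_cycles(ns) for level, ns in latencies_ns.items()}

for level, ns in latencies_ns.items():
    print(f"{level:14s} {ns:8.3f} ns  ~{cycles[level]:6.1f} cycles")
```

Under that clock assumption, the L1 figure works out to roughly 3 cycles and the L2 figure to roughly 13 cycles.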

Memory Bandwidth on AMD Opteron 2354

Monday, November 5th, 2007

We got our hands on a new mainboard supporting the split plane (Dual Dynamic Power Management) feature of AMD Opteron quad core (Barcelona) processors. Earlier mainboards support Barcelona fully, but not the split plane feature. As a result, the memory controller on the Barcelona and the L2 cache run at a slower clock on those boards than on a split plane board. A slower clock rate implies lower memory bandwidth and increased latency compared to the same processor on a split plane board.

This is a great opportunity to test what improvement the split plane offers in terms of memory performance.

The test system is set up as follows:

HPC Systems, Inc. A1204

Dual AMD Opteron 2354

8 X 1 GB DDR2 667 MHz


Western Digital 250 GB SATA hard drive

SUN Studio 12

STREAM benchmark

Problem size: N = 20000000

Compiler command used

suncc -fast -xO4 -xprefetch -xprefetch_level=3 -xvector=simd -xarch=sse3 -xdepend  -m64 -xopenmp -o stream.big ../stream.c  -xlic_lib=sunperf -I../

Performance for 1 thread (compiled without -xopenmp flag):

Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:        5724.3136       0.0559       0.0559       0.0559
Scale:       6077.3024       0.0527       0.0527       0.0527
Add:         5692.4606       0.0843       0.0843       0.0844
Triad:       5696.1831       0.0843       0.0843       0.0843
Solution Validates
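For readers unfamiliar with how STREAM arrives at the Rate column, here is a small sketch that recomputes it from the problem size and the Min time column. It assumes 8-byte doubles and the usual STREAM byte counts (two arrays touched for Copy and Scale, three for Add and Triad); the helper name `stream_rate_mb_s` is hypothetical.

```python
# How STREAM derives "Rate (MB/s)" from the problem size and the best time.
N = 20_000_000          # problem size from the post
BYTES_PER_DOUBLE = 8    # ASSUMPTION: arrays of 8-byte doubles

# Number of arrays each kernel reads or writes per pass.
ARRAYS_TOUCHED = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}


def stream_rate_mb_s(kernel, min_time_s, n=N):
    """MB/s as STREAM reports it: 1e-6 * bytes moved / minimum time."""
    bytes_moved = ARRAYS_TOUCHED[kernel] * BYTES_PER_DOUBLE * n
    return 1.0e-6 * bytes_moved / min_time_s


# Min times and rates from the 1-thread table above. The times are printed
# with only 4 decimals, so recomputed rates match only approximately.
min_times = {"Copy": 0.0559, "Scale": 0.0527, "Add": 0.0843, "Triad": 0.0843}
reported = {"Copy": 5724.3136, "Scale": 6077.3024,
            "Add": 5692.4606, "Triad": 5696.1831}

for kernel, t in min_times.items():
    print(f"{kernel:6s} recomputed {stream_rate_mb_s(kernel, t):8.1f} MB/s, "
          f"reported {reported[kernel]:8.1f} MB/s")

# Combined size of the three arrays, in MB (far larger than any cache here).
footprint_mb = 3 * BYTES_PER_DOUBLE * N / 1e6
print(f"Array footprint: {footprint_mb:.0f} MB")
```

The recomputed rates agree with the reported ones to within a fraction of a percent; the small differences come from the rounding of the printed times.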

We did see a higher bandwidth number with the PGI compilers … close to 6.5 GB/s, but we are unable to post the result because the license has expired for the binaries compiled with PGI compilers.

Performance for 4 threads:

Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:       12230.5392       0.0262       0.0262       0.0262
Scale:      12099.2614       0.0265       0.0264       0.0265
Add:        11536.8169       0.0417       0.0416       0.0417
Triad:      11543.9895       0.0417       0.0416       0.0418
Solution Validates

Performance for 8 threads:

Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:       17516.0718       0.0183       0.0183       0.0183
Scale:      17382.8602       0.0184       0.0184       0.0185
Add:        16455.8826       0.0292       0.0292       0.0293
Triad:      16519.7865       0.0291       0.0291       0.0291
Solution Validates

From the numbers, we seem to have hit the same performance as advertised on the AMD web site.

The peak bandwidth of a 2P AMD Opteron system is 21.2 GB/s. We achieved a sustained 17.5 GB/s, i.e. about 82% of peak.
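The arithmetic behind the peak and the sustained fraction can be sketched as follows. The memory parameters are assumptions not spelled out in the post: DDR2-667 performing 667 million transfers per second, 8 bytes per transfer per channel, and two memory channels per socket.

```python
# Sustained vs. peak memory bandwidth for the 2P system.
# ASSUMPTIONS (not in the post): 667e6 transfers/s for DDR2-667,
# 8-byte-wide channels, and 2 channels per socket.
TRANSFERS_PER_S = 667e6
BYTES_PER_TRANSFER = 8
CHANNELS_PER_SOCKET = 2
SOCKETS = 2


def peak_gb_s(sockets):
    """Theoretical peak in GB/s under the assumptions above."""
    return (TRANSFERS_PER_S * BYTES_PER_TRANSFER
            * CHANNELS_PER_SOCKET * sockets) / 1e9


derived_peak = peak_gb_s(SOCKETS)   # roughly 21.3 GB/s
quoted_peak = 21.2                  # peak figure quoted in the post
best_copy = 17516.0718 / 1000.0     # 8-thread Copy rate, MB/s -> GB/s

efficiency = best_copy / quoted_peak

print(f"Derived peak: {derived_peak:.1f} GB/s, quoted peak: {quoted_peak} GB/s")
print(f"Sustained:    {best_copy:.1f} GB/s ({efficiency:.1%} of quoted peak)")
```

The derived peak lands within about 1% of the quoted 21.2 GB/s, so the sustained fraction is essentially the same with either figure.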

Here are the results with only one socket populated. This exercise is important because it takes two variables out of the picture: how memory is allocated across sockets, and on which socket each thread is scheduled.

Performance for 1 thread (compiled without -xopenmp flag):

Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:        6256.7322       0.0528       0.0527       0.0528
Scale:       6417.2126       0.0499       0.0499       0.0499
Add:         6306.9054       0.0761       0.0761       0.0762
Triad:       6333.5465       0.0758       0.0758       0.0758
Solution Validates

Performance for 4 threads :

Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:        9148.0695       0.0350       0.0350       0.0351
Scale:       9080.6064       0.0353       0.0352       0.0353
Add:         8510.1783       0.0565       0.0564       0.0565
Triad:       8511.8559       0.0564       0.0564       0.0565
Solution Validates

That is about 9.1 GB/s sustained against a peak of 10.6 GB/s (half of the 21.2 GB/s 2P peak quoted above), i.e. about 86% efficiency.
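One way to compare the two configurations is to look at how Copy bandwidth scales with thread count relative to the single-thread run. The sketch below only rearranges the Copy rates from the tables above; the `speedups` helper is illustrative, and note that the 1-thread binaries were compiled without -xopenmp, so the baseline is not strictly like-for-like.

```python
# Copy rates (MB/s) from the tables above, keyed by thread count.
copy_mb_s = {
    "two sockets": {1: 5724.3136, 4: 12230.5392, 8: 17516.0718},
    "one socket": {1: 6256.7322, 4: 9148.0695},
}


def speedups(rates):
    """Bandwidth of each run relative to the 1-thread run."""
    base = rates[1]
    return {threads: rate / base for threads, rate in rates.items()}


scaling = {config: speedups(rates) for config, rates in copy_mb_s.items()}

for config, by_threads in scaling.items():
    for threads, factor in sorted(by_threads.items()):
        print(f"{config:12s} {threads} thread(s): {factor:4.2f}x of 1-thread Copy")
```

On one socket, four threads deliver only about 1.46 times the single-thread Copy rate, which suggests that a single thread already draws a large share of what one memory controller can supply.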

PGI Compiler 7.1 (7.1-1) and bundled ACML for Barcelona

Thursday, November 1st, 2007

I am using PGI 7.1 compilers for my benchmarking exercise. The compiler includes a bundled ACML version and supports AMD Opteron Quadcore Barcelona. Naturally, I did not think twice and started linking with the ACML provided with the compiler.

The best DGEMM number I got was about 53% of the peak. That does not seem right. However, the same ACML version did provide a DGEMM value as high as 87% on AMD Opteron dual core.  

After wasting some time and effort, I downloaded the ACML from AMD Developer Central. Linking BLASBench with this new ACML, I was able to get a DGEMM value that was about 87% of the peak.

Maybe this post will save you some time if you are using ACML with PGI compilers.

Please note: if you link with the C compiler, pgcc, against the ACML from the AMD developer site, you need to pass the following libraries to the linker: -lrt -lpgftnrtl

-lpgftnrtl links the FORTRAN runtime with the code.

If you link with the FORTRAN compiler, pgf77, -lpgftnrtl is not needed.

If you link with the FORTRAN compiler but main() is in a C file, pass -Mnomain to the linker.

Omitting -Mnomain produces the following error:

bb.o: In function `main':
bb.c:(.text+0xde0): multiple definition of `main'
/opt/pgi/linux86-64/7.1-1/lib/pgfmain.o:pgfmain.c:(.text+0x0): first defined here
/usr/bin/ld: Warning: size of symbol `main' changed from 79 in /opt/pgi/linux86-64/7.1-1/lib/pgfmain.o to 13982 in bb.o
/opt/pgi/linux86-64/7.1-1/lib/pgfmain.o: In function `main':
pgfmain.c:(.text+0x34): undefined reference to `MAIN_'

Using the C compiler, pgcc, to link the code without providing -lpgftnrtl results in the following error:

/opt/acml4.0.0/pgi64/lib/libacml.a(dgemv.o): In function `dgemv.pgi.uni.1_':
dgemv.F:(.text+0x508): undefined reference to `ftn_str_index'
/opt/acml4.0.0/pgi64/lib/libacml.a(dgemv.o): In function `dgemv.pgi.uni.2_':
dgemv.F:(.text+0x1518): undefined reference to `ftn_str_index'
/opt/acml4.0.0/pgi64/lib/libacml.a(sgemv.o): In function `sgemv.pgi.uni.1_':
sgemv.F:(.text+0x4eb): undefined reference to `ftn_str_index'
/opt/acml4.0.0/pgi64/lib/libacml.a(sgemv.o): In function `sgemv.pgi.uni.2_':
sgemv.F:(.text+0xfd0): undefined reference to `ftn_str_index'
/opt/acml4.0.0/pgi64/lib/libacml.a(xerbla.o): In function `xerbla.pgi.uni.1_':
xerbla.f:(.text+0x5f): undefined reference to `fio_src_info'
xerbla.f:(.text+0x74): undefined reference to `fio_fmtw_init'
xerbla.f:(.text+0x90): undefined reference to `fio_fmt_write'
xerbla.f:(.text+0xa3): undefined reference to `fio_sc_fmt_write'
xerbla.f:(.text+0xa8): undefined reference to `fio_fmtw_end'
xerbla.f:(.text+0xb1): undefined reference to `ftn_stop'
xerbla.f:(.text+0xe2): undefined reference to `ftn_strcmp'
xerbla.f:(.text+0x11b): undefined reference to `fio_src_info'
xerbla.f:(.text+0x139): undefined reference to `fio_fmtr_intern_init'
xerbla.f:(.text+0x152): undefined reference to `fio_fmt_read'
xerbla.f:(.text+0x16b): undefined reference to `fio_fmt_read'
xerbla.f:(.text+0x184): undefined reference to `fio_fmt_read'
xerbla.f:(.text+0x19d): undefined reference to `fio_fmt_read'
xerbla.f:(.text+0x1a2): undefined reference to `fio_fmtr_end'
xerbla.f:(.text+0x1fe): undefined reference to `fio_src_info'
xerbla.f:(.text+0x215): undefined reference to `fio_fmtw_init'
xerbla.f:(.text+0x228): undefined reference to `fio_sc_fmt_write'
xerbla.f:(.text+0x240): undefined reference to `fio_sc_fmt_write'
xerbla.f:(.text+0x245): undefined reference to `fio_fmtw_end'
xerbla.f:(.text+0x25d): undefined reference to `fio_src_info'
xerbla.f:(.text+0x274): undefined reference to `fio_fmtw_init'
xerbla.f:(.text+0x287): undefined reference to `fio_sc_fmt_write'
xerbla.f:(.text+0x28c): undefined reference to `fio_fmtw_end'
xerbla.f:(.text+0x2a7): undefined reference to `ftn_strcmp'
xerbla.f:(.text+0x2c1): undefined reference to `fio_src_info'
xerbla.f:(.text+0x2d8): undefined reference to `fio_fmtw_init'
xerbla.f:(.text+0x2f4): undefined reference to `fio_fmt_write'
xerbla.f:(.text+0x310): undefined reference to `fio_fmt_write'
xerbla.f:(.text+0x315): undefined reference to `fio_fmtw_end'
xerbla.f:(.text+0x34f): undefined reference to `fio_src_info'
xerbla.f:(.text+0x36d): undefined reference to `fio_fmtw_intern_init'
xerbla.f:(.text+0x385): undefined reference to `fio_sc_fmt_write'
xerbla.f:(.text+0x39d): undefined reference to `fio_sc_fmt_write'
xerbla.f:(.text+0x3b5): undefined reference to `fio_sc_fmt_write'
xerbla.f:(.text+0x3cd): undefined reference to `fio_sc_fmt_write'
xerbla.f:(.text+0x3d2): undefined reference to `fio_fmtw_end'
/opt/acml4.0.0/pgi64/lib/libacml.a(xerbla.o): In function `xerbla.pgi.uni.2_':
xerbla.f:(.text+0x46f): undefined reference to `fio_src_info'
xerbla.f:(.text+0x484): undefined reference to `fio_fmtw_init'
xerbla.f:(.text+0x4a0): undefined reference to `fio_fmt_write'
xerbla.f:(.text+0x4b3): undefined reference to `fio_sc_fmt_write'
xerbla.f:(.text+0x4b8): undefined reference to `fio_fmtw_end'
xerbla.f:(.text+0x4c1): undefined reference to `ftn_stop'
xerbla.f:(.text+0x4f2): undefined reference to `ftn_strcmp'
xerbla.f:(.text+0x52b): undefined reference to `fio_src_info'
xerbla.f:(.text+0x549): undefined reference to `fio_fmtr_intern_init'
xerbla.f:(.text+0x562): undefined reference to `fio_fmt_read'
xerbla.f:(.text+0x57b): undefined reference to `fio_fmt_read'
xerbla.f:(.text+0x594): undefined reference to `fio_fmt_read'
xerbla.f:(.text+0x5ad): undefined reference to `fio_fmt_read'
xerbla.f:(.text+0x5b2): undefined reference to `fio_fmtr_end'
xerbla.f:(.text+0x60e): undefined reference to `fio_src_info'
xerbla.f:(.text+0x625): undefined reference to `fio_fmtw_init'
xerbla.f:(.text+0x638): undefined reference to `fio_sc_fmt_write'
xerbla.f:(.text+0x650): undefined reference to `fio_sc_fmt_write'
xerbla.f:(.text+0x655): undefined reference to `fio_fmtw_end'
xerbla.f:(.text+0x66d): undefined reference to `fio_src_info'
xerbla.f:(.text+0x684): undefined reference to `fio_fmtw_init'
xerbla.f:(.text+0x697): undefined reference to `fio_sc_fmt_write'
xerbla.f:(.text+0x69c): undefined reference to `fio_fmtw_end'
xerbla.f:(.text+0x6b7): undefined reference to `ftn_strcmp'
xerbla.f:(.text+0x6d1): undefined reference to `fio_src_info'
xerbla.f:(.text+0x6e8): undefined reference to `fio_fmtw_init'
xerbla.f:(.text+0x704): undefined reference to `fio_fmt_write'
xerbla.f:(.text+0x720): undefined reference to `fio_fmt_write'
xerbla.f:(.text+0x725): undefined reference to `fio_fmtw_end'
xerbla.f:(.text+0x75f): undefined reference to `fio_src_info'
xerbla.f:(.text+0x77d): undefined reference to `fio_fmtw_intern_init'
xerbla.f:(.text+0x795): undefined reference to `fio_sc_fmt_write'
xerbla.f:(.text+0x7ad): undefined reference to `fio_sc_fmt_write'
xerbla.f:(.text+0x7c5): undefined reference to `fio_sc_fmt_write'
xerbla.f:(.text+0x7dd): undefined reference to `fio_sc_fmt_write'
xerbla.f:(.text+0x7e2): undefined reference to `fio_fmtw_end'
/opt/acml4.0.0/pgi64/lib/libacml.a(dgeblkmatS.o): In function `dgeblkmats.pgi.uni.1_':
dgeblkmatS.f:(.text+0x80): undefined reference to `ftn_str_index'
/opt/acml4.0.0/pgi64/lib/libacml.a(dgeblkmatS.o): In function `dgeblkmats.pgi.uni.2_':
dgeblkmatS.f:(.text+0x480): undefined reference to `ftn_str_index'
/opt/acml4.0.0/pgi64/lib/libacml.a(sgeblk2matS.o): In function `sgeblk2mats.pgi.uni.1_':
sgeblk2matS.f:(.text+0x7b): undefined reference to `ftn_str_index'
/opt/acml4.0.0/pgi64/lib/libacml.a(sgeblk2matS.o): In function `sgeblk2mats.pgi.uni.2_':
sgeblk2matS.f:(.text+0x50b): undefined reference to `ftn_str_index'
child process exit status 1: /usr/bin/ld

Compiler Optimizations for AMD Opteron Quadcore Barcelona with PGI compilers

Thursday, November 1st, 2007

As a part of our benchmarking exercise of A5808-32, we started using PGI compilers. After a number of experiments, the following compiler switches seem to give the best performance. Unless otherwise specified, the flags must be provided during both the compilation and the linking phases.

-fast: The usual macro for starters. -fast implies -fastsse on 64-bit platforms

-fastsse: Enable SIMD operations.

-Mipa: Enable interprocedural optimizations. Use as: -Mipa=fast,inline for IPA and automatic procedure inlining. This enables a two-pass compilation and linking.

-Mpfi & -Mpfo: Enable profile guided optimization. -Mpfi enables instrumentation. -Mpfo uses the data collected to guide the optimization.

-Mvect=sse: Enable vectorization of code using SSE

-O<level>: Set the optimization level. 4 is the highest level, with aggressive techniques

-tp=<target type>: Optimize code for the target processor. Top choices: barcelona, barcelona-64, amd64, amd64e, core2, core2-64

-Munroll: Enable loop unrolling

-Mconcur: Autoparallelize loops

-Minline: Inline functions automatically. One can also provide the name of the function to inline.

-mp: Enable recognition of OpenMP directives

-Mloop32: Align innermost loops on 32-byte boundaries on Barcelona processors. Small loops run faster with this flag on Barcelona.