Archive for the ‘Performance and Benchmarks’ Category

Kernel recompilation & HPC application performance

Friday, October 30th, 2009

Some questions never die, and kernel recompilation for improving the performance of an application is one of them. I have heard this question from users in various domains (CFD, seismic, financial, oil & gas, academic, bio-molecular modeling and so on). It always starts the same way.

“I think I should recompile the kernel of my cluster so I can have better performance. What do you think?”

And my answer is always “No”. It does sound logical … you compile your code with the best possible optimizations and you get better performance (in most cases, I should add). Why does it not apply to the kernel? After all, the kernel is what manages my processes and runs my system. It’s easy to start the debate this way but miss a key aspect.

Here are a few key questions to ask before you start on this (almost always) fruitless exercise:

  • How much time does my (scientific) code actually spend in the kernel while it is running?
  • How much of that time is actually spent doing something useful rather than waiting on something else (our good old friends, disk I/O and interrupt handling)?

With newer interconnects like Infiniband, which use user-level drivers and employ kernel bypass to drastically improve latencies (barring the initial setup time), how much performance improvement can you really expect from recompiling your kernel?

Kernel recompilation can also bring cluster management headaches:

  • Deploy the new kernel to every node in the cluster
  • Recompile your kernel every time a new security or performance related patch is released
  • Recompile your hardware drivers to match your new kernel
  • Stability and performance issues of drivers with your choice of compiler optimizations
  • Not knowing what areas of the kernel code are adversely affected by your choice of optimizations
  • And not to forget, some ISVs support their code only on certain kernels. Once you start using your ISV code on a different kernel, goodbye vendor support!

A more practical approach would be to look into the application code and optimize it, either through good old hand tuning, through performance libraries, or through straightforward compiler optimizations. Beware: if you are dealing with floating point and double precision arithmetic, you should tread carefully when using more aggressive compiler optimizations. Several compilers do not guarantee precision at higher optimization levels.
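The precision warning comes down to floating point addition not being associative: an optimizer that is allowed to regroup a sum can change the result. Here is a minimal sketch of the effect, written in Python purely for illustration (the values are arbitrary and not taken from any real application):

```python
# Floating point addition is not associative: regrouping the same
# three terms (as an aggressive optimizer may do) changes the result.
a, b, c = 0.1, 0.2, 0.3

left_to_right = (a + b) + c   # the order written in the source code
regrouped = a + (b + c)       # a mathematically equivalent regrouping

print(left_to_right)
print(regrouped)
print(left_to_right == regrouped)  # False: the two sums differ in the last bit
```

The difference here is tiny, but in a long-running simulation such differences can accumulate, so it is worth validating results after raising the optimization level.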

Use simple techniques like data decomposition, functional decomposition, overlapping computation & communication and pipelining to improve the efficiency of your available compute resources. This will yield a better return on investment, especially as we move into an increasingly many-core environment.
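As a rough illustration of data decomposition, here is a small Python sketch that splits an array into near-equal chunks and hands each chunk to a separate worker. The function names (partial_sum, decompose) and the workload are made up for this example. It only shows the structure of the technique; real HPC code would normally do this with MPI or OpenMP, and Python threads will not actually speed up CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    """Work done independently on one piece of the data."""
    return sum(x * x for x in chunk)

def decompose(data, parts):
    """Split data into `parts` contiguous chunks of near-equal size."""
    size, extra = divmod(len(data), parts)
    chunks, start = [], 0
    for i in range(parts):
        # The first `extra` chunks take one additional element each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks

data = list(range(1000))
chunks = decompose(data, 4)

# Each worker processes its own chunk; the partial results are then combined.
with ThreadPoolExecutor(max_workers=4) as pool:
    total = sum(pool.map(partial_sum, chunks))

print(total)
```

The point is that each chunk can be processed without touching the others, so the only communication needed is the final reduction.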

There is a paper on how profile-based optimization of the kernel yielded a significant performance improvement. More on that here.

And results from a recent article on Gentoo essentially show that for most applications and usage cases, it does not make much sense to compile and build your own kernel.

OFED 1.4 stack on RHEL 5.2

Friday, August 28th, 2009


I have been working with Infiniband since the first card came out from Topspin. My previous employer was a partner with Topspin for IB products. By then I had already worked with high speed interconnects like Myrinet, Scali (Dolphin Wulfkit) and, of course, multiple versions of PARAMNet, among countless others. Many have come and gone but Infiniband is here to stay.

Even with Cisco dropping out of Infiniband, strong support from QLogic, Voltaire and Mellanox will keep it going for a while. Cisco has no advantage with Infiniband; their core business is Ethernet and they need to do what they need to do to keep Ethernet the core interconnect for everything. Even though it makes sense for Cisco, HPC is not everything. It has never been in the category of everything else. The requirements of HPC interconnects are unique – low latency and high bandwidth are the heart and soul. Getting those two in a general purpose network would be nice, but who would pay for something they don’t need?

Coming to the main topic of this post: configuring ConnectX Infiniband on RHEL 5.2 x86_64 with OFED 1.4.

OFED is very well packaged and most of the time does not need additional work for installation. Here is the simple method:

  1. Download OFED
  2. Extract the files (tar -zxvf OFED-x.y.tgz)
  3. run the install script (
  4. For a non-HPC installation, menu choices 2-1 will suffice; for an HPC-specific installation, choose 2-2 or 2-3. You are pretty safe choosing 2-3. If you choose 2-2, some Infiniband diagnostic utilities won’t be installed. However, you will end up with HPC-specific packages like MPI.
  5. Make a note of the required packages; you can find almost all of them on the Red Hat disk. If you are registered with RHN, you can use yum to install them.
  6. At this point, the needed kernel modules (drivers & upper level protocols) should be installed.
  7. The installer will ask if you would like to configure IPoIB (IP tunneling over Infiniband). Say Y if you plan to use IPoIB and provide the IP addresses. If not, say N.
  8. Issue a reboot command and after the system reboots, check lsmod for the list of modules currently loaded
  9. You should see a list of kernel modules with names starting with ib_ (ib_cm, ib_core, ib_umad, etc)
  10. At this point, we can safely assume the drivers are loaded and the adapter is working. You can check the status of the installation using the diagnostics included with OFED. More on that below.
  11. We need a working Subnet Manager for the Infiniband fabric to work. If you are using a managed switch like the QLogic 9024, it generally includes an embedded Fabric Management component. If you are using an entry level switch without an embedded subnet manager, or you would like to run your own SM on a host system, you can use the OpenSM (OpenSubNetManager) component bundled with OFED. Start OpenSM using the command  /etc/init.d/opensmd start   NOTE: Until you have a working subnet manager, the adapters will not be able to do any useful work.
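The lsmod check in steps 8 and 9 is easy to script if you have many nodes to verify. Below is a minimal Python sketch; the helper name ib_modules is made up for this example, and the sample text is illustrative only (the sizes and use counts are invented, not taken from a real system).

```python
def ib_modules(lsmod_text):
    """Return names of loaded modules whose name starts with 'ib_'."""
    names = []
    for line in lsmod_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if fields and fields[0].startswith("ib_"):
            names.append(fields[0])
    return names

# Illustrative sample only: sizes and use counts are made up.
sample = """Module                  Size  Used by
ib_umad                 50000  0
ib_cm                   70000  1 ib_ipoib
ib_core                100000  3 ib_umad,ib_cm,mlx4_ib
mlx4_core              150000  1 mlx4_ib
"""

loaded = ib_modules(sample)
print(loaded)
```

On a real node you would feed it the actual output of lsmod instead of the sample string. An empty result after the reboot suggests the drivers did not load.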


OFED comes with some basic diagnostic commands that can be used to test the status of the cards in your system. One of them is ibv_devinfo. This command prints the adapter status and attributes.

[root@localhost ~]# ibv_devinfo
hca_id: mlx4_0
        fw_ver:                         2.3.000
        node_guid:                      0030:48ff:ff95:d928
        sys_image_guid:                 0030:48ff:ff95:d92b
        vendor_id:                      0x02c9
        vendor_part_id:                 25418
        hw_ver:                         0xA0
        board_id:                       SM_1021000001
        phys_port_cnt:                  2
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                2048 (4)
                        active_mtu:             2048 (4)
                        sm_lid:                 1
                        port_lid:               1
                        port_lmc:               0x00

                port:   2
                        state:                  PORT_DOWN (1)
                        max_mtu:                2048 (4)
                        active_mtu:             2048 (4)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00

In the above output, check the port “state”. When you have a working subnet manager, it will show up as PORT_ACTIVE or PORT_UP. Without a working subnet manager, it will show up as PORT_INIT or POLLING.

The state is shown as PORT_DOWN when there is no cable connected to the port.
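If you need to check the port state on many nodes, the ibv_devinfo output is simple enough to parse. Here is a minimal Python sketch, assuming the output layout shown above; the helper name port_states is made up for this example, and the sample is a shortened copy of the output from this post.

```python
import re

def port_states(devinfo_text):
    """Map port number -> state name from ibv_devinfo output."""
    states, current = {}, None
    for line in devinfo_text.splitlines():
        m = re.match(r"\s*port:\s*(\d+)", line)
        if m:
            current = int(m.group(1))  # remember which port we are inside
            continue
        m = re.match(r"\s*state:\s*(\w+)", line)
        if m and current is not None:
            states[current] = m.group(1)
    return states

# Shortened copy of the ibv_devinfo output shown in this post.
sample = """hca_id: mlx4_0
        phys_port_cnt:                  2
                port:   1
                        state:                  PORT_ACTIVE (4)
                port:   2
                        state:                  PORT_DOWN (1)
"""

states = port_states(sample)
print(states)
```

A port reporting PORT_INIT would then point at a missing subnet manager, while PORT_DOWN points at the cabling.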

To list adapters in the system:

[root@localhost ~]# ibv_devices
    device                 node GUID
    ------              ----------------
    mlx4_0              003048ffff95d928

Once you have a working subnet manager and at least two ports showing up as “PORT_ACTIVE” on at least two machines, you can test the fabric using the simple pingpong or sendrecv test routines.

Start ibv_rc_pingpong on one machine.

Start ibv_rc_pingpong <host name or ip> on another machine. The host name should be that of the first machine, where the command was started first.

If everything is working as it should, you should see the following output:

First host:

[root@localhost x86_64]# ibv_rc_pingpong
  local address:  LID 0x0002, QPN 0x00004a, PSN 0x43da29
  remote address: LID 0x0001, QPN 0x00004a, PSN 0x446364
8192000 bytes in 0.01 seconds = 6202.54 Mbit/sec
1000 iters in 0.01 seconds = 10.57 usec/iter


Second Host:

[root@localhost ~]# ibv_rc_pingpong
  local address:  LID 0x0001, QPN 0x00004a, PSN 0x446364
  remote address: LID 0x0002, QPN 0x00004a, PSN 0x43da29
8192000 bytes in 0.01 seconds = 6172.16 Mbit/sec
1000 iters in 0.01 seconds = 10.62 usec/iter

Depending on the type of card, cable, switch, OS, board chipset and PCI expansion slot you use, your bandwidth and latency will vary significantly. And this is only a functional test and is not a test for best bandwidth and latency.
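The two summary lines printed by ibv_rc_pingpong describe the same measurement, and you can sanity check one against the other. A small Python sketch using the figures from the first host's output above:

```python
# Figures copied from the first host's output above.
total_bytes = 8192000
iterations = 1000
usec_per_iter = 10.57

# Total elapsed time implied by the per-iteration latency.
elapsed_s = iterations * usec_per_iter / 1e6

# Bits transferred, divided by seconds, expressed in Mbit/sec.
mbit_per_s = total_bytes * 8 / elapsed_s / 1e6

print(round(elapsed_s, 4))
print(round(mbit_per_s))
```

This gives roughly 6200 Mbit/sec, close to the 6202.54 Mbit/sec reported. The small gap comes from the rounding of the printed usec/iter value.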

Other diagnostic tools:

  1. ibstat – displays IB device status like firmware version, port state, GUIDs, etc (similar to ibv_devinfo)
  2. ibnetdiscover – discovers IB network topology
  3. ibhosts – shows IB nodes in topology
  4. ibchecknet – runs IB network validation
  5. ibping – ping IB address
  6. ibdatacounters – summary of ib ports

and more …

Performance Tests:

OFED bundles a few programs to test the bandwidth and latency of your Infiniband fabric.

Bandwidth test:

  1. start ib_read_bw on one machine
  2. start ib_read_bw <hostname or ip> on second machine

Latency Test:

  1. start ib_read_lat on one machine
  2. start ib_read_lat <hostname or ip> on second machine

Make sure power management is turned off before you run these tests.

In case of any problems, the first thing to check is the subnet manager, then the ibstat and ibchecknet tools.

Compiling BLACS with OpenMPI and GCC on RHEL 5 / CentOS 5

Wednesday, March 12th, 2008

I had some problems compiling BLACS with OpenMPI and GCC on RHEL 5 / CentOS 5. So, here is how I got it to compile and pass the tests successfully:

OpenMPI: 1.2.5

BLACS: 1.1 with MPIBLACS Patch 03 (Feb 24, 2000)

GCC: 4.1.2

F77 = gfortran

F90 = gfortran

CC = gcc

CXX = g++

Bmake file used: BMAKES/Bmake.MPI-LINUX

Changes made to Bmake:


#  -------------------------------------
#  Name and location of the MPI library.
#  -------------------------------------
MPIdir = /home/test/openmpi-install/
MPIINCdir = $(MPIdir)/include






#=========================== SECTION 3: COMPILERS ============================
#  The following macros specify compilers, linker/loaders, the archiver,
#  and their options.  Some of the fortran files need to be compiled with no
#  optimization.  This is the F77NO_OPTFLAG.  The usage of the remaining
#  macros should be obvious from the names.
F77            = $(MPIdir)/bin/mpif77
F77FLAGS       = $(F77NO_OPTFLAGS) -O3 -mtune=amdfam10 -march=amdfam10
F77LOADER      = $(F77)
CC             = $(MPIdir)/bin/mpicc
CCFLAGS        = -O3 -mtune=amdfam10 -march=amdfam10
CCLOADER       = $(CC)
Of special importance are the flags:



If INTFACE is not set correctly, make tester will fail with the following messages:

blacstest.o(.text+0x4c): In function `MAIN__':

: undefined reference to `blacs_pinfo_'

blacstest.o(.text+0x6e): In function `MAIN__':

: undefined reference to `blacs_get_'

blacstest.o(.text+0x8b): In function `MAIN__':

: undefined reference to `blacs_gridinit_'

blacstest.o(.text+0x94): In function `MAIN__':

More such errors follow.

If TRANSCOMM is not set correctly, make tester will complete successfully and you will also be able to execute the C interface tests successfully. When executing the FORTRAN interface tests, however, the following messages are seen:

BLACS WARNING 'No need to set message ID range due to MPI communicator.
'from {-1,-1}, pnum=1, Contxt=-1, on line 18 of file 'blacs_set_.c'.
BLACS WARNING 'No need to set message ID range due to MPI communicator.'
from {-1,-1}, pnum=3, Contxt=-1, on line 18 of file 'blacs_set_.c'.
BLACS WARNING 'No need to set message ID range due to MPI communicator.'
from {-1,-1}, pnum=0, Contxt=-1, on line 18 of file 'blacs_set_.c'.
BLACS WARNING 'No need to set message ID range due to MPI communicator.'
from {-1,-1}, pnum=2, Contxt=-1, on line 18 of file 'blacs_set_.c'.
[comp-pvfs-0-7.local:30119] *** An error occurred in MPI_Comm_group
[comp-pvfs-0-7.local:30118] *** An error occurred in MPI_Comm_group
[comp-pvfs-0-7.local:30118] *** on communicator MPI_COMM_WORLD
[comp-pvfs-0-7.local:30118] *** MPI_ERR_COMM: invalid communicator
[comp-pvfs-0-7.local:30119] *** on communicator MPI_COMM_WORLD
[comp-pvfs-0-7.local:30119] *** MPI_ERR_COMM: invalid communicator
[comp-pvfs-0-7.local:30119] *** MPI_ERRORS_ARE_FATAL (goodbye) 

VMWare ESX server on 32 cores

Tuesday, January 8th, 2008

Well, what a pleasant way to start the day!

After many successful tests yesterday (install & test, iSCSI & NFS), I populated the system with 8 quad core AMD Opteron processors and left for the day hoping to come back and fire up ESX on 32 cores.

And here it is …. ESX working like a charm on 32 cores.

VMWare ESX on 32 cores

I hope to get those VMMark numbers out soon!



Memory Latency on AMD Opteron 2354

Tuesday, November 6th, 2007

In the continuing posts regarding our benchmarking exercise, we now share the memory latencies on AMD Opteron 2354.

The setup is essentially the same as described in the previous posts, so I will not repeat the details here.

LMBench 3.0 Alpha 8 was used to measure the memory latencies.

Here are the numbers:

L1 cache: 1.366 ns

L2 cache: 5.99 ns

Main Memory: 89.1 ns

Random Memory: 184.0 ns

The latencies look good so far. The main memory latency is a little bit higher than the latency from the AMD Opteron 22xx series. However, the Opteron 23xx series has an additional shared L3 cache of 2 MB. From other reviews on the web, it looks like this additional L3 cache is adding the latency.
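Cache latencies are often easier to judge in clock cycles than in nanoseconds. Here is a quick conversion sketch in Python. Note the assumption: the 2.2 GHz core clock for the Opteron 2354 is not stated in this post and is supplied here for the calculation.

```python
# Assumption (not stated in the post): the Opteron 2354 core clock is 2.2 GHz.
CLOCK_GHZ = 2.2

# Cache latencies in nanoseconds, as measured above with LMBench.
latencies_ns = {"L1 cache": 1.366, "L2 cache": 5.99}

# 1 ns at 2.2 GHz corresponds to 2.2 clock cycles.
cycles = {level: ns * CLOCK_GHZ for level, ns in latencies_ns.items()}

for level, value in cycles.items():
    print(f"{level}: about {value:.1f} cycles")
```

Under that assumption the L1 latency works out to about 3 cycles and the L2 latency to about 13 cycles.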

It’s the first cut … More numbers to come soon.

memory bandwidth on AMD Opteron 2354

Monday, November 5th, 2007

We got our hands on a new mainboard supporting the split plane (Dual Dynamic Power Management) feature of AMD Opteron quad core (Barcelona) processors. The earlier mainboards do support Barcelona fully but not the split plane feature. Due to this, the memory controller on the Barcelona and the L2 cache run at a slower clock than on a split plane board. A slower clock rate implies lower memory bandwidth and increased latency compared to the same processor on a split plane board.

Well, this could be a great opportunity to test what improvements the split plane offers in terms of memory performance.

The test system is set up as follows:

HPC Systems, Inc. A1204

Dual AMD Opteron 2354

8 X 1 GB DDR2 667 MHz


Western Digital 250 GB SATA hard drive

SUN Studio 12

STREAM benchmark

Problem size: N = 20000000

Compiler command used

suncc -fast -xO4 -xprefetch -xprefetch_level=3 -xvector=simd -xarch=sse3 -xdepend  -m64 -xopenmp -o stream.big ../stream.c  -xlic_lib=sunperf -I../

Performance for 1 thread (compiled without -xopenmp flag):

Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:        5724.3136       0.0559       0.0559       0.0559
Scale:       6077.3024       0.0527       0.0527       0.0527
Add:         5692.4606       0.0843       0.0843       0.0844
Triad:       5696.1831       0.0843       0.0843       0.0843
Solution Validates

We did see a higher bandwidth number with PGI compilers … close to 6.5 GB/s, but we are unable to post the result because the license has expired for the binaries compiled with PGI compilers.

Performance for 4 threads:

Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:       12230.5392       0.0262       0.0262       0.0262
Scale:      12099.2614       0.0265       0.0264       0.0265
Add:        11536.8169       0.0417       0.0416       0.0417
Triad:      11543.9895       0.0417       0.0416       0.0418
Solution Validates

Performance for 8 threads:

Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:       17516.0718       0.0183       0.0183       0.0183
Scale:      17382.8602       0.0184       0.0184       0.0185
Add:        16455.8826       0.0292       0.0292       0.0293
Triad:      16519.7865       0.0291       0.0291       0.0291
Solution Validates

From the numbers, we seem to have hit the same performance as advertised on the AMD web site.

The peak bandwidth of a 2P AMD Opteron system is 21.2 GB/s. We achieved a sustained 17.5 GB/s, i.e. about 82% of peak.

Here are the results with only one socket populated. This exercise is important to eliminate the issues of how the memory is allocated across sockets and also the issue of threads scheduled on different sockets.

Performance for 1 thread (compiled without -xopenmp flag):

Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:        6256.7322       0.0528       0.0527       0.0528
Scale:       6417.2126       0.0499       0.0499       0.0499
Add:         6306.9054       0.0761       0.0761       0.0762
Triad:       6333.5465       0.0758       0.0758       0.0758
Solution Validates

Performance for 4 threads :

Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:        9148.0695       0.0350       0.0350       0.0351
Scale:       9080.6064       0.0353       0.0352       0.0353
Add:         8510.1783       0.0565       0.0564       0.0565
Triad:       8511.8559       0.0564       0.0564       0.0565
Solution Validates

That is about 9.1 GB/s sustained from a peak of 10.1 GB/s, i.e. 90% efficiency.
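For reference, the efficiency figures quoted in this post are simply sustained bandwidth divided by peak bandwidth. A short Python sketch, using the sustained and peak numbers exactly as stated above (the function name efficiency is just for this example):

```python
def efficiency(sustained_gb_s, peak_gb_s):
    """Sustained bandwidth as a fraction of theoretical peak."""
    return sustained_gb_s / peak_gb_s

# Sustained and peak figures as quoted in this post.
two_socket = efficiency(17.5, 21.2)
one_socket = efficiency(9.1, 10.1)

print(f"two sockets: {two_socket:.1%}")
print(f"one socket:  {one_socket:.1%}")
```

This reproduces the roughly 82% and 90% figures given in the text.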

PGI Compiler 7.1 (7.1-1) and bundled ACML for Barcelona

Thursday, November 1st, 2007

I am using PGI 7.1 compilers for my benchmarking exercise. The compiler includes an ACML version bundled with it and the compiler supports AMD Opteron Quadcore Barcelona. Naturally, I did not think twice and started linking with the ACML provided with the compiler.

The best DGEMM number I got was about 53% of the peak. That does not seem right. However, the same ACML version did provide a DGEMM value as high as 87% on AMD Opteron dual core.  

After wasting some time and effort, I downloaded the ACML from AMD Developer Central. Linking BLASBench with this new ACML, I was able to get a DGEMM value that was about 87% of the peak.
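To put "percent of peak" in context, here is a rough Python sketch of how a theoretical peak is usually computed. Everything in it is an assumption for illustration: the post does not state the clock speed or core count used, so the 2.2 GHz quad-core figure and the 4 double precision flops per cycle per core are hypothetical inputs, not measured values.

```python
def peak_gflops(cores, clock_ghz, flops_per_cycle):
    """Theoretical peak = cores x clock (GHz) x flops per cycle per core."""
    return cores * clock_ghz * flops_per_cycle

# Hypothetical inputs (not from the post): one quad-core processor at
# 2.2 GHz, with 4 double precision flops per cycle per core.
peak = peak_gflops(cores=4, clock_ghz=2.2, flops_per_cycle=4)

# The two fractions of peak mentioned in this post.
achieved = {fraction: fraction * peak for fraction in (0.53, 0.87)}

for fraction, gflops in achieved.items():
    print(f"{fraction:.0%} of {peak:.1f} GFLOPS is about {gflops:.1f} GFLOPS")
```

With those assumed inputs, the gap between 53% and 87% of peak is well over 10 GFLOPS per socket, which is why the choice of ACML build matters.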

Maybe this post will save you some time if you are using ACML with PGI compilers.

Please note: If you link with the C compiler, pgcc, against the ACML from the AMD developer site, you need to provide the following libraries to the linker: -lrt -lpgftnrtl

-lpgftnrtl links the FORTRAN runtime with the code.

If you use the FORTRAN compiler, pgf77, to link the code, you do not need to provide -lpgftnrtl.

If you are linking with the FORTRAN compiler but main() is in a C file, provide -Mnomain to the linker.

Omitting -Mnomain produces the following error:

bb.o: In function `main':
bb.c:(.text+0xde0): multiple definition of `main'
/opt/pgi/linux86-64/7.1-1/lib/pgfmain.o:pgfmain.c:(.text+0x0): first defined here
/usr/bin/ld: Warning: size of symbol `main' changed from 79 in /opt/pgi/linux86-64/7.1-1/lib/pgfmain.o to 13982 in bb.o
/opt/pgi/linux86-64/7.1-1/lib/pgfmain.o: In function `main':
pgfmain.c:(.text+0x34): undefined reference to `MAIN_'

Using the C compiler, pgcc, to link the code and failing to provide -lpgftnrtl will result in the following error:

/opt/acml4.0.0/pgi64/lib/libacml.a(dgemv.o): In function `dgemv.pgi.uni.1_':
dgemv.F:(.text+0x508): undefined reference to `ftn_str_index'
/opt/acml4.0.0/pgi64/lib/libacml.a(dgemv.o): In function `dgemv.pgi.uni.2_':
dgemv.F:(.text+0x1518): undefined reference to `ftn_str_index'
/opt/acml4.0.0/pgi64/lib/libacml.a(sgemv.o): In function `sgemv.pgi.uni.1_':
sgemv.F:(.text+0x4eb): undefined reference to `ftn_str_index'
/opt/acml4.0.0/pgi64/lib/libacml.a(sgemv.o): In function `sgemv.pgi.uni.2_':
sgemv.F:(.text+0xfd0): undefined reference to `ftn_str_index'
/opt/acml4.0.0/pgi64/lib/libacml.a(xerbla.o): In function `xerbla.pgi.uni.1_':
xerbla.f:(.text+0x5f): undefined reference to `fio_src_info'
xerbla.f:(.text+0x74): undefined reference to `fio_fmtw_init'
xerbla.f:(.text+0x90): undefined reference to `fio_fmt_write'
xerbla.f:(.text+0xa3): undefined reference to `fio_sc_fmt_write'
xerbla.f:(.text+0xa8): undefined reference to `fio_fmtw_end'
xerbla.f:(.text+0xb1): undefined reference to `ftn_stop'
xerbla.f:(.text+0xe2): undefined reference to `ftn_strcmp'
xerbla.f:(.text+0x11b): undefined reference to `fio_src_info'
xerbla.f:(.text+0x139): undefined reference to `fio_fmtr_intern_init'
xerbla.f:(.text+0x152): undefined reference to `fio_fmt_read'
xerbla.f:(.text+0x16b): undefined reference to `fio_fmt_read'
xerbla.f:(.text+0x184): undefined reference to `fio_fmt_read'
xerbla.f:(.text+0x19d): undefined reference to `fio_fmt_read'
xerbla.f:(.text+0x1a2): undefined reference to `fio_fmtr_end'
xerbla.f:(.text+0x1fe): undefined reference to `fio_src_info'
xerbla.f:(.text+0x215): undefined reference to `fio_fmtw_init'
xerbla.f:(.text+0x228): undefined reference to `fio_sc_fmt_write'
xerbla.f:(.text+0x240): undefined reference to `fio_sc_fmt_write'
xerbla.f:(.text+0x245): undefined reference to `fio_fmtw_end'
xerbla.f:(.text+0x25d): undefined reference to `fio_src_info'
xerbla.f:(.text+0x274): undefined reference to `fio_fmtw_init'
xerbla.f:(.text+0x287): undefined reference to `fio_sc_fmt_write'
xerbla.f:(.text+0x28c): undefined reference to `fio_fmtw_end'
xerbla.f:(.text+0x2a7): undefined reference to `ftn_strcmp'
xerbla.f:(.text+0x2c1): undefined reference to `fio_src_info'
xerbla.f:(.text+0x2d8): undefined reference to `fio_fmtw_init'
xerbla.f:(.text+0x2f4): undefined reference to `fio_fmt_write'
xerbla.f:(.text+0x310): undefined reference to `fio_fmt_write'
xerbla.f:(.text+0x315): undefined reference to `fio_fmtw_end'
xerbla.f:(.text+0x34f): undefined reference to `fio_src_info'
xerbla.f:(.text+0x36d): undefined reference to `fio_fmtw_intern_init'
xerbla.f:(.text+0x385): undefined reference to `fio_sc_fmt_write'
xerbla.f:(.text+0x39d): undefined reference to `fio_sc_fmt_write'
xerbla.f:(.text+0x3b5): undefined reference to `fio_sc_fmt_write'
xerbla.f:(.text+0x3cd): undefined reference to `fio_sc_fmt_write'
xerbla.f:(.text+0x3d2): undefined reference to `fio_fmtw_end'
/opt/acml4.0.0/pgi64/lib/libacml.a(xerbla.o): In function `xerbla.pgi.uni.2_':
xerbla.f:(.text+0x46f): undefined reference to `fio_src_info'
xerbla.f:(.text+0x484): undefined reference to `fio_fmtw_init'
xerbla.f:(.text+0x4a0): undefined reference to `fio_fmt_write'
xerbla.f:(.text+0x4b3): undefined reference to `fio_sc_fmt_write'
xerbla.f:(.text+0x4b8): undefined reference to `fio_fmtw_end'
xerbla.f:(.text+0x4c1): undefined reference to `ftn_stop'
xerbla.f:(.text+0x4f2): undefined reference to `ftn_strcmp'
xerbla.f:(.text+0x52b): undefined reference to `fio_src_info'
xerbla.f:(.text+0x549): undefined reference to `fio_fmtr_intern_init'
xerbla.f:(.text+0x562): undefined reference to `fio_fmt_read'
xerbla.f:(.text+0x57b): undefined reference to `fio_fmt_read'
xerbla.f:(.text+0x594): undefined reference to `fio_fmt_read'
xerbla.f:(.text+0x5ad): undefined reference to `fio_fmt_read'
xerbla.f:(.text+0x5b2): undefined reference to `fio_fmtr_end'
xerbla.f:(.text+0x60e): undefined reference to `fio_src_info'
xerbla.f:(.text+0x625): undefined reference to `fio_fmtw_init'
xerbla.f:(.text+0x638): undefined reference to `fio_sc_fmt_write'
xerbla.f:(.text+0x650): undefined reference to `fio_sc_fmt_write'
xerbla.f:(.text+0x655): undefined reference to `fio_fmtw_end'
xerbla.f:(.text+0x66d): undefined reference to `fio_src_info'
xerbla.f:(.text+0x684): undefined reference to `fio_fmtw_init'
xerbla.f:(.text+0x697): undefined reference to `fio_sc_fmt_write'
xerbla.f:(.text+0x69c): undefined reference to `fio_fmtw_end'
xerbla.f:(.text+0x6b7): undefined reference to `ftn_strcmp'
xerbla.f:(.text+0x6d1): undefined reference to `fio_src_info'
xerbla.f:(.text+0x6e8): undefined reference to `fio_fmtw_init'
xerbla.f:(.text+0x704): undefined reference to `fio_fmt_write'
xerbla.f:(.text+0x720): undefined reference to `fio_fmt_write'
xerbla.f:(.text+0x725): undefined reference to `fio_fmtw_end'
xerbla.f:(.text+0x75f): undefined reference to `fio_src_info'
xerbla.f:(.text+0x77d): undefined reference to `fio_fmtw_intern_init'
xerbla.f:(.text+0x795): undefined reference to `fio_sc_fmt_write'
xerbla.f:(.text+0x7ad): undefined reference to `fio_sc_fmt_write'
xerbla.f:(.text+0x7c5): undefined reference to `fio_sc_fmt_write'
xerbla.f:(.text+0x7dd): undefined reference to `fio_sc_fmt_write'
xerbla.f:(.text+0x7e2): undefined reference to `fio_fmtw_end'
/opt/acml4.0.0/pgi64/lib/libacml.a(dgeblkmatS.o): In function `dgeblkmats.pgi.uni.1_':
dgeblkmatS.f:(.text+0x80): undefined reference to `ftn_str_index'
/opt/acml4.0.0/pgi64/lib/libacml.a(dgeblkmatS.o): In function `dgeblkmats.pgi.uni.2_':
dgeblkmatS.f:(.text+0x480): undefined reference to `ftn_str_index'
/opt/acml4.0.0/pgi64/lib/libacml.a(sgeblk2matS.o): In function `sgeblk2mats.pgi.uni.1_':
sgeblk2matS.f:(.text+0x7b): undefined reference to `ftn_str_index'
/opt/acml4.0.0/pgi64/lib/libacml.a(sgeblk2matS.o): In function `sgeblk2mats.pgi.uni.2_':
sgeblk2matS.f:(.text+0x50b): undefined reference to `ftn_str_index'
child process exit status 1: /usr/bin/ld

Compiler Optimizations for AMD Opteron Quadcore Barcelona with PGI compilers

Thursday, November 1st, 2007

As a part of our benchmarking exercise of A5808-32, we started using PGI compilers. After a number of experiments, the following compiler switches seem to give the best performance. Unless otherwise specified, the flags must be provided during both the compilation and linking phases.

-fast: The usual macro for starters. -fast implies -fastsse on 64-bit platforms

-fastsse: Enable SIMD operations.

-Mipa: Enable interprocedural optimizations. Use as: -Mipa=fast,inline – IPA and automatic procedure inlining. This enables a two pass compilation and linking.

-Mpfi & -Mpfo: Enable profile guided optimization. -Mpfi enables instrumentation. -Mpfo uses the data collected to guide the optimization.

-Mvect=sse: Enable vectorization of code using SSE

-O<level>: 4 is the highest level of optimization with aggressive techniques

-tp=<target type>: Optimize code for the target processor. Top choices: barcelona, barcelona-64, amd64, amd64e, core2, core2-64

-Munroll: Enable loop unrolling

-Mconcur: Autoparallelize loops

-Minline: Inline functions automatically. One can also provide the name of the function to inline.

-mp: Enable recognizing OpenMP directives

-Mloop32: Align innermost loops on 32 byte boundary on Barcelona processors. Small loops run faster with this flag on Barcelona.

Benchmarking the A5808-32 Quad Core Opteron 8 socket 32 way

Thursday, October 25th, 2007

Our recent announcement of the updated configuration for A5808-32 received a good deal of interest. The new configuration supports up to 8 Quad core AMD Opteron (Barcelona) processors and 256 GB of DDR II 667 MHz memory modules. Effectively the 8 socket system is transformed into a 32-way Opteron system. It opens up a wide range of applications for a lot of our customers. For starters, it can become a desk side HPC system. We will shortly be announcing this system with a unique form factor that will make it even more attractive for desk side HPC applications.

We have been seeing some really interested customers who can’t wait to get their hands on one. It is good encouragement for us to continue working on high end high performance systems.

In order to demonstrate tangible benefits of the system configuration, we have decided to publish a number of industry standard benchmarks. As a part of this exercise, we have settled on the following benchmarks to port, optimize and execute on A5808-32.

  • STREAM – Measures memory bandwidth
  • LMBench – Measures bandwidth and latency for a number of common subsystems like memory, IO, network etc
  • SPEC CPU2006 – Measures CPU performance and system scalability for a range of real world applications
  • LLCBench – Measures sustained CPU performance, cache latencies and MPI benchmarks

We will be executing these benchmarks with various tools like GCC, SUN Studio 12, PGI and PathScale.

Keep watching for results.