Archive for the ‘Product in depth’ Category

Integrating a Cell-based system into ROCKS

Thursday, December 11th, 2008

After successfully installing Fedora 9 on a Cell-based system (a Mercury 1U dual Cell blade-based system), we had to integrate it into a ROCKS cluster.

ROCKS sends the appropriate kernel image by looking at the vendor-class-identifier information. The current DHCP configuration file supports only IA64 (EFI), x86_64, x86 and, of course, network switches. Although ROCKS no longer supports IA64 (Itanium), the code is still there.

The first task is to add the Cell system to the ROCKS database. We decided to add the node as a “Remote Management” appliance rather than as a compute node. Adding it as a compute node would modify the configuration files for SGE or PBS, and the node would always show up with a “down” status. To add the node, execute the following command:

insert-ethers --mac <give your mac id here>

When the insert-ethers UI shows up, select “Remote Management” and hit OK. You may also choose to provide your own hostname using the “--hostname” option.

The next task is to identify the vendor class identifier for the Cell system. After a quick test, it was determined that the system had no vendor class identifier. Since we were dealing with only one system, the best option was to match the MAC ID of the system with the following elsif block:

        } elsif ((binary-to-ascii(16,8,":",substring(hardware,1,6))="0:1a:64:e:2a:94")) {
                # Cell blade System
                filename "cellbe.img";

“cellbe.img” is the kernel image for the Cell system. This has to be copied to “/tftpboot/pxelinux/”.
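One detail worth noting about the elsif match: binary-to-ascii(16,8,":",...) prints each byte in hex without zero padding, which is why the string being compared is "0:1a:64:e:2a:94" rather than the usual zero-padded form. Here is a minimal Python sketch for converting a MAC address into that form. The helper name dhcpd_mac is made up for illustration, and the zero-padded input is simply the padded equivalent of the address above.

```python
def dhcpd_mac(mac: str) -> str:
    """Format a MAC address the way dhcpd's binary-to-ascii(16, 8, ":", ...) prints it."""
    octets = [int(part, 16) for part in mac.replace("-", ":").split(":")]
    if len(octets) != 6 or any(o < 0 or o > 255 for o in octets):
        raise ValueError(f"not a 6-byte MAC address: {mac!r}")
    # format(o, "x") gives lowercase hex with no leading zero, e.g. 0x0e -> "e".
    return ":".join(format(o, "x") for o in octets)


if __name__ == "__main__":
    print(dhcpd_mac("00:1A:64:0E:2A:94"))  # prints 0:1a:64:e:2a:94
```

Pasting a zero-padded MAC straight into the comparison would never match, so it is worth running the address through a conversion like this first.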

These changes will be lost if dhcpd.conf is overwritten, which happens every time you execute insert-ethers or use

dbreport dhcpd

to overwrite the file.

You could generate a patch file and patch dhcpd.conf as needed, or you could edit


to include the new elsif block every time the file is generated.
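As an alternative to a patch file, the re-apply step can be sketched as a small idempotent script. This is only a sketch under stated assumptions: the function name ensure_cell_block is hypothetical, and the anchor line "} else {" is a guess at how the generated if/elsif chain ends. Check your own dhcpd.conf and adjust the anchor before relying on it.

```python
# The block from this post. It starts with "}" because it closes the previous branch.
CELL_BLOCK = '''\
        } elsif ((binary-to-ascii(16,8,":",substring(hardware,1,6))="0:1a:64:e:2a:94")) {
                # Cell blade System
                filename "cellbe.img";
'''


def ensure_cell_block(conf_text: str, anchor: str = "} else {") -> str:
    """Insert CELL_BLOCK before the first anchor line unless it is already present.

    The anchor is an ASSUMPTION about the generated file, not something ROCKS documents.
    """
    if 'filename "cellbe.img";' in conf_text:
        return conf_text  # already patched, nothing to do
    lines = conf_text.splitlines(keepends=True)
    for i, line in enumerate(lines):
        if line.strip() == anchor:
            lines.insert(i, CELL_BLOCK)
            return "".join(lines)
    raise ValueError(f"anchor line {anchor!r} not found; dhcpd.conf left unchanged")


if __name__ == "__main__":
    sample = 'if (a) {\n    filename "x";\n} else {\n    filename "y";\n}\n'
    print(ensure_cell_block(sample))
```

Because the function returns the text unchanged when the block is already there, it can be run after every insert-ethers or dbreport call without duplicating the entry.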

If you see your Cell system trying to load


it means your dhcpd.conf file has been overwritten.


Code block to identify the vendor class identifier and other useful information:

               log(info, concat("Debug Information:\t",
               binary-to-ascii(16,8, ":", substring(hardware,1,6)),
               binary-to-ascii(10,8, "-", option dhcp-parameter-request-list),
               pick-first-value(option vendor-class-identifier,"no-identifier")));

Installing Fedora Core 9 and Cell SDK 3.1 on Cell Blade

Wednesday, December 3rd, 2008

We recently had a customer request a Cell Blade system to be integrated into their InfiniBand cluster. Since they wanted only one node, we suggested using the 1U dual Cell-based system. I am going to explain here the process of installing Fedora Core 9 on this system. This should also apply to other RedHat-based distributions.

If you are considering purchasing the 1U Dual Cell Based System from Mercury Systems, please note that they have humongous lead times. For the system we purchased, the lead time was about 16 weeks. Another important aspect is that this system comes with just two Cell processors and memory on board. Nothing else. No hard disk, no PCI slots. On-board video is available but is not supported by the system. If you are going to use any add-on cards, you will have to order the PCI expansion slots along with the system. To use disk storage, you will have to order a SAS disk with the system and the PCI riser card as well. This is something we overlooked; hopefully this will help someone else when purchasing from Mercury Systems.

Turning on your system: The system cannot be accessed via the regular KVM connections. The provided serial cable has to be connected to a standalone PC, and a utility like HyperTerminal or minicom has to be used to access the system console.

  • Start HyperTerminal or minicom and open the serial port connection.
  • Switch on the system.
  • You will see a lot of text go by. Press “s” a number of times to enter the firmware prompt. The system boots from the network by default.
  • Once the firmware prompt appears, you can choose which device to boot from
  • e.g. “boot net” to boot from the network
  • Two hotkeys F1 and F2 are available for entering the management system (BIOS)

System Installation: The Cell system (Mercury Systems 1U Dual Cell Blade Based System or IBM QS22) cannot boot from a disk; it can boot only from the network. This is actually a big inconvenience because neither FC9 nor RHEL 5.2 supports NFS-based (nfsroot) installs. This becomes sort of a chicken & egg problem: the Cell system can boot only from the network, but the OS does not support an NFS root install. YellowDog Linux 6.1 from Terrasoft (now Fixstars) advertises fast NFS root install support. There is a nice installation tutorial available for YDL here. The guide does not mention that the NFS root install is available only in the commercial version. After a good number of wasted hours trying to do an NFS root install with YDL, I gave up on it.

IBM Support has a nice page on how to install Fedora / RedHat based distributions on QS21 / QS22 using a USB disk.
Using the IBM Support page and a USB disk, I was able to finally get the system running. Here is the procedure for Fedora Core 9 PPC:

  • You will need a TFTP / DHCP server to install or a USB DVD ROM drive. Instructions on setting up TFTP / DHCP server can be found here.
  • Copy /images/netboot/ppc64.img to the TFTP root directory. This is the kernel the system will boot when using TFTP/DHCP setup. If you are using a DVD drive, just boot from the DVD. Make sure to check the boot order. By default network is the first boot device. You can force booting from the firmware prompt (pressing “s” while system is booting) using the command “boot
  • Get a nice USB hard disk. According to the IBM Support page, only IBM 80 GB USB & Lenovo 120 GB USB are supported. I am using a Western Digital 320 GB USB disk (My Book). I did face some issues with this, though nothing serious. More information on the workaround below.
  • At the firmware prompt, use “boot net vnc” to boot the system over the network.
  • Answer the installer prompts till the GUI starts
  • Now use a VNC client to connect to the installer using the IP provided by the installer
  • When using a large USB disk (80 GB+), the installer will exit abnormally immediately after clicking “next” in the GUI welcome screen. If you do want to use a large disk, the workaround is to disconnect the USB disk before clicking “next” on the GUI installer welcome screen. As soon as the next screen shows up, reconnect the USB drive.
  • Do the install as any other RedHat/CentOS/Fedora Core install. A nice guide is available here.
  • When the installer finishes, do not click “Reboot”.
  • Now go back to the serial console and use the following commands:
    • umount /mnt/sysimage/sys
    • umount /mnt/sysimage/proc
    • chroot /mnt/sysimage
    • source /etc/profile
    • mount /sys
    • mount /proc
    • Disable SELinux: Open /etc/selinux/config and change “SELINUX=enforcing” to “SELINUX=disabled”
    • Make sure your network card is set to use DHCP before going forward. If you have set up a static IP, temporarily change the configuration to use DHCP. This can be done by moving the configuration file: mv /etc/sysconfig/network-scripts/ifcfg-eth0 /etc/sysconfig/network-scripts/ifcfg-eth0.bak
    • Generate a new zImage to boot the kernel ramdisk from the network.
      • /sbin/mkinitrd --with=tg3 --with=nfs --net-dev=eth0 /boot/initrd-2.6.25-14.fc9-net.ppc64.img 2.6.25-14.fc9.ppc64
      • At this time, if you had static IP and moved the configuration file, move the file back: mv /etc/sysconfig/network-scripts/ifcfg-eth0.bak /etc/sysconfig/network-scripts/ifcfg-eth0
      • wrapper -i /boot/initrd-2.6.25-14.fc9-net.ppc64.img -o zImage.initrd-2.6.25-14.fc9-net.ppc64.img /boot/vmlinuz-2.6.25-14.fc9.ppc64.img
    • Now copy the generated zImage to the TFTP root directory using scp or by copying it to a USB disk.
    • Exit the chroot environment
      • umount /sys
      • umount /proc
      • exit
  • Now go back to the installer GUI and click on “Reboot”

This concludes the installation. Make sure you copy the generated zImage to the TFTP root directory so this image is provided to the system when it boots after the installation.
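Since the same two commands are needed again later for a different kernel, the zImage step above can be sketched as a small helper that builds both command lines from a kernel release string. This is a rough sketch, not a tested tool: the function name zimage_commands is hypothetical, the file-name pattern is copied from the commands in this post, and the mkinitrd options are assumed to be the usual double-hyphen long options. The helper only builds the commands; it does not run them.

```python
import shlex


def zimage_commands(release, arch="ppc64", nic_driver="tg3", net_dev="eth0"):
    """Build the mkinitrd and wrapper argument lists for a given kernel release.

    File names follow the pattern used in this post, e.g. release "2.6.25-14.fc9".
    """
    initrd = f"/boot/initrd-{release}-net.{arch}.img"
    zimage = f"zImage.initrd-{release}-net.{arch}.img"
    vmlinuz = f"/boot/vmlinuz-{release}.{arch}.img"
    mkinitrd = [
        "/sbin/mkinitrd",
        f"--with={nic_driver}",   # network driver needed inside the initrd
        "--with=nfs",
        f"--net-dev={net_dev}",
        initrd,
        f"{release}.{arch}",      # kernel version string as passed to mkinitrd
    ]
    wrapper = ["wrapper", "-i", initrd, "-o", zimage, vmlinuz]
    return mkinitrd, wrapper


if __name__ == "__main__":
    for cmd in zimage_commands("2.6.25-14.fc9"):
        print(shlex.join(cmd))
```

Printing the commands rather than executing them keeps the sketch safe to run anywhere; inside the chroot you would paste the printed lines at the serial console.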

Post Install Configuration:
Boot the system with the new zImage. The system will boot using the attached USB disk. You will be able to look at the boot process from the serial console. Now log in as root.

  • The first step is to install a Cell BE optimized kernel.
  • Download the kernel from BSC site: wget
  • Install the kernel: rpm -ivh --force kernel-
  • Add “--nodeps” to the command above if it does not successfully install the kernel.
  • Now generate a new zImage as per the above instructions using the newly installed initrd and vmlinuz (
  • Copy this zImage over to the TFTP root directory and overwrite the old zImage generated with the FC 9 kernel (2.6.25-14.fc9)
  • Reboot to boot into the new kernel.

SDK Installation & Executing Demo code:
SDK installation is pretty straightforward.

  • Download the SDK v3.1 from IBM.
  • Instructions on SDK installation are available here from IBM. The only thing to look out for is that tcl must be installed before the SDK installer: yum install tcl, and then install the SDK installer: rpm -ivh cell-install-3.1.0-0.0.noarch.rpm
  • Important Note: Follow the instructions on the IBM site to add exclude directives to YUM to prevent YUM from overwriting packages optimized for Cell BE.
  • Compiling demo code also is simple. Use the provided make files.
  • Before executing any demo codes, it is advisable to configure and mount a hugeTLBFS file system.
  • To maximize the performance, large data sets should be allocated from the Huge-TLBfs. This filesystem provides a mechanism for allocating 16MB memory pages. To check the size and number of available pages, examine /proc/meminfo. If Huge-TLBFS is configured and available, /proc/meminfo will have entries as follows:
  • HugePages_Total:    24
    HugePages_Free:     24
    HugePages_Rsvd:      0
    HugePages_Surp:      0
    Hugepagesize:    16384 kB

  • If your system has not been configured with a hugetlbfs, perform the following:
    mkdir -p /huge
    mount -t hugetlbfs nodev /huge
    echo <N> > /proc/sys/vm/nr_hugepages
    where <N> is the number of huge pages you want allocated to the hugetlbfs.
  •  If you experience difficulty configuring adequate huge pages, memory may be fragmented and a reboot may be required.
  • This sequence can also be added to a startup initialize script, like /etc/rc.d/rc.sysinit, so the hugeTLB filesystem is configured during system boot.
  • A test run of Matrix Multiplication code at /opt/cell/sdk/src/demos/matrix_mul is as follows:
  • [root@cellbe matrix_mul]# ./matrix_mul -i 3 -m 192 -s 8 -v 64 -n -o 4 -p
    Initializing Arrays … done
    Running test … done
    Verifying 64 entries … PASSED
    Performance Statistics:
    number of SPEs     = 8
    execution time     = 0.00 seconds
    computation rate   = 91.66 GFlops/sec
    data transfer rate = 6.70 GBytes/sec
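Going back to the huge page setup described above, the /proc/meminfo check and the page-count arithmetic can be sketched in Python. The function names parse_hugepages and pages_needed are made up for illustration, the sample text is the /proc/meminfo excerpt from this post, and the 100 MiB data set is just an example figure.

```python
def parse_hugepages(meminfo_text):
    """Extract the HugePages_* counters and Hugepagesize (in kB) from /proc/meminfo text."""
    info = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        key = key.strip()
        if sep and key.lower().startswith("hugepage"):
            info[key] = int(rest.split()[0])  # first token is the number; "kB" is dropped
    return info


def pages_needed(data_bytes, page_kb):
    """Number of huge pages required to hold data_bytes (rounded up)."""
    page_bytes = page_kb * 1024
    return -(-data_bytes // page_bytes)  # ceiling division


SAMPLE = """\
HugePages_Total:    24
HugePages_Free:     24
HugePages_Rsvd:      0
HugePages_Surp:      0
Hugepagesize:    16384 kB
"""

if __name__ == "__main__":
    info = parse_hugepages(SAMPLE)
    total_mb = info["HugePages_Total"] * info["Hugepagesize"] // 1024
    print(f"huge page pool: {total_mb} MB")  # 24 pages x 16 MB = 384 MB
    print(pages_needed(100 * 1024 * 1024, info["Hugepagesize"]))  # 100 MiB -> 7 pages
```

On a live system you would pass open("/proc/meminfo").read() instead of SAMPLE, and write the resulting page count to /proc/sys/vm/nr_hugepages.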

Memory Latency on AMD Opteron 2354

Tuesday, November 6th, 2007

In the continuing posts regarding our benchmarking exercise, we now share the memory latencies on AMD Opteron 2354.

The setup is essentially the same as described in the previous posts. I will refrain from detailing it again, for brevity.

LMBench 3.0 Alpha 8 was used to measure the memory latencies.

Here are the numbers:

L1 cache: 1.366 ns

L2 cache: 5.99 ns

Main Memory: 89.1 ns

Random Memory: 184.0 ns

The latencies look good so far. The main memory latency is a little bit higher than the latency from the AMD Opteron 22xx series. However, the Opteron 23xx series has an additional shared L3 cache of 2 MB. From other reviews on the web, it looks like this additional L3 cache is adding the latency.
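To put the latencies in perspective, they can be converted to core clock cycles. The sketch below assumes a 2.2 GHz core clock for the Opteron 2354 and assumes all four figures are in nanoseconds; neither is stated explicitly in this post, so treat the cycle counts as rough estimates.

```python
CLOCK_GHZ = 2.2  # ASSUMPTION: nominal Opteron 2354 core clock, not stated in the post

# Latencies from this post, assumed to all be in nanoseconds.
LATENCY_NS = {
    "L1 cache": 1.366,
    "L2 cache": 5.99,
    "Main Memory": 89.1,
    "Random Memory": 184.0,
}


def ns_to_cycles(latency_ns, clock_ghz=CLOCK_GHZ):
    """Convert a latency in nanoseconds to core clock cycles (GHz = cycles per ns)."""
    return latency_ns * clock_ghz


if __name__ == "__main__":
    for level, ns in LATENCY_NS.items():
        print(f"{level}: {ns} ns is about {ns_to_cycles(ns):.1f} cycles")
```

Under that clock assumption the L1 figure works out to roughly 3 cycles.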

It's the first cut … More numbers to come soon.

Memory bandwidth on AMD Opteron 2354

Monday, November 5th, 2007

We got our hands on a new mainboard supporting the split plane (Dual Dynamic Power Management) feature of AMD Opteron quad core (Barcelona) processors. The earlier mainboards do support Barcelona fully but not the split plane feature. Due to this, the memory controller on the Barcelona and the L2 cache run at a slower clock than on a split plane board. A slower clock rate implies lower memory bandwidth and increased latency compared to the same processor on a split plane board.

Well, this could be a great opportunity to test what improvements the split plane offers in terms of memory performance.

The test system is set up as follows:

HPC Systems, Inc. A1204

Dual AMD Opteron 2354

8 X 1 GB DDR2 667 MHz


Western Digital 250 GB SATA hard drive

SUN Studio 12

STREAM benchmark

Problem size: N = 20000000

Compiler command used

suncc -fast -xO4 -xprefetch -xprefetch_level=3 -xvector=simd -xarch=sse3 -xdepend  -m64 -xopenmp -o stream.big ../stream.c  -xlic_lib=sunperf -I../
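For a sense of scale, the memory footprint of this problem size can be worked out as follows. The sketch assumes the standard stream.c layout of three double-precision arrays (a, b and c) of N elements each; the function name is made up for illustration.

```python
N = 20_000_000          # problem size from this post
BYTES_PER_DOUBLE = 8    # STREAM arrays are double precision
NUM_ARRAYS = 3          # a, b and c in the standard stream.c


def stream_footprint_bytes(n, elem_bytes=BYTES_PER_DOUBLE, arrays=NUM_ARRAYS):
    """Total bytes occupied by the STREAM arrays for problem size n."""
    return n * elem_bytes * arrays


if __name__ == "__main__":
    total = stream_footprint_bytes(N)
    print(f"per array: {N * BYTES_PER_DOUBLE / 1e6:.0f} MB")  # 160 MB
    print(f"total:     {total / 1e6:.0f} MB")                 # 480 MB
```

At 160 MB per array, the working set is far larger than any on-chip cache, so the reported rates reflect main memory bandwidth rather than cache bandwidth.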

Performance for 1 thread (compiled without -xopenmp flag):

Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:        5724.3136       0.0559       0.0559       0.0559
Scale:       6077.3024       0.0527       0.0527       0.0527
Add:         5692.4606       0.0843       0.0843       0.0844
Triad:       5696.1831       0.0843       0.0843       0.0843
Solution Validates

We did see a higher bandwidth number with PGI compilers … close to 6.5 GB/s, but we are unable to post the result because the license has expired for the binaries compiled with PGI compilers.

Performance for 4 threads:

Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:       12230.5392       0.0262       0.0262       0.0262
Scale:      12099.2614       0.0265       0.0264       0.0265
Add:        11536.8169       0.0417       0.0416       0.0417
Triad:      11543.9895       0.0417       0.0416       0.0418
Solution Validates

Performance for 8 threads:

Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:       17516.0718       0.0183       0.0183       0.0183
Scale:      17382.8602       0.0184       0.0184       0.0185
Add:        16455.8826       0.0292       0.0292       0.0293
Triad:      16519.7865       0.0291       0.0291       0.0291
Solution Validates

From the numbers, we seem to have hit the same performance as advertised on the AMD web site.

The peak bandwidth of a 2P AMD Opteron system is 21.2 GB/s. We achieved a sustained 17.5 GB/s, i.e. 82% of peak.
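The efficiency figure, and how the Copy rate scales with thread count, can be reproduced with a few lines of Python. This is a quick sketch using the Copy rates from the tables above and the 21.2 GB/s peak quoted in this post; STREAM reports MB/s in units of 10^6 bytes per second, so 1 GB/s is taken as 1000 MB/s. The function names are made up for illustration.

```python
PEAK_GB_S = 21.2  # 2P peak bandwidth quoted in this post

# STREAM Copy rates in MB/s, keyed by thread count (dual-socket runs).
COPY_MB_S = {1: 5724.3136, 4: 12230.5392, 8: 17516.0718}


def efficiency(sustained_mb_s, peak_gb_s):
    """Fraction of peak bandwidth achieved (STREAM MB/s are 10**6 bytes/s)."""
    return sustained_mb_s / (peak_gb_s * 1000.0)


def speedup(rates, threads):
    """Bandwidth at the given thread count relative to the single-thread run."""
    return rates[threads] / rates[1]


if __name__ == "__main__":
    print(f"8-thread efficiency: {efficiency(COPY_MB_S[8], PEAK_GB_S):.1%}")
    for t in sorted(COPY_MB_S):
        print(f"{t} thread(s): {speedup(COPY_MB_S, t):.2f}x the 1-thread Copy rate")
```

The scaling is well short of linear in thread count, which is expected for a bandwidth-bound benchmark: a single thread already uses a large share of one socket's memory bandwidth.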

Here are the results with only one socket populated. This exercise is important to eliminate the issues of how the memory is allocated across sockets and also the issue of threads scheduled on different sockets.

Performance for 1 thread (compiled without -xopenmp flag):

Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:        6256.7322       0.0528       0.0527       0.0528
Scale:       6417.2126       0.0499       0.0499       0.0499
Add:         6306.9054       0.0761       0.0761       0.0762
Triad:       6333.5465       0.0758       0.0758       0.0758
Solution Validates

Performance for 4 threads :

Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:        9148.0695       0.0350       0.0350       0.0351
Scale:       9080.6064       0.0353       0.0352       0.0353
Add:         8510.1783       0.0565       0.0564       0.0565
Triad:       8511.8559       0.0564       0.0564       0.0565
Solution Validates

That is about 9.1 GB/s sustained from a peak of 10.1 GB/s, i.e. 90% efficiency.

PGI Compiler 7.1 (7.1-1) and bundled ACML for Barcelona

Thursday, November 1st, 2007

I am using PGI 7.1 compilers for my benchmarking exercise. The compiler includes a bundled ACML version and supports the AMD Opteron quad-core Barcelona. Naturally, I did not think twice and started linking with the ACML provided with the compiler.

The best DGEMM number I got was about 53% of the peak. That does not seem right. However, the same ACML version did provide a DGEMM value as high as 87% on AMD Opteron dual core.  

After wasting some time and effort, I downloaded ACML from AMD Developer Central. Linking BLASBench with this new ACML, I was able to get a DGEMM value that was about 87% of the peak.

Maybe this post will save you some time if you are using ACML with PGI compilers.

Please note: When linking with ACML from the AMD developer site using the C compiler pgcc, you need to provide the following libraries to the linker: -lrt -lpgftnrtl

-lpgftnrtl links the FORTRAN runtime with the code.

If you are using the FORTRAN compiler, pgf77, to link the code, you do not need to provide -lpgftnrtl.

If you are linking with the FORTRAN compiler but main() is in a C file, provide -Mnomain to the linker.
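The linking rules above can be summarized as a small decision table. The sketch below only encodes what this post states for pgcc and pgf77 with the ACML from the AMD developer site; the function name extra_link_flags is hypothetical, and other compiler drivers or ACML builds may need different flags.

```python
def extra_link_flags(linker, main_in_c=False):
    """Extra linker flags for ACML, per the notes in this post.

    linker:    "pgcc" (C driver) or "pgf77" (FORTRAN driver)
    main_in_c: True if main() lives in a C source file
    """
    if linker == "pgcc":
        # The C driver does not pull in the FORTRAN runtime by itself.
        return ["-lrt", "-lpgftnrtl"]
    if linker == "pgf77":
        # The FORTRAN driver supplies its own main unless told otherwise.
        return ["-Mnomain"] if main_in_c else []
    raise ValueError(f"no rule recorded for linker {linker!r}")


if __name__ == "__main__":
    print(extra_link_flags("pgcc", main_in_c=True))   # ['-lrt', '-lpgftnrtl']
    print(extra_link_flags("pgf77", main_in_c=True))  # ['-Mnomain']
```

A helper like this can sit in a build script so the right flags are chosen from the driver name instead of being remembered by hand.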

Omitting -Mnomain produces the following error:

bb.o: In function `main':
bb.c:(.text+0xde0): multiple definition of `main'
/opt/pgi/linux86-64/7.1-1/lib/pgfmain.o:pgfmain.c:(.text+0x0): first defined here
/usr/bin/ld: Warning: size of symbol `main' changed from 79 in /opt/pgi/linux86-64/7.1-1/lib/pgfmain.o to 13982 in bb.o
/opt/pgi/linux86-64/7.1-1/lib/pgfmain.o: In function `main':
pgfmain.c:(.text+0x34): undefined reference to `MAIN_'

Using the C compiler, pgcc, to link the code and failing to provide -lpgftnrtl will result in the following error:

/opt/acml4.0.0/pgi64/lib/libacml.a(dgemv.o): In function `dgemv.pgi.uni.1_':
dgemv.F:(.text+0x508): undefined reference to `ftn_str_index'
/opt/acml4.0.0/pgi64/lib/libacml.a(dgemv.o): In function `dgemv.pgi.uni.2_':
dgemv.F:(.text+0x1518): undefined reference to `ftn_str_index'
/opt/acml4.0.0/pgi64/lib/libacml.a(sgemv.o): In function `sgemv.pgi.uni.1_':
sgemv.F:(.text+0x4eb): undefined reference to `ftn_str_index'
/opt/acml4.0.0/pgi64/lib/libacml.a(sgemv.o): In function `sgemv.pgi.uni.2_':
sgemv.F:(.text+0xfd0): undefined reference to `ftn_str_index'
/opt/acml4.0.0/pgi64/lib/libacml.a(xerbla.o): In function `xerbla.pgi.uni.1_':
xerbla.f:(.text+0x5f): undefined reference to `fio_src_info'
xerbla.f:(.text+0x74): undefined reference to `fio_fmtw_init'
xerbla.f:(.text+0x90): undefined reference to `fio_fmt_write'
xerbla.f:(.text+0xa3): undefined reference to `fio_sc_fmt_write'
xerbla.f:(.text+0xa8): undefined reference to `fio_fmtw_end'
xerbla.f:(.text+0xb1): undefined reference to `ftn_stop'
xerbla.f:(.text+0xe2): undefined reference to `ftn_strcmp'
xerbla.f:(.text+0x11b): undefined reference to `fio_src_info'
xerbla.f:(.text+0x139): undefined reference to `fio_fmtr_intern_init'
xerbla.f:(.text+0x152): undefined reference to `fio_fmt_read'
xerbla.f:(.text+0x16b): undefined reference to `fio_fmt_read'
xerbla.f:(.text+0x184): undefined reference to `fio_fmt_read'
xerbla.f:(.text+0x19d): undefined reference to `fio_fmt_read'
xerbla.f:(.text+0x1a2): undefined reference to `fio_fmtr_end'
xerbla.f:(.text+0x1fe): undefined reference to `fio_src_info'
xerbla.f:(.text+0x215): undefined reference to `fio_fmtw_init'
xerbla.f:(.text+0x228): undefined reference to `fio_sc_fmt_write'
xerbla.f:(.text+0x240): undefined reference to `fio_sc_fmt_write'
xerbla.f:(.text+0x245): undefined reference to `fio_fmtw_end'
xerbla.f:(.text+0x25d): undefined reference to `fio_src_info'
xerbla.f:(.text+0x274): undefined reference to `fio_fmtw_init'
xerbla.f:(.text+0x287): undefined reference to `fio_sc_fmt_write'
xerbla.f:(.text+0x28c): undefined reference to `fio_fmtw_end'
xerbla.f:(.text+0x2a7): undefined reference to `ftn_strcmp'
xerbla.f:(.text+0x2c1): undefined reference to `fio_src_info'
xerbla.f:(.text+0x2d8): undefined reference to `fio_fmtw_init'
xerbla.f:(.text+0x2f4): undefined reference to `fio_fmt_write'
xerbla.f:(.text+0x310): undefined reference to `fio_fmt_write'
xerbla.f:(.text+0x315): undefined reference to `fio_fmtw_end'
xerbla.f:(.text+0x34f): undefined reference to `fio_src_info'
xerbla.f:(.text+0x36d): undefined reference to `fio_fmtw_intern_init'
xerbla.f:(.text+0x385): undefined reference to `fio_sc_fmt_write'
xerbla.f:(.text+0x39d): undefined reference to `fio_sc_fmt_write'
xerbla.f:(.text+0x3b5): undefined reference to `fio_sc_fmt_write'
xerbla.f:(.text+0x3cd): undefined reference to `fio_sc_fmt_write'
xerbla.f:(.text+0x3d2): undefined reference to `fio_fmtw_end'
/opt/acml4.0.0/pgi64/lib/libacml.a(xerbla.o): In function `xerbla.pgi.uni.2_':
xerbla.f:(.text+0x46f): undefined reference to `fio_src_info'
xerbla.f:(.text+0x484): undefined reference to `fio_fmtw_init'
xerbla.f:(.text+0x4a0): undefined reference to `fio_fmt_write'
xerbla.f:(.text+0x4b3): undefined reference to `fio_sc_fmt_write'
xerbla.f:(.text+0x4b8): undefined reference to `fio_fmtw_end'
xerbla.f:(.text+0x4c1): undefined reference to `ftn_stop'
xerbla.f:(.text+0x4f2): undefined reference to `ftn_strcmp'
xerbla.f:(.text+0x52b): undefined reference to `fio_src_info'
xerbla.f:(.text+0x549): undefined reference to `fio_fmtr_intern_init'
xerbla.f:(.text+0x562): undefined reference to `fio_fmt_read'
xerbla.f:(.text+0x57b): undefined reference to `fio_fmt_read'
xerbla.f:(.text+0x594): undefined reference to `fio_fmt_read'
xerbla.f:(.text+0x5ad): undefined reference to `fio_fmt_read'
xerbla.f:(.text+0x5b2): undefined reference to `fio_fmtr_end'
xerbla.f:(.text+0x60e): undefined reference to `fio_src_info'
xerbla.f:(.text+0x625): undefined reference to `fio_fmtw_init'
xerbla.f:(.text+0x638): undefined reference to `fio_sc_fmt_write'
xerbla.f:(.text+0x650): undefined reference to `fio_sc_fmt_write'
xerbla.f:(.text+0x655): undefined reference to `fio_fmtw_end'
xerbla.f:(.text+0x66d): undefined reference to `fio_src_info'
xerbla.f:(.text+0x684): undefined reference to `fio_fmtw_init'
xerbla.f:(.text+0x697): undefined reference to `fio_sc_fmt_write'
xerbla.f:(.text+0x69c): undefined reference to `fio_fmtw_end'
xerbla.f:(.text+0x6b7): undefined reference to `ftn_strcmp'
xerbla.f:(.text+0x6d1): undefined reference to `fio_src_info'
xerbla.f:(.text+0x6e8): undefined reference to `fio_fmtw_init'
xerbla.f:(.text+0x704): undefined reference to `fio_fmt_write'
xerbla.f:(.text+0x720): undefined reference to `fio_fmt_write'
xerbla.f:(.text+0x725): undefined reference to `fio_fmtw_end'
xerbla.f:(.text+0x75f): undefined reference to `fio_src_info'
xerbla.f:(.text+0x77d): undefined reference to `fio_fmtw_intern_init'
xerbla.f:(.text+0x795): undefined reference to `fio_sc_fmt_write'
xerbla.f:(.text+0x7ad): undefined reference to `fio_sc_fmt_write'
xerbla.f:(.text+0x7c5): undefined reference to `fio_sc_fmt_write'
xerbla.f:(.text+0x7dd): undefined reference to `fio_sc_fmt_write'
xerbla.f:(.text+0x7e2): undefined reference to `fio_fmtw_end'
/opt/acml4.0.0/pgi64/lib/libacml.a(dgeblkmatS.o): In function `dgeblkmats.pgi.uni.1_':
dgeblkmatS.f:(.text+0x80): undefined reference to `ftn_str_index'
/opt/acml4.0.0/pgi64/lib/libacml.a(dgeblkmatS.o): In function `dgeblkmats.pgi.uni.2_':
dgeblkmatS.f:(.text+0x480): undefined reference to `ftn_str_index'
/opt/acml4.0.0/pgi64/lib/libacml.a(sgeblk2matS.o): In function `sgeblk2mats.pgi.uni.1_':
sgeblk2matS.f:(.text+0x7b): undefined reference to `ftn_str_index'
/opt/acml4.0.0/pgi64/lib/libacml.a(sgeblk2matS.o): In function `sgeblk2mats.pgi.uni.2_':
sgeblk2matS.f:(.text+0x50b): undefined reference to `ftn_str_index'
child process exit status 1: /usr/bin/ld