Archive for the ‘HPC Systems’ Category

Testing automated updates

Thursday, October 29th, 2009

This is a test.

Building the beast

Friday, September 11th, 2009

As detailed in yesterday’s post, here are the first pics of the system we are building … right now just bare metal … and the fun starts very soon.

Building the ultimate media workhorse

Thursday, September 10th, 2009


A prominent post production company approached us to build a system for post production (obviously!) workloads. The interesting part is that they would like to use a high end card to capture up to four HD or SD video streams, a high end video card, and an output card to composite real time video and graphics. As if that were not enough, we are also adding a Fibre Channel card for storage access and another HD video capture card.

In a nutshell, we are building a system powerful enough to handle the demands of an HD digital video pipeline (DVP) for real time video processing, compositing and rich media production. Want more? They want it to run Linux and Windows at once. So, let’s try to wrap this up in a sentence: a dual socket system supporting two HD video capture cards, a high end graphics card, an HD output card and an FC card, running Windows and Linux virtualized with access to these cards. Can your vendor provide this system and support it?

The most popular vendor in digital media does provide digital media solutions (well, not like these systems anyway), but at a very high premium. Why? I don’t know. Maybe it is because they spent so much time and energy establishing themselves as the leader in solutions for the media space, even though those solutions are not much different from other high end computing systems. And when it comes to high end computing, HPC Systems has delivered machines more complex than most people would imagine. We have delivered numerous multi-socket (4 and 8 socket) systems to some of the premier federal organizations for a variety of workloads: forest fire simulations, IC design and simulation, virtualization, desk-side supercomputing and financial modeling. We have even delivered a fully integrated single cluster that includes a Cell processor rack mount system, CUDA cards, accelerators, Opteron processors and Infiniband. Not many vendors can deliver designs that complex.

The point of all this is not to brag but to say that we don’t charge our customers a premium based on their requirements. We don’t charge you more because of your company’s size or because of anything else. However, we do charge for non-trivial software installation or configuration. Given the quality of the systems (with full software configuration and free phone and email support), we deliver unbeatable value.

Well, I will keep you posted on the progress of this project. Keep coming back!

OFED 1.4 stack on RHEL 5.2

Friday, August 28th, 2009


I have been working with Infiniband since the first card came out from Topspin. My previous employer was a partner with Topspin for IB products. By then I had already worked with high speed interconnects like Myrinet, Scali (Dolphin Wulfkit) and, of course, multiple versions of PARAMNet, among countless others. Many have come and gone, but Infiniband is here to stay.

Even with Cisco dropping out of Infiniband, strong support from QLogic, Voltaire and Mellanox will keep it going for a while. Cisco has no advantage with Infiniband; their core business is Ethernet, and they need to do what they need to do to keep Ethernet the core interconnect for everything. Even though that makes sense for Cisco, HPC is not “everything”. It has never been in the category of everything else. The requirements of HPC interconnects are unique: low latency and high bandwidth are the heart and soul. Getting those two in a general purpose network would be nice, but who would pay for something they don’t need?

Now to the main topic of this post: configuring ConnectX Infiniband on RHEL 5.2 x86_64 with OFED 1.4.

OFED is very well packaged and most of the time does not need additional work for installation. Here is the simple method:

  1. Download OFED
  2. Extract the files (tar -zxvf OFED-x.y.tgz)
  3. run the install script (
  4. For a non-HPC installation, menu choices 2-1 will suffice; for an HPC specific installation, choose 2-2 or 2-3. You are pretty safe choosing 2-3. If you choose 2-2, some Infiniband diagnostic utilities won’t be installed. However, you will end up with HPC specific packages like MPI.
  5. Make a note of the required packages; you can find almost all of them on the RedHat disk. If you are registered with RHN, you can use yum to install them instead.
  6. At this point, the needed kernel modules (drivers & upper level protocols) should be installed.
  7. The installer will ask whether you would like to configure IPoIB (IP tunneling over Infiniband). Say Y if you plan to use IPoIB and provide the IP addresses. If not, say N.
  8. Issue a reboot command and, after the system reboots, check lsmod for the list of modules currently loaded.
  9. You should see a list of kernel modules with names starting with ib_ (ib_cm, ib_core, ib_umad, etc.).
  10. At this point, we can safely assume the drivers are loaded and the adapter is working. You can check the status of the installation using the diagnostics included with OFED. More on that below.
  11. The Infiniband fabric needs a working Subnet Manager. If you are using a managed switch like the QLogic 9024, it generally includes an embedded Fabric Management component. If you are using an entry level switch without an embedded subnet manager, or you would like to run your own SM on a host system, you can use the OpenSM (OpenSubNetManager) component bundled with OFED. Start OpenSM using the command /etc/init.d/opensmd start. NOTE: Until you have a working subnet manager, the adapters will not be able to do any useful work.
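The module check in steps 8 and 9 can be scripted. The following is only a minimal sketch: the function name count_ib_modules is made up for this example, and it simply counts the lines of lsmod-style text whose module name starts with ib_.

```shell
# Count loaded kernel modules whose names start with "ib_".
# Reads lsmod-style text on stdin (first line is the column header),
# so it also works on output saved from another machine.
# NOTE: count_ib_modules is a hypothetical helper, not part of OFED.
count_ib_modules() {
    awk 'NR > 1 && $1 ~ /^ib_/ { n++ } END { print n + 0 }'
}

# Typical use on a live system (assumes lsmod is on the PATH):
#   lsmod | count_ib_modules
```

A result of 0 after the reboot suggests the OFED modules were not loaded and the installation should be revisited before looking at the fabric itself.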


OFED comes with some basic diagnostic commands that can be used to test the status of the cards in your system. One of them is ibv_devinfo. This command prints the adapter status and attributes.

[root@localhost ~]# ibv_devinfo
hca_id: mlx4_0
        fw_ver:                         2.3.000
        node_guid:                      0030:48ff:ff95:d928
        sys_image_guid:                 0030:48ff:ff95:d92b
        vendor_id:                      0x02c9
        vendor_part_id:                 25418
        hw_ver:                         0xA0
        board_id:                       SM_1021000001
        phys_port_cnt:                  2
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                2048 (4)
                        active_mtu:             2048 (4)
                        sm_lid:                 1
                        port_lid:               1
                        port_lmc:               0x00

                port:   2
                        state:                  PORT_DOWN (1)
                        max_mtu:                2048 (4)
                        active_mtu:             2048 (4)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00

In the above output, check the port “state”. With a working subnet manager, a cabled port shows up as PORT_ACTIVE (ibstat reports the same condition as State: Active). Without a working subnet manager, a cabled port typically stays at PORT_INIT (ibstat: Initializing).

The state is shown as PORT_DOWN when there is no cable connected to the port; ibstat then reports the physical state as Polling.
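To check the port states without reading the whole listing, the output can be filtered. This is a rough sketch that assumes the ibv_devinfo layout shown above (a “port:” line followed by a “state:” line); port_states is a name invented for this example.

```shell
# Print "port <n>: <state>" for every port found in ibv_devinfo-style
# text read from stdin.
# NOTE: port_states is a hypothetical helper, not an OFED command.
port_states() {
    awk '
        $1 == "port:"  { port = $2 }                     # remember the current port number
        $1 == "state:" { print "port " port ": " $2 }    # report its state
    '
}

# Typical use on a live system:
#   ibv_devinfo | port_states
```

For the adapter above this would report port 1 as PORT_ACTIVE and port 2 as PORT_DOWN.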

To list adapters in the system:

[root@localhost ~]# ibv_devices
    device                 node GUID
    ------              ----------------
    mlx4_0              003048ffff95d928

Once you have a working subnet manager and at least two ports showing up as “PORT_ACTIVE” on at least two machines, you can test the fabric using the simple pingpong or send/receive test routines.

Start ibv_rc_pingpong on one machine.

Start ibv_rc_pingpong <host name or ip> on the other machine. The host name should be the name of the first machine, on which the command was already started.

If everything is working as it should, you should see the following output:

First host:

[root@localhost x86_64]# ibv_rc_pingpong
  local address:  LID 0x0002, QPN 0x00004a, PSN 0x43da29
  remote address: LID 0x0001, QPN 0x00004a, PSN 0x446364
8192000 bytes in 0.01 seconds = 6202.54 Mbit/sec
1000 iters in 0.01 seconds = 10.57 usec/iter


Second Host:

[root@localhost ~]# ibv_rc_pingpong
  local address:  LID 0x0001, QPN 0x00004a, PSN 0x446364
  remote address: LID 0x0002, QPN 0x00004a, PSN 0x43da29
8192000 bytes in 0.01 seconds = 6172.16 Mbit/sec
1000 iters in 0.01 seconds = 10.62 usec/iter

Depending on the type of card, cable, switch, OS, board chipset and PCI expansion slot you use, your bandwidth and latency will vary significantly. Keep in mind that this is only a functional test, not a benchmark for best bandwidth and latency.
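The Mbit/sec figure in the pingpong output is just the byte count converted to bits and divided by the elapsed time. As a small sketch of that arithmetic (mbit_per_sec is a name made up here, and the elapsed time is passed in microseconds):

```shell
# Convert a byte count and an elapsed time in microseconds to Mbit/sec.
# Bits divided by microseconds is numerically equal to Mbit/sec.
# NOTE: mbit_per_sec is a hypothetical helper for illustration only.
mbit_per_sec() {
    awk -v bytes="$1" -v usec="$2" 'BEGIN { printf "%.2f\n", bytes * 8 / usec }'
}

# Example with the first host's figures: 1000 iterations at 10.57 usec each
# is roughly 10570 usec in total.
#   mbit_per_sec 8192000 10570
```

With the rounded times printed above, the result lands close to, but not exactly on, the reported figure, since the tool works from the unrounded elapsed time.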

Other diagnostic tools:

  1. ibstat – displays IB device status like firmware version, port state, GUIDs, etc (similar to ibv_devinfo)
  2. ibnetdiscover – discovers IB network topology
  3. ibhosts – shows IB nodes in topology
  4. ibchecknet – runs IB network validation
  5. ibping – ping IB address
  6. ibdatacounters – summary of IB port data counters

and more …

Performance Tests:

OFED bundles a few programs to test the bandwidth and latency of your Infiniband fabric.

Bandwidth test:

  1. start ib_read_bw on one machine
  2. start ib_read_bw <hostname or ip> on second machine

Latency Test:

  1. start ib_read_lat on one machine
  2. start ib_read_lat <hostname or ip> on second machine

Make sure power management is turned off before you run these tests.

In case of any problems, the first thing to check is the subnet manager, then the ibstat and ibchecknet tools.

HPC Systems is now a QLogic Infiniband SignatureHPC Partner

Thursday, August 6th, 2009

We are proud to announce that HPC Systems, Inc. is now a QLogic SignatureHPC Partner for Infiniband products.

That means our employees, both sales and technical, are certified in Infiniband technologies. Our sales team will be able to recommend the best Infiniband solution for your needs, and our technical team will deliver on those commitments.

We have always been ahead in delivering technically sound and well designed Infiniband solutions to our customers in the academic, research, federal government and CFD spaces. Our partnership and certification with QLogic takes us one more step ahead in delivering the best in High Performance Computing to our customers.


HPC Systems at SuperComputing 08 (SC08)

Thursday, December 4th, 2008

We were at booth number: 1726

On display was:

HiPerStation 8000 with 2X NVIDIA Tesla C1060

Here is a brief video of our exhibit at SC08. The demo shows a couple of codes from the NVIDIA CUDA SDK and an instance of NAMD ported to CUDA.

Unable to update SUSE Linux 10

Wednesday, December 3rd, 2008

If you get this error message when trying to update a SUSE 10 based system using the Novell Customer Center Configuration menu:

Execute curl command failed with '60':
curl: (60) SSL certificate problem, verify that the CA cert is OK. Details:
error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify

the easiest & fastest way to fix it is to

Check the system time!

Yes, simple as that. Make sure your system time is correct and your update should proceed smoothly.
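A quick way to catch a badly wrong clock is to compare it against a date that must already be in the past. The sketch below is only an illustration: clock_not_before and REFERENCE_EPOCH are names made up for this example, the reference value is an arbitrary choice (2008-12-01 00:00:00 UTC), and date +%s is assumed to be supported by the installed date command.

```shell
# Succeed if the epoch time in $1 is not earlier than the reference epoch
# in $2. Both arguments are seconds since 1970-01-01 UTC.
# NOTE: clock_not_before is a hypothetical helper for illustration only.
clock_not_before() {
    [ "$1" -ge "$2" ]
}

# Arbitrary reference chosen for this example: 2008-12-01 00:00:00 UTC.
REFERENCE_EPOCH=1228089600

# Typical use (assumes "date +%s" prints the current epoch time):
#   clock_not_before "$(date +%s)" "$REFERENCE_EPOCH" || echo "system clock looks wrong"
```

A clock that has fallen back to a much earlier date, for example after a dead CMOS battery, fails this check, and it also falls outside the validity period of the server certificate, which is what triggers the curl error.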

There goes almost a day wasted trying to figure this out.

As you can tell, I am not that good with SUSE, or am I? :)

Installing Fedora Core 9 and Cell SDK 3.1 on Cell Blade

Wednesday, December 3rd, 2008

We recently had a customer requesting a Cell Blade system to be integrated into their Infiniband cluster. Since they were looking at having only one node, we suggested using the 1U dual Cell based system. I am going to explain here the process of installing Fedora Core 9 on this system. This should also apply to other RedHat based distributions.

If you are considering purchasing the 1U Dual Cell Based System from Mercury Systems, please note that it has very long lead times. For the system we purchased, the lead time was about 16 weeks. Another important aspect is that this system comes with just two Cell processors and memory on board. Nothing else. No hard disk, no PCI slots. On board video is available but is not supported by the system. If you are going to use any add-on cards, you will have to order the PCI expansion slots along with the system. To use disk storage, you will have to order a SAS disk with the system and the PCI riser card as well. This is something we overlooked; hopefully this will help someone else when purchasing from Mercury Systems.

Turning on your system: The system cannot be accessed via the regular KVM connections. The provided serial cable has to be connected to a standalone PC, and a utility like HyperTerminal or minicom has to be used to access the system console.

  • Start HyperTerminal or minicom and open the serial port connection.
  • Switch on the system.
  • You will see a lot of text go by. Press “s” a number of times to enter the firmware prompt. The system boots from the network by default.
  • Once the firmware prompt appears, you can choose which device to boot from.
  • For example: boot net boots from the network.
  • Two hotkeys F1 and F2 are available for entering the management system (BIOS)

System Installation: The Cell system (Mercury Systems 1U Dual Cell Blade Based System or IBM QS22) cannot boot from a disk; it can boot only from the network. This is a big inconvenience because neither FC9 nor RHEL 5.2 supports NFS based (nfsroot) installs, which makes it something of a chicken & egg problem: the Cell system can boot only from the network, but the OS does not support an NFS root install. YellowDog Linux 6.1 from Terrasoft (now Fixstars) advertises fast NFS root install support. There is a nice installation tutorial available for YDL here. The guide does not mention that the NFS root install is available only in the commercial version. After a good number of wasted hours trying to do an NFS root install with YDL, I gave up on it.

IBM Support has a nice page on how to install Fedora / RedHat based distributions on QS21 / QS22 using a USB disk.
Using the IBM Support page and a USB disk, I was able to finally get the system running. Here is the procedure for Fedora Core 9 PPC:

  • You will need a TFTP / DHCP server to install or a USB DVD ROM drive. Instructions on setting up TFTP / DHCP server can be found here.
  • Copy /images/netboot/ppc64.img to the TFTP root directory. This is the kernel the system will boot when using TFTP/DHCP setup. If you are using a DVD drive, just boot from the DVD. Make sure to check the boot order. By default network is the first boot device. You can force booting from the firmware prompt (pressing “s” while system is booting) using the command “boot
  • Get a nice USB hard disk. According to the IBM Support page, only the IBM 80 GB USB & Lenovo 120 GB USB disks are supported. I am using a Western Digital 320 GB USB disk (My Book). I did face some issues with this, though nothing serious. More information on the workaround below.
  • At the firmware prompt, use “boot net vnc” to boot the system over the network.
  • Answer the installer prompts until the GUI starts.
  • Now use a VNC client to connect to the installer using the IP provided by the installer.
  • When using a large USB disk (80 GB+), the installer will exit abnormally immediately after clicking “next” in the GUI welcome screen. If you do want to use a large disk, the workaround is to disconnect the USB disk before clicking “next” on the GUI installer welcome screen. As soon as the next screen shows up, reconnect the USB drive.
  • Do the install as any other RedHat/CentOS/Fedora Core install. A nice guide is available here.
  • When the installer finishes, do not click “Reboot”.
  • Now go back to the serial console and use the following commands:
    • umount /mnt/sysimage/sys
    • umount /mnt/sysimage/proc
    • chroot /mnt/sysimage
    • source /etc/profile
    • mount /sys
    • mount /proc
    • Disable SELinux: Open /etc/selinux/config and change “SELINUX=enforcing” to “SELINUX=disabled”
    • Make sure your network card is set to use DHCP before going forward. If you have set up a static IP, temporarily change the configuration to use DHCP. This can be done by moving the configuration file: mv /etc/sysconfig/network-scripts/ifcfg-eth0 /etc/sysconfig/network-scripts/ifcfg-eth0.bak
    • Generate a new zImage to boot the kernel ramdisk from the network.
      • /sbin/mkinitrd --with=tg3 --with=nfs --net-dev=eth0 /boot/initrd-2.6.25-14.fc9-net.ppc64.img 2.6.25-14.fc9.ppc64
      • At this time, if you had a static IP and moved the configuration file, move the file back: mv /etc/sysconfig/network-scripts/ifcfg-eth0.bak /etc/sysconfig/network-scripts/ifcfg-eth0
      • wrapper -i /boot/initrd-2.6.25-14.fc9-net.ppc64.img -o zImage.initrd-2.6.25-14.fc9-net.ppc64.img /boot/vmlinuz-2.6.25-14.fc9.ppc64.img
    • Now copy the generated zImage to the TFTP root directory using scp or by copying it to a USB disk.
    • Exit the chroot environment
      • umount /sys
      • umount /proc
      • exit
  • Now go back to the installer GUI and click on “Reboot”

This concludes the installation. Make sure you copy the generated zImage to the TFTP root directory so this image is provided to the system when it boots after the installation.
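Because the zImage has to be regenerated whenever the kernel changes, it can help to derive the two commands from the kernel version string instead of retyping the file names. The sketch below only prints the commands and does not run them; print_netboot_cmds is a name made up for this example, the file naming simply mirrors the FC9 example in the steps above, and the options are written with plain double hyphens, which is how they have to be typed in a shell.

```shell
# Print the mkinitrd and wrapper commands for a given kernel version,
# following the file naming used in the installation steps.
# NOTE: print_netboot_cmds is a hypothetical helper; it prints, it does not execute.
print_netboot_cmds() {
    kver="$1"                  # e.g. 2.6.25-14.fc9.ppc64
    base="${kver%.ppc64}"      # e.g. 2.6.25-14.fc9
    initrd="/boot/initrd-${base}-net.ppc64.img"
    echo "/sbin/mkinitrd --with=tg3 --with=nfs --net-dev=eth0 ${initrd} ${kver}"
    echo "wrapper -i ${initrd} -o zImage.initrd-${base}-net.ppc64.img /boot/vmlinuz-${kver}.img"
}

# Typical use inside the chroot:
#   print_netboot_cmds 2.6.25-14.fc9.ppc64
```

Review the printed lines against the files actually present in /boot before running them, since a different kernel package may not follow the same naming.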

Post Install Configuration:
Boot the system with the new zImage. The system will boot using the attached USB disk. You will be able to look at the boot process from the serial console. Now login as root.

  • The first step is to install a Cell BE optimized kernel.
  • Download the kernel from the BSC site: wget
  • Install the kernel: rpm -ivh --force kernel-
  • Add “--nodeps” to the command above if it does not successfully install the kernel.
  • Now generate a new zImage as per the above instructions using the newly installed initrd and vmlinuz (
  • Copy this zImage over to the TFTP root directory and overwrite the old zImage generated with the FC 9 kernel (2.6.25-14.fc9)
  • Reboot to boot into the new kernel.

SDK Installation & Executing Demo code:
SDK installation is pretty straightforward.

  • Download the SDK v3.1 from IBM.
  • Instructions on SDK installation are available here from IBM. The only thing to look out for is that tcl must be installed before the SDK installer: yum install tcl, and then install the SDK installer: rpm -ivh cell-install-3.1.0-0.0.noarch.rpm
  • Important Note: Follow the instructions on the IBM site to add exclude directives to YUM to prevent YUM from overwriting packages optimized for Cell BE.
  • Compiling demo code also is simple. Use the provided make files.
  • Before executing any demo codes, it is advisable to configure and mount a hugeTLBFS file system.
  • To maximize the performance, large data sets should be allocated from the Huge-TLBfs. This filesystem provides a mechanism for allocating 16MB memory pages. To check the size and number of available pages, examine /proc/meminfo. If Huge-TLBFS is configured and available, /proc/meminfo will have entries as follows:
  • HugePages_Total:    24
    HugePages_Free:     24
    HugePages_Rsvd:      0
    HugePages_Surp:      0
    Hugepagesize:    16384 kB

  • If your system has not been configured with a hugetlbfs, perform the following:
    mkdir -p /huge
    mount -t hugetlbfs nodev /huge
    echo N > /proc/sys/vm/nr_hugepages
    where N is the number of huge pages you want allocated to the hugetlbfs.
  • If you experience difficulty configuring adequate huge pages, memory may be fragmented and a reboot may be required.
  • This sequence can also be added to a startup initialize script, like /etc/rc.d/rc.sysinit, so the hugeTLB filesystem is configured during system boot.
  • A test run of Matrix Multiplication code at /opt/cell/sdk/src/demos/matrix_mul is as follows:
  • [root@cellbe matrix_mul]# ./matrix_mul -i 3 -m 192 -s 8 -v 64 -n -o 4 -p
    Initializing Arrays … done
    Running test … done
    Verifying 64 entries … PASSED
    Performance Statistics:
    number of SPEs     = 8
    execution time     = 0.00 seconds
    computation rate   = 91.66 GFlops/sec
    data transfer rate = 6.70 GBytes/sec
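For the huge page setup described earlier, it is handy to see how much memory the configured pages add up to. This is a minimal sketch that reads /proc/meminfo-style text and multiplies HugePages_Total by Hugepagesize; hugepage_total_mb is a name invented for this example.

```shell
# Read /proc/meminfo-style text on stdin and print the total memory covered
# by the configured huge pages, in MB.
# NOTE: hugepage_total_mb is a hypothetical helper for illustration only.
hugepage_total_mb() {
    awk '
        $1 == "HugePages_Total:" { pages = $2 }      # number of huge pages
        $1 == "Hugepagesize:"    { size_kb = $2 }    # size of one page in kB
        END { print pages * size_kb / 1024 }
    '
}

# Typical use on a live system:
#   hugepage_total_mb < /proc/meminfo
```

With the 24 pages of 16384 kB shown in the /proc/meminfo listing, this works out to 384 MB available to the hugetlbfs.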