Kernel recompilation & HPC application performance

October 30th, 2009

Some questions never die; whether to recompile the kernel to improve application performance is one of them. I have heard this question from users in many domains (CFD, seismic, financial, oil & gas, academic, bio-molecular modeling and so on). It always starts the same way.

“I think I should recompile the kernel of my cluster so I can have better performance. What do you think?”

And my answer is always “No”. It does sound logical … you compile your code with the best possible optimizations and you get better performance (in most cases, I should add). Why would that not apply to the kernel? After all, the kernel is what manages my processes and runs my system. It’s easy to start the debate this way but miss a key aspect.

Here are a few key questions to ask before you start on this (almost always) fruitless exercise:

  • How much time does my (scientific) code actually spend in the kernel while it is running?
  • How much of that time is actually spent doing something useful, rather than waiting on something else (our good old friend disk I/O, or interrupt handling)?
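The first question is easy to answer empirically: compare user CPU time against system (kernel) CPU time for a run of your code. A minimal Python sketch, where `workload` is just a made-up stand-in for your application:

```python
import resource

def cpu_breakdown(fn):
    """Run fn() and report user vs. kernel (system) CPU time."""
    before = resource.getrusage(resource.RUSAGE_SELF)
    fn()
    after = resource.getrusage(resource.RUSAGE_SELF)
    user = after.ru_utime - before.ru_utime
    system = after.ru_stime - before.ru_stime
    total = user + system
    pct = system / total * 100 if total else 0.0
    return user, system, pct

def workload():
    # Stand-in for a compute-bound scientific kernel.
    s = 0.0
    for i in range(1, 2_000_000):
        s += 1.0 / i
    return s

user, system, pct = cpu_breakdown(workload)
print(f"user: {user:.3f}s  system: {system:.3f}s  ({pct:.1f}% in kernel)")
```

If the kernel share for your code is in the low single digits, as it typically is for compute-bound HPC codes, even a heroic kernel speedup cannot move the total runtime much.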

With newer interconnects like Infiniband, which use user-level drivers and employ kernel bypass to drastically improve latencies (barring the initial setup time), how much performance improvement can you really expect from recompiling your kernel?

Kernel recompilation can also bring cluster management headaches:

  • Deploy the new kernel to every node in the cluster
  • Recompile your kernel every time a new security or performance related patch is released
  • Recompile your hardware drivers to match your new kernel
  • Stability and performance issues of drivers with your choice of compiler optimizations
  • Not knowing what areas of the kernel code are adversely affected by your choice of optimizations
  • And not to forget, some ISVs support their code only on certain kernels. Once you start running your ISV code on a different kernel, goodbye vendor support!

A more practical approach is to look into the application itself and optimize its code, either through good old hand tuning, through performance libraries, or through straightforward compiler optimizations. Beware: if you are dealing with floating point and double precision arithmetic, you should tread carefully with the more aggressive compiler optimizations. Several compilers do not guarantee precision at higher optimization levels.
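To see why, here is a tiny (but representative) illustration: floating point addition is not associative, so an optimizer that is allowed to reorder a sum, as flags like gcc's -ffast-math permit, can change the rounded result:

```python
# Floating point addition is not associative, so an optimizer that
# reorders operations can change the result.
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # one evaluation order
right = a + (b + c)  # a "mathematically equivalent" reordering

print(left)          # 0.6000000000000001
print(right)         # 0.6
print(left == right) # False
```

Harmless here, but in an iterative solver such differences can accumulate or change convergence behavior.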

Use simple techniques like data decomposition, functional decomposition, overlapping computation with communication, and pipelining to improve the efficiency of your available compute resources. These will yield a better return on investment, especially as we move into an increasingly many-core environment.
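As a sketch of the first of those techniques, data decomposition, assume an embarrassingly parallel per-element computation (the `kernel` function here is a made-up stand-in); the input is simply split across a pool of worker processes:

```python
from multiprocessing import Pool

def kernel(x):
    # Stand-in for an expensive per-element computation.
    return x * x

def decompose_and_compute(data, workers=4):
    """Split data across worker processes (data decomposition)."""
    with Pool(processes=workers) as pool:
        # chunksize controls the granularity of the decomposition.
        return pool.map(kernel, data, chunksize=max(1, len(data) // workers))

if __name__ == "__main__":
    print(decompose_and_compute(list(range(8))))  # [0, 1, 4, 9, 16, 25, 36, 49]
```

The same pattern scales from a multi-core workstation to one MPI rank per node; the hard part is choosing a decomposition that balances load and minimizes communication.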

There is a paper on how profile-based optimization of the kernel yielded a significant performance improvement. More on that here.

And results from a recent article on Gentoo essentially show that for most applications and use cases, it does not make much sense to compile and build your own kernel.

Pulling it all together, integrated updates

October 29th, 2009

I know I am behind a little bit on updating this blog.

Today I made significant progress in this regard. I finally integrated all my social networks in one place and linked them all to this blog.

So when I publish a new post, it is automatically pushed to my Facebook, LinkedIn, Twitter, Google Talk, Yahoo and Live accounts. I think that’s enough distribution, isn’t it? :)

Stay tuned for an update on the new storage we announced recently.

Testing automated updates

October 29th, 2009

This is a test.

Building the beast

September 11th, 2009

As detailed in yesterday’s post, here are the first pics of the system we are building … right now just bare metal … and the fun starts very soon.

Building the ultimate media workhorse

September 10th, 2009


A prominent post production company approached us to build a system for post production (obviously!) workloads. The interesting part is what they want in it: a high end capture card that takes in up to four HD or SD video streams, a high end graphics card, and an output card to composite real time video and graphics. If that’s not enough, we are adding a Fibre Channel card for storage access and another HD video capture card.

In a nutshell, we are building a system powerful enough to handle the demands of an HD digital video pipeline (DVP) for real time video processing, compositing and rich media production. Want more? They want it to run a Linux OS and a Windows OS at once. So, let’s wrap this up in a sentence: a dual socket system supporting two HD video capture cards, a high end graphics card, an HD output card and an FC card, running Windows and Linux virtualized with access to all these cards. Can your vendor provide this system and support it?

The most popular vendor in digital media does provide digital media solutions (well, nothing quite like these systems anyway), but at a very high premium. Why? I don’t know. Maybe it’s all the time and energy they spent establishing themselves as the leader in solutions for the media space, solutions that are not much different from other high end computing systems. And when it comes to high end computing, HPC Systems has delivered more complex machines than most can imagine. We have delivered numerous multi-socket (4 and 8 socket) systems to some of the premier federal organizations for a variety of workloads: forest fire simulations, IC design and simulation, virtualization, desk-side supercomputing and financial modeling. We have even delivered a fully integrated cluster combining a Cell processor rack mount system, CUDA cards, accelerators, Opteron processors and Infiniband. There are not many who can deliver designs that complex.

The point of all this is not to brag, but to say that we don’t charge our customers a premium based on their requirements. We don’t charge you more because of your company size or anything else. We do, however, charge for non-trivial software installation or configuration. Given the quality of the systems we deliver (with full software configuration and free phone and email support), we offer unbeatable value.

Well, I will keep you posted on the progress of this project. Keep coming back!

OFED 1.4 stack on RHEL 5.2

August 28th, 2009


I have been working with Infiniband since the first card came out from Topspin; my previous employer was a partner with Topspin for IB products. Before that, I had already worked with high speed interconnects like Myrinet, Scali (Dolphin Wulfkit) and, of course, multiple versions of PARAMNet, among countless others. Many have come and gone, but Infiniband is here to stay.

Even with Cisco dropping out of Infiniband, strong support from QLogic, Voltaire and Mellanox will keep it going for a while. Cisco has no advantage with Infiniband; their core business is Ethernet, and they need to do whatever it takes to keep Ethernet the core interconnect for everything. That makes sense for Cisco, but HPC is not “everything”; it has never been in that category. The requirements of HPC interconnects are very specific: low latency and high bandwidth are the heart and soul. Getting those two in a general purpose network would be nice, but who would pay for something they don’t need?

Coming to the main topic of this post, configuring ConnectX Infiniband on RHEL 5.2 x86_64 with OFED 1.4.

OFED is very well packaged and most of the time does not need additional work for installation. Here is the simple method:

  1. Download OFED
  2. Extract the files (tar -zxvf OFED-x.y.tgz)
  3. Run the install script
  4. For a non-HPC installation, menu choices 2-1 will suffice; for an HPC specific installation, choose 2-2 or 2-3. You are pretty safe choosing 2-3. If you choose 2-2, some Infiniband diagnostic utilities won’t be installed, but you will still end up with HPC specific packages like MPI.
  5. Make a note of the required packages; you can find almost all of them on the RedHat disk. If you are registered with RHN, you can use yum to install them.
  6. At this point, the needed kernel modules (drivers & upper level protocols) should be installed.
  7. The installer will ask if you would like to configure IPoIB (IP over Infiniband). Say Y if you plan to use IPoIB and provide the IP addresses; if not, say N.
  8. Issue a reboot command and after the system reboots, check lsmod for the list of modules currently loaded
  9. You should see a list of kernel modules with names starting with ib_ (ib_cm, ib_core, ib_umad, etc)
  10. At this point, we can safely assume the drivers are loaded and the adapter is working. You can check the status of the installation using the diagnostics included with OFED. More on that below.
  11. You must have a working subnet manager for the Infiniband fabric to function. If you are using a managed switch like the QLogic 9024, it generally includes an embedded fabric management component. If you are using an entry level switch without an embedded subnet manager, or you would like to run your own SM on a host system, you can use the OpenSM (Open Subnet Manager) component bundled with OFED. Start OpenSM using the command  /etc/init.d/opensmd start   NOTE: Until you have a working subnet manager, the adapters will not be able to do any useful work.
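Step 9 can be checked programmatically. A sketch that scans lsmod-style output for ib_ modules (the sample output below is abbreviated and illustrative, not a transcript from a real system):

```python
def loaded_ib_modules(lsmod_output):
    """Return the ib_* kernel modules present in `lsmod` output."""
    modules = []
    for line in lsmod_output.splitlines():
        fields = line.split()
        if fields and fields[0].startswith("ib_"):
            modules.append(fields[0])
    return sorted(modules)

# On a live node you would feed it real output, e.g.
#   loaded_ib_modules(subprocess.check_output(["lsmod"], text=True))
sample = """\
Module                  Size  Used by
ib_cm                  38381  2 ib_ipoib
ib_core                66093  8 ib_cm,ib_umad
ib_umad                17232  0
mlx4_core             120007  1 mlx4_ib
"""
print(loaded_ib_modules(sample))  # ['ib_cm', 'ib_core', 'ib_umad']
```

An empty list after a reboot is a quick signal that the OFED modules did not load.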


OFED comes with some basic diagnostic commands that can be used to test the status of the cards in your system. One of them is ibv_devinfo. This command prints the adapter status and attributes.

[root@localhost ~]# ibv_devinfo
hca_id: mlx4_0
        fw_ver:                         2.3.000
        node_guid:                      0030:48ff:ff95:d928
        sys_image_guid:                 0030:48ff:ff95:d92b
        vendor_id:                      0x02c9
        vendor_part_id:                 25418
        hw_ver:                         0xA0
        board_id:                       SM_1021000001
        phys_port_cnt:                  2
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                2048 (4)
                        active_mtu:             2048 (4)
                        sm_lid:                 1
                        port_lid:               1
                        port_lmc:               0x00

                port:   2
                        state:                  PORT_DOWN (1)
                        max_mtu:                2048 (4)
                        active_mtu:             2048 (4)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00

In the above output, check the port “state”. When you have a working subnet manager, it will show up as PORT_ACTIVE or PORT_UP. Without a working subnet manager, it will show up as PORT_INIT or POLLING.

The state is shown as PORT_DOWN when there is no cable connected to the port.
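The rules of thumb above can be condensed into a tiny helper; a sketch (the state strings are those printed by ibv_devinfo):

```python
def diagnose_port(state):
    """Map an ibv_devinfo port state to a likely cause, following the
    rules of thumb above."""
    state = state.upper()
    if state in ("PORT_ACTIVE", "PORT_UP"):
        return "OK: subnet manager has configured this port"
    if state in ("PORT_INIT", "POLLING"):
        return "Link is up, but no working subnet manager was found"
    if state == "PORT_DOWN":
        return "No cable connected (or link physically down)"
    return "Unknown state: " + state

print(diagnose_port("PORT_ACTIVE"))
print(diagnose_port("PORT_DOWN"))
```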

To list adapters in the system:

[root@localhost ~]# ibv_devices
    device                 node GUID
    ------              ----------------
    mlx4_0              003048ffff95d928

Once you have a working subnet manager and you have at least two ports showing up as “PORT_ACTIVE” on at least two machines, you can test the fabric using a simple pingpong or sendrecv test routines.

Start ibv_rc_pingpong on one machine

Start ibv_rc_pingpong <host name or ip> on another machine, where <host name or ip> identifies the first machine on which the command was started.

If everything is working as it should, you should see the following output:

First host:

[root@localhost x86_64]# ibv_rc_pingpong
  local address:  LID 0x0002, QPN 0x00004a, PSN 0x43da29
  remote address: LID 0x0001, QPN 0x00004a, PSN 0x446364
8192000 bytes in 0.01 seconds = 6202.54 Mbit/sec
1000 iters in 0.01 seconds = 10.57 usec/iter


Second Host:

[root@localhost ~]# ibv_rc_pingpong
  local address:  LID 0x0001, QPN 0x00004a, PSN 0x446364
  remote address: LID 0x0002, QPN 0x00004a, PSN 0x43da29
8192000 bytes in 0.01 seconds = 6172.16 Mbit/sec
1000 iters in 0.01 seconds = 10.62 usec/iter

Depending on the type of card, cable, switch, OS, board chipset and PCI expansion slot you use, your bandwidth and latency will vary significantly. And this is only a functional test and is not a test for best bandwidth and latency.
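You can sanity-check the numbers ibv_rc_pingpong reports yourself: bandwidth is just bytes transferred over elapsed time, and the elapsed time follows from the iteration count and per-iteration latency. A quick sketch using the first host's figures above:

```python
def mbits_per_sec(total_bytes, iters, usec_per_iter):
    """Bandwidth in Mbit/s (1 Mbit = 10**6 bits, as ibv_rc_pingpong reports)."""
    seconds = iters * usec_per_iter / 1e6
    return total_bytes * 8 / seconds / 1e6

# First host above: 8192000 bytes over 1000 iterations at 10.57 usec/iter
print(f"{mbits_per_sec(8192000, 1000, 10.57):.1f} Mbit/sec")
```

This recovers roughly the reported 6202.54 Mbit/sec; the small difference comes from rounding in the printed usec/iter figure.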

Other diagnostic tools:

  1. ibstat – display IB device status such as firmware version, port states, GUIDs, etc. (similar to ibv_devinfo)
  2. ibnetdiscover – discover the IB network topology
  3. ibhosts – show the IB nodes in the topology
  4. ibchecknet – run IB network validation
  5. ibping – ping an IB address
  6. ibdatacounters – summarize IB port data counters

and more …

Performance Tests:

OFED bundles a few programs to test the bandwidth and latency of your Infiniband fabric.

Bandwidth test:

  1. start ib_read_bw on one machine
  2. start ib_read_bw <hostname or ip> on second machine

Latency Test:

  1. start ib_read_lat on one machine
  2. start ib_read_lat <hostname or ip> on second machine

Make sure power management is turned off before you run these tests.

In case of any problems, the first thing to check is the subnet manager, then the ibstat and ibchecknet tools.

AMD delivers OpenCL SDK beta for x86

August 6th, 2009

AMD announced the availability of OpenCL SDK for x86 processor cores.

The first publicly available beta of OpenCL SDK will allow developers to write portable code supporting both x86 processors and compatible GPUs. At release, OpenCL SDK will be delivered as a part of the ATI Stream Software Development Kit.

For the uninitiated, OpenCL is an open programming standard, supported by a number of industry vendors, for writing source code that targets multi-core CPUs and GPU execution units. OpenCL is designed from the ground up to support parallel computing paradigms using both task-based and data-based parallelism.

NVIDIA also has an SDK in the works and we can expect to see NVIDIA’s version very soon, especially after their continued demos at SIGGRAPH. NVIDIA’s SDK will, obviously, support NVIDIA GPUs. Whether it will support x86 cores is yet to be seen. AMD, on the other hand, has an incentive to support both x86 cores and GPUs in its release: it will accelerate adoption of ATI Stream GPUs and Opteron processors. AMD is in a unique position because of its product line (GPU & x86 CPU), whereas NVIDIA has only GPUs and Intel has only CPUs. Let’s hope AMD manages to take advantage of this opportunity.

AMD’s OpenCL demo on AMD Opteron Istanbul is below. The demo runs on a 4 socket AMD Opteron system with six-core Istanbul processors. I can’t wait to try it on our 48 core AMD Opteron system (8 socket) with Istanbul processors.

HPC Systems is now QLogic Infiniband SignatureHPC Partner

August 6th, 2009

We are proud to announce that HPC Systems, Inc. is now QLogic SignatureHPC Partner for Infiniband products.

That means our employees, both sales and technical, are certified in Infiniband technologies. Our sales team will be able to recommend to you the best Infiniband solution for your needs and our technical team will deliver on those commitments.

We have always been ahead in delivering technically sound and well designed Infiniband solutions to our customers in the academic, research, federal government and CFD spaces. Our partnership and certification with QLogic takes us one more step ahead in delivering the best in High Performance Computing to our customers.


Light at the end of the tunnel for AMD with Istanbul

February 26th, 2009

Intel’s Nehalem is looming on the horizon and promises to take the crown in every aspect of enterprise & high performance computing. Intel has been talking about Nehalem for so long that it almost seems like Nehalem is a couple of years old even though the launch has not yet happened. Talk about marketing!

AMD has taken quite a beating since Intel’s new Core micro-architecture appeared in its server processors (from Intel Xeon Woodcrest through Intel Harpertown). AMD Opteron Shanghai managed to hold its ground in the 4P space but lost a lot of market share in the DP space, or in Intel’s latest terminology, the EP (Efficient Platform) space. The MP space is now called EX, or Expandable Platform.

Istanbul, the next Opteron series from AMD, features six cores and a faster HyperTransport interconnect. There has been a lot of official & unofficial news on Istanbul online recently. AMD is maintaining its platform compatibility with Istanbul as well: if you already own a Shanghai or Barcelona based Opteron server, you should be able to upgrade to Istanbul processors with as little effort as a BIOS upgrade. The only board requirement is support for Split Plane Power, also known as the Dual Dynamic Power Management feature. DDPM allows the integrated memory controller and the processor cores to run at different performance levels, enabling better power management.

Initial evaluations of Istanbul suggest that Opteron is still relevant in the High Performance Computing (HPC) & enterprise markets and is a strong competitor to Intel’s Nehalem processors. An Opteron 4P platform (four socket, quad core, 16way) demonstrates up to 40 GB/s of memory bandwidth, compared to the ~25 GB/s of an AMD Shanghai Opteron system. Read the full report from TechReport here. Istanbul Opteron CPUs feature a number of new technologies that make them a strong contender against the Nehalem CPUs, HyperTransport 3.0 being the most notable enhancement, along with HT Assist, AMD’s implementation of a snoop filter.

Following is a brief video about the platform compatibility between Shanghai & Istanbul processors. AMD has been delivering platform compatibility far better than Intel for a long time, and the savings to the end user are tremendous. Talk about upgrading to a newer & better processor in under 10 minutes!

Compare that with the costs involved in replacing an existing server with a brand new server. And consider the price of the DDR3 memory that Nehalem needs. The DDR2-800 that Istanbul uses has pretty much become mainstream, and its pricing should be very competitive compared with DDR3 modules. AMD has let Intel take on the burden of stabilizing DDR3 costs and will probably position a DDR3 enabled CPU at the right time.

So what does AMD have going for it right now with Istanbul:

Bad economy – Businesses & customers are more sensitive to pricing. Istanbul provides the best value for customers who already own a Shanghai or Barcelona based server: in-socket replacement, very low downtime for upgrades, and better performance with just a change of CPU.

DDR2 memory – DDR2 memory is now priced very competitively against DDR3, which brings down the overall cost of the system. Istanbul will use DDR2 instead of the more costly DDR3 memory. This will probably be a repeat of the Fully Buffered DIMM situation: FBDIMMs were costly & hot, degrading the overall value of the Core micro-architecture. DDR3 may do the same to Intel’s Nehalem under the current circumstances.

Tried & Tested – The AMD Opteron has had an integrated memory controller for a long time. Intel’s Nehalem platforms are brand new, with a totally new architecture & components. A new micro-architecture, new memory architecture, new memory technology & other new components like power distribution & supporting chipsets: bringing so many new pieces together makes it hard to get everything right the first time. Customers will probably take a wait & see attitude towards the Nehalem platform rather than go gaga over the latest & greatest.

Here is a brief video talking about DDR2 & DDR3


Here is another video demonstrating a 4 socket AMD Opteron system much like our own A1403 Quad Opteron 16way server and HiperStation 4000.


And stay tuned to see a 48-core server soon!


Integrating a Cell based system into ROCKS

December 11th, 2008

After successfully installing Fedora 9 on a Cell based system (a Mercury 1U dual Cell blade system), we had to integrate it into a ROCKS cluster.

ROCKS sends the appropriate kernel image by looking at the vendor-class-identifier information. The current DHCP configuration file supports only IA64 (EFI), x86_64, x86 and, of course, network switches. Although ROCKS no longer supports IA64 (Itanium), the code is still there.

The first task is to add the Cell system to the ROCKS database. We decided to add the node as a “Remote Management” appliance rather than as a compute node; adding it as a compute node would modify the configuration files for SGE or PBS, and it would always show up with a “down” status. To do this, execute the following command:

insert-ethers --mac <give your mac id here>

When the insert-ethers UI shows up, select “Remote Management” and hit OK. You may also choose to provide your own hostname using the “--hostname” option.

The next task is to identify the vendor class identifier for the Cell system. After a quick test, it was determined that the system had no vendor class identifier. Since we were dealing with only one system, the best option was to match the MAC ID of the system with the following elsif block:

        } elsif ((binary-to-ascii(16,8,":",substring(hardware,1,6))="0:1a:64:e:2a:94")) {
                # Cell blade system
                filename "cellbe.img";
        }

“cellbe.img” is the kernel image for the Cell system. This has to be copied to “/tftpboot/pxelinux/”.

These changes will be lost if dhcpd.conf is overwritten, which happens every time you execute insert-ethers or use

dbreport dhcpd

to overwrite the file.

You could generate a patch file and patch dhcpd.conf as needed, or you could edit


to include the new elsif block every time the file is generated.

If you see your Cell system trying to load


it means your dhcpd.conf file has been overwritten.


Code block to identify the vendor class identifier and other useful information:

               log(info, concat("Debug Information:\t",
               binary-to-ascii(16,8, ":", substring(hardware,1,6)),
               binary-to-ascii(10,8, "-", option dhcp-parameter-request-list),
               pick-first-value(option vendor-class-identifier, "no-identifier")));