Archive for the ‘Operating Systems’ Category

Kernel recompilation & HPC application performance

Friday, October 30th, 2009

Some questions never die; whether recompiling the kernel will improve application performance is one of them. I have heard this question from users in many domains (CFD, seismic, financial, oil & gas, academic, bio-molecular modeling, and more). It always starts the same way.

“I think I should recompile the kernel of my cluster so I can have better performance. What do you think?”

And my answer is always “No”. It does sound logical: you compile your code with the best possible optimizations and you get better performance (in most cases, I should add). Why would the same not apply to the kernel? After all, the kernel is what manages my processes and runs my system. It’s easy to start the debate this way and miss a key aspect.

Here are a few key questions to ask before you start on this (almost always) fruitless exercise:

  • How much time does your (scientific) code actually spend in the kernel?
  • How much of that time is spent doing useful work rather than waiting on something else (our good old friend disk I/O, or interrupt handling)?

With newer interconnects like InfiniBand, which use user-level drivers and employ kernel bypass to drastically improve latencies (barring the initial setup time), how much performance improvement can you really expect from recompiling your kernel?
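A quick first-order check for both questions is to time a representative run and compare user time against sys (kernel) time; a minimal sketch, with dd standing in for your actual application binary:

```shell
# Compare user (application) time with sys (kernel) time for a run.
# dd is just a stand-in here for your actual solver or simulation binary.
( time dd if=/dev/zero of=/dev/null bs=4k count=100000 ) 2>&1 | grep -E '^(real|user|sys)'
```

If sys is only a small fraction of real, there is little for a recompiled kernel to win back; a profiler such as oprofile can then tell you where the kernel time, if any, actually goes.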

Kernel recompilation can also bring cluster management headaches:

  • Deploy the new kernel to every node in the cluster
  • Recompile your kernel every time a new security or performance related patch is released
  • Recompile your hardware drivers to match your new kernel
  • Stability and performance issues of drivers with your choice of compiler optimizations
  • Not knowing what areas of the kernel code are adversely affected by your choice of optimizations
  • And not to forget, some ISVs support their code only on certain kernels. Once you start running your ISV code on a different kernel, goodbye vendor support!

A more practical approach is to look into the application itself and optimize its code, whether through good old hand tuning, performance libraries, or straightforward compiler optimizations. If you are dealing with floating-point and double-precision arithmetic, tread carefully with aggressive compiler optimizations: several compilers do not guarantee precision at higher optimization levels.

Simple techniques like data decomposition, functional decomposition, overlapping computation & communication, and pipelining improve the efficiency of your available compute resources. They will yield a better return on investment, especially as we move into an increasingly many-core environment.

There is a paper on how profile-based optimization of the kernel yielded a significant performance improvement. More on that here.

And results from a recent article on Gentoo essentially show that for most applications and usage cases, it does not make much sense to compile and build your own kernel.

Building the ultimate media workhorse

Thursday, September 10th, 2009


A prominent post production company approached us to build a system for post production (obviously!) workloads. The interesting part: they want a high end card to capture up to four HD or SD video streams, a high end video card, and an output card to composite real time video and graphics. If that’s not enough, we are adding a Fibre Channel card for storage access and another HD video capture card.

In a nutshell, we are building a system powerful enough to handle the demands of an HD digital video pipeline (DVP) for real time video processing, compositing and rich media production. Want more? They want it to run a Linux OS and a Windows OS at once. So, let’s try to wrap this up in a sentence: a dual socket system supporting two HD video capture cards, a high end graphics card, an HD output card, and an FC card, running Windows and Linux virtualized with access to all these cards. Can your vendor provide this system and support it?

The most popular vendor in digital media does provide digital media solutions (well, not quite like these systems), but at a very high premium. Why? I don’t know. Maybe they spent all that time and energy reaching out and establishing themselves as the leader in developing solutions for the media space, solutions which are not much different from other high end computing systems. And when it comes to high end computing, HPC Systems has delivered more complex machines than most can imagine. We have delivered numerous multi-socket (4 and 8 socket) systems to some of the premier federal organizations for a variety of workloads: forest fire simulations, IC design and simulation, virtualization, desk-side supercomputing, and financial modeling. We have even delivered a fully integrated cluster combining a Cell processor rack mount system, CUDA cards, accelerators, Opteron processors, and InfiniBand. There are not many who can deliver designs that complex.

The point of all this is not bragging, but to say that we don’t charge our customers a premium based on their requirements. We don’t charge you more because of your company size or anything else. We do, however, charge for non-trivial software installation or configuration. Compared to the quality of the systems we deliver (with full software configuration and free phone and email support), the value is unbeatable.

Well, I will keep you posted on the progress of this project. Keep coming back!

Installing Fedora Core 9 and Cell SDK 3.1 on Cell Blade

Wednesday, December 3rd, 2008

We recently had a customer requesting a Cell Blade system to be integrated into their InfiniBand cluster. Since they were looking at having only one node, we suggested the 1U dual Cell based system. I am going to explain here the process of installing Fedora Core 9 on this system; it should also apply to other RedHat based distributions.

If you are considering purchasing the 1U Dual Cell Based System from Mercury Systems, please note that they have humongous lead times: for the system we purchased, the lead time was about 16 weeks. Another important aspect is that this system comes with just two Cell processors and memory on board. Nothing else. No hard disk, no PCI slots. On-board video is present but not supported. If you are going to use any add-on cards, you will have to order the PCI expansion slots along with the system. To use disk storage, you will have to order a SAS disk and the PCI riser card with the system as well. This is something we overlooked; hopefully this will help someone else when purchasing from Mercury Systems.

Turning on your system: The system cannot be accessed via regular KVM connections. Connect the provided serial cable to a standalone PC and use a utility like HyperTerminal or minicom to access the system console.

  • Start HyperTerminal or minicom and open the serial port connection.
  • Switch on the system.
  • You will see a lot of text go by. Press “s” a number of times to enter the firmware prompt; the system boots from the network by default.
  • Once the firmware prompt appears, you can choose which device to boot from.
  • Example: “boot net” to boot from the network.
  • Two hotkeys, F1 and F2, are available for entering the management system (BIOS).

System Installation: The Cell system (Mercury Systems 1U Dual Cell Blade Based System or IBM QS22) cannot boot from a disk; it can boot only from the network. This is a big inconvenience because neither FC9 nor RHEL 5.2 supports NFS-based (nfsroot) installs. It becomes a chicken & egg problem: the Cell system can boot only from the network, but the OS does not support an NFS root install. YellowDog Linux 6.1 from Terrasoft (now Fixstars) advertises fast NFS root install support, and there is a nice installation tutorial available for YDL here. The guide does not mention that the NFS root install is available only in the commercial version. After a good number of hours wasted trying an NFS root install with YDL, I gave up on it.

IBM Support has a nice page on how to install Fedora / RedHat based distributions on QS21 / QS22 using a USB disk.
Using the IBM Support page and a USB disk, I was able to finally get the system running. Here is the procedure for Fedora Core 9 PPC:

  • You will need a TFTP / DHCP server to install or a USB DVD ROM drive. Instructions on setting up TFTP / DHCP server can be found here.
  • Copy /images/netboot/ppc64.img to the TFTP root directory. This is the kernel the system boots when using the TFTP/DHCP setup. If you are using a DVD drive, just boot from the DVD, but check the boot order first: by default, the network is the first boot device. You can force the boot device from the firmware prompt (press “s” while the system is booting) using the “boot” command.
  • Get a nice USB hard disk. According to the IBM Support page, only the IBM 80 GB USB and Lenovo 120 GB USB disks are supported. I am using a Western Digital 320 GB USB disk (My Book). I did face some issues with it, though nothing serious; the workaround is described below.
  • At the firmware prompt, use “boot net vnc” to boot the system over the network.
  • Answer the installer prompts until the GUI starts.
  • Now use a VNC client to connect to the installer, using the IP address the installer provides.
  • When using a large USB disk (80 GB+), the installer will exit abnormally immediately after you click “next” on the GUI welcome screen. If you do want to use a large disk, the workaround is to disconnect the USB disk before clicking “next” on the welcome screen, and reconnect it as soon as the next screen shows up.
  • Do the install as you would any other RedHat/CentOS/Fedora Core install. A nice guide is available here.
  • When the installer finishes, do not click “Reboot”.
  • Now go back to the serial console and use the following commands:
    • umount /mnt/sysimage/sys
    • umount /mnt/sysimage/proc
    • chroot /mnt/sysimage
    • source /etc/profile
    • mount /sys
    • mount /proc
    • Disable SELinux: open /etc/selinux/config and change “SELINUX=enforcing” to “SELINUX=disabled”
    • Make sure your network card is set to use DHCP before going forward. If you have set up a static IP, temporarily change the configuration to use DHCP. This can be done by moving the configuration file: mv /etc/sysconfig/network-scripts/ifcfg-eth0 /etc/sysconfig/network-scripts/ifcfg-eth0.bak
    • Generate a new zImage to boot the kernel ramdisk from the network.
      • /sbin/mkinitrd --with=tg3 --with=nfs --net-dev=eth0 /boot/initrd-2.6.25-14.fc9-net.ppc64.img 2.6.25-14.fc9.ppc64
      • At this time, if you had static IP and moved the configuration file, move the file back: mv /etc/sysconfig/network-scripts/ifcfg-eth0.bak /etc/sysconfig/network-scripts/ifcfg-eth0
      • wrapper -i /boot/initrd-2.6.25-14.fc9-net.ppc64.img -o zImage.initrd-2.6.25-14.fc9-net.ppc64.img /boot/vmlinuz-2.6.25-14.fc9.ppc64.img
    • Now copy the generated zImage to the TFTP root directory using scp or by copying it to a USB disk.
    • Exit the chroot environment
      • umount /sys
      • umount /proc
      • exit
  • Now go back to the installer GUI and click on “Reboot”

This concludes the installation. Make sure you copy the generated zImage to the TFTP root directory so that this image is provided to the system when it boots after the installation.
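For completeness, the DHCP side of the netboot on the install server looks roughly like this; the MAC address and IP addresses below are placeholders, and the filename must match the zImage you copied into the TFTP root:

```
# /etc/dhcpd.conf fragment (sketch) for netbooting the Cell blade
host cellblade {
  hardware ethernet 00:11:22:33:44:55;   # placeholder: the blade's MAC
  fixed-address 192.168.1.50;            # placeholder: IP for the blade
  next-server 192.168.1.1;               # placeholder: the TFTP server
  filename "zImage.initrd-2.6.25-14.fc9-net.ppc64.img";
}
```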

Post Install Configuration:
Boot the system with the new zImage. The system will boot using the attached USB disk. You will be able to look at the boot process from the serial console. Now login as root.

  • The first step is to install a Cell BE optimized kernel.
  • Download the kernel from BSC site: wget
  • Install the kernel: rpm -ivh --force kernel-
  • Add “--nodeps” to the command above if it does not successfully install the kernel.
  • Now generate a new zImage as per the above instructions, using the newly installed initrd and vmlinuz.
  • Copy this zImage to the TFTP root directory, overwriting the old zImage generated with the FC 9 kernel (2.6.25-14.fc9).
  • Reboot to boot into the new kernel.

SDK Installation & Executing Demo code:
SDK installation is pretty straightforward.

  • Download the SDK v3.1 from IBM.
  • Instructions on SDK installation are available here from IBM. The only thing to look out for: tcl must be installed before the SDK installer (yum install tcl). Then install the SDK installer: rpm -ivh cell-install-3.1.0-0.0.noarch.rpm
  • Important note: follow the instructions on the IBM site to add exclude directives to YUM, to prevent YUM from overwriting packages optimized for the Cell BE.
  • Compiling the demo code is also simple; use the provided makefiles.
  • Before executing any demo code, it is advisable to configure and mount a hugeTLBfs file system.
  • To maximize performance, large data sets should be allocated from hugeTLBfs. This filesystem provides a mechanism for allocating 16 MB memory pages. To check the size and number of available pages, examine /proc/meminfo. If hugeTLBfs is configured and available, /proc/meminfo will have entries such as:
  • HugePages_Total:    24
    HugePages_Free:     24
    HugePages_Rsvd:      0
    HugePages_Surp:      0
    Hugepagesize:    16384 kB

  • If your system has not been configured with a hugetlbfs, perform the following:
    mkdir -p /huge
    mount -t hugetlbfs nodev /huge
    echo # > /proc/sys/vm/nr_hugepages
    where # is the number of huge pages you want allocated to the hugetlbfs.
  • If you experience difficulty configuring adequate huge pages, memory may be fragmented and a reboot may be required.
  • This sequence can also be added to a startup script, such as /etc/rc.d/rc.sysinit, so the hugeTLB filesystem is configured during system boot.
  • A test run of Matrix Multiplication code at /opt/cell/sdk/src/demos/matrix_mul is as follows:
  • [root@cellbe matrix_mul]# ./matrix_mul -i 3 -m 192 -s 8 -v 64 -n -o 4 -p
    Initializing Arrays … done
    Running test … done
    Verifying 64 entries … PASSED
    Performance Statistics:
    number of SPEs     = 8
    execution time     = 0.00 seconds
    computation rate   = 91.66 GFlops/sec
    data transfer rate = 6.70 GBytes/sec

Formatting large volumes with ext3

Friday, November 7th, 2008

In RedHat 5.1, the maximum ext3 file system size increased from 8 TB to 16 TB. However, getting mkfs to format a volume larger than 2 TB is not straightforward.

We ship large volumes to customers regularly. We recommend that customers use XFS on large volumes for performance and size reasons. However, sometimes customers want ext3 because of their familiarity with the file system.

Before you can format a volume, you must be able to create a partition larger than 2 TB; fdisk cannot do this.

You will need to use GNU Parted (parted) to create partitions larger than 2 TB. Details on how to use parted can be found here and here.

A simple example of using parted follows; we assume we are working on /dev/sdb, a 10 TB volume from a RAID controller.

$> parted /dev/sdb

GNU Parted 1.8.9
Using /dev/sdb
Welcome to GNU Parted! Type 'help' to view a list of commands.

(parted) mklabel gpt
(parted) mkpart primary ext3 0 10737418240
(parted) print
(parted) quit

A straightforward mkfs command on any volume larger than 2 TB will yield the following error:

mkfs.ext3: Filesystem too large.  No more than 2**31-1 blocks
(8TB using a blocksize of 4k) are currently supported.
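The numbers in that message follow directly from ext3’s 32-bit block addressing: the ceiling is (2^31 − 1) blocks times the block size, which a quick shell calculation confirms:

```shell
# ext3 addresses at most 2^31 - 1 blocks, so the maximum filesystem
# size scales with the block size chosen at mkfs time.
for bs in 1024 2048 4096; do
  echo "block size ${bs}: max $(( (2**31 - 1) * bs / 2**30 )) GB"
done
# prints 2047 GB (~2 TB), 4095 GB (~4 TB) and 8191 GB (~8 TB)
```

This is also why the workaround below passes -b 4096: a smaller block size lowers the ceiling accordingly.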

A simple workaround is to force mkfs to format the device in spite of the size:

mkfs.ext3 -F -b 4096 /dev/<block device>

mkfs.ext3 -F -b 4096 /dev/<path to logical volume> if you are using LVM

To use the above command you need e2fsprogs 1.39 or later. The command also sets the block size to 4 KB.

You can also pass -m 0 to set the reserved blocks to zero.
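If you want to exercise this invocation before pointing it at real hardware, mkfs.ext3 will happily format an ordinary (sparse) file when forced with -F; this sketch assumes e2fsprogs is installed and uses a throwaway file in /tmp:

```shell
# Create a 1 GB sparse file and format it much as we would the array:
# -F forces mkfs past the "not a block device" check, -b 4096 sets
# 4 KB blocks, -m 0 reserves no blocks for root.
dd if=/dev/zero of=/tmp/ext3-test.img bs=1 count=0 seek=1G
mkfs.ext3 -F -b 4096 -m 0 /tmp/ext3-test.img
```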

Note that ext3 is not recommended for large volumes. XFS is better suited for that purpose.

Further reading:

RedHat Knowledgebase  Article



Compiling BLACS with OpenMPI and GCC on RHEL 5 / CentOS 5

Wednesday, March 12th, 2008

I had some problems compiling BLACS with OpenMPI and GCC on RHEL 5 / CentOS 5. So, here is how I got it to compile and pass the tests successfully:

OpenMPI: 1.2.5

BLACS: 1.1 with MPIBLACS Patch 03 (Feb 24, 2000)

GCC: 4.1.2

F77 = gfortran

F90 = gfortran

CC = gcc

CXX = g++

Bmake file used: BMAKES/Bmake.MPI-LINUX

Changes made to Bmake:


#  -------------------------------------
#  Name and location of the MPI library.
#  -------------------------------------
MPIdir = /home/test/openmpi-install/
MPIINCdir = $(MPIdir)/include






#=========================== SECTION 3: COMPILERS ============================
#  The following macros specify compilers, linker/loaders, the archiver,
#  and their options.  Some of the fortran files need to be compiled with no
#  optimization.  This is the F77NO_OPTFLAG.  The usage of the remaining
#  macros should be obvious from the names.
F77            = $(MPIdir)/bin/mpif77
F77FLAGS       = $(F77NO_OPTFLAGS) -O3 -mtune=amdfam10 -march=amdfam10
F77LOADER      = $(F77)
CC             = $(MPIdir)/bin/mpicc
CCFLAGS        = -O3 -mtune=amdfam10 -march=amdfam10
CCLOADER       = $(CC)
Of special importance are the INTFACE and TRANSCOMM flags.
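For the record, the values that worked for me with gfortran and OpenMPI were along the following lines; the exact TRANSCOMM value depends on your MPI library, so treat this as a starting point rather than a definitive recipe:

```make
#  gfortran appends a single underscore to external symbol names,
#  which corresponds to BLACS's Add_ interface
INTFACE   = -DAdd_
#  OpenMPI provides the MPI-2 C<->Fortran communicator conversion
#  functions, so the MPI-2 translation method can be used
TRANSCOMM = -DUseMpi2
```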
If INTFACE is not set correctly, make tester will fail with following messages:

blacstest.o(.text+0x4c): In function `MAIN__':
: undefined reference to `blacs_pinfo_'
blacstest.o(.text+0x6e): In function `MAIN__':
: undefined reference to `blacs_get_'
blacstest.o(.text+0x8b): In function `MAIN__':
: undefined reference to `blacs_gridinit_'
blacstest.o(.text+0x94): In function `MAIN__':

More such errors follow.

If TRANSCOMM is not set correctly, make tester will complete successfully and you will be able to execute the C interface tests successfully as well. When executing the FORTRAN interface tests, however, the following messages appear:

BLACS WARNING 'No need to set message ID range due to MPI communicator.'
from {-1,-1}, pnum=1, Contxt=-1, on line 18 of file 'blacs_set_.c'.
BLACS WARNING 'No need to set message ID range due to MPI communicator.'
from {-1,-1}, pnum=3, Contxt=-1, on line 18 of file 'blacs_set_.c'.
BLACS WARNING 'No need to set message ID range due to MPI communicator.'
from {-1,-1}, pnum=0, Contxt=-1, on line 18 of file 'blacs_set_.c'.
BLACS WARNING 'No need to set message ID range due to MPI communicator.'
from {-1,-1}, pnum=2, Contxt=-1, on line 18 of file 'blacs_set_.c'.
[comp-pvfs-0-7.local:30119] *** An error occurred in MPI_Comm_group
[comp-pvfs-0-7.local:30118] *** An error occurred in MPI_Comm_group
[comp-pvfs-0-7.local:30118] *** on communicator MPI_COMM_WORLD
[comp-pvfs-0-7.local:30118] *** MPI_ERR_COMM: invalid communicator
[comp-pvfs-0-7.local:30119] *** on communicator MPI_COMM_WORLD
[comp-pvfs-0-7.local:30119] *** MPI_ERR_COMM: invalid communicator
[comp-pvfs-0-7.local:30119] *** MPI_ERRORS_ARE_FATAL (goodbye) 

Hyper-V (Windows Server 2008 x64) on 32 cores

Friday, January 25th, 2008

In the previous post, we tried Hyper-V with only 16 cores, as per the release notes. Now we have added another 16 AMD Opteron cores to the same system, this time to test x64 Windows Server 2008 itself on 32 cores rather than Hyper-V. We already did this for the x86 version here.

The system did boot up just fine. Here is a screen shot.

Windows Server 2008 x64 on 32 AMD Opteron cores

With that taken care of, we quickly browsed through the event logs to see whether the Hyper-V service / hypervisor had failed to start, as the release notes suggest it would. There was no such message. The only way to test whether the hypervisor has started is to fire up the Server Manager and try to boot the virtual machines.

We were pleased to see that the hypervisor indeed started and there was no problem booting up the virtual machines. And here is a screenshot.

Hyper-V with 32 AMD Opteron cores

This opens up a wide range of use cases. With appropriate capacity planning, the entire data center of a small company could be replaced with one 5U server, or two for a highly available setup.



Hyper-V on 16 core AMD Opteron system (A5808-32)

Thursday, January 24th, 2008

After a successful install and test of x86 Windows Server 2008, it was time to put the x64 version through the same test.

We will talk about the x64 installation and experiences in another post. In this post, we will focus on the Hyper-V installation, configuration, and experiences on our 16 core AMD Opteron server, the A5808-32.

Hyper-V is the new Microsoft hypervisor technology included with certain SKUs of the x64 version of Windows Server 2008. Hyper-V, as included in Windows Server 2008 RC1, is still in beta.


Hyper-V is a new server role. This role has to be added after Windows Server 2008 installs and boots up; adding it installs the hypervisor and reboots the server.

Hyper-V installation

Now a new role shows up under “Roles” in the “Server Manager”: “Hyper-V”, under the new category “Microsoft Hyper-V Servers”. Your server will show up under this category. In the future we may see categories beyond just “Microsoft Hyper-V Servers”; this could be a placeholder for the future management framework talked about here.

Creating a new VM is pretty straightforward. Select New->Virtual Machine in the Server Manager and follow the wizard. Here is a screen shot of a Fedora Core 6 x86_64 installation. Look, Ma … Linux on Windows!

New VM - Hyper-V

New VM boot up - Hyper-V

FedoraCore 6 on Hyper-V

Next up, Windows XP. Here is a screenshot of Windows XP install and FC6 & Windows XP VMs active on the Hyper-V server.

Windows XP installation on Hyper-V

FC6 and Windows XP on Hyper-V

Overall experience:

It works well. We did not face any major issues with the installation of Hyper-V or of the VMs. The management capabilities are polished. High Availability (HA) is implemented as part of the Windows HA services.

There was no problem with mouse or keyboard inputs. No sticky mouse issues.

Certain Linux distros need the noacpi flag to boot as VMs under Hyper-V.

The Microsoft integration services CD-ROM was not recognized by FC6; it logs “this disc does not contain any tracks I recognize.” In any case, the integration services are supported only on Windows Server 2003 SP2 and Windows Server 2008.

Installing both Windows XP and FC6 took much longer than on VMware or on a physical system.

For a beta release, Hyper-V is surprisingly usable and comes fully integrated with Windows Server 2008.

As per release notes, Hyper-V does not support more than 16 cores. So we configured our 32-core system with only 16 cores.

Future Work:

Test HA services

Capacity Planning

Quick Migration


We used a Remote Desktop connection to connect to the server and manage it. Once the mouse is captured inside a VM, there is no way to release it without going to the console of the Windows server and releasing it from there. We also could not use the mouse inside FC6 from Remote Desktop, and could not find the key sequence to send Ctrl+Alt+Left to the remote machine over Remote Desktop. Hopefully, once a standalone application like VI3 is released for the Windows hypervisor, it will be much easier to manage.


If you are looking for detailed installation instructions, this post is useful.

Windows Server 2008 Release Candidate 1 on 32 cores

Wednesday, January 23rd, 2008

Windows Server 2008 hit the RC1 milestone recently. Windows Server is a popular choice on our 8 socket Opteron server, the A5808-32, for a number of customers. RC1 is a good time for small vendors to test the compatibility of the OS with their servers and storage.

Here is a screen shot of Windows Server 2008 RC1 with 32 cores – 8 sockets, quad core AMD Opteron. This is the x86 version with full install.

Windows Server 2008 RC 1 32 cores

Notice in the screen shot the report from the CPU-Z utility: 8 processors, 4 cores each, and the processor model. The Windows version is displayed in the winver dialog, and on the far left is the list of processors from the Windows Device Manager.

Installation experiences:

  • First install hung at 60%; a cold restart was required.
  • No video driver; standard VGA only. Well, you wouldn’t want the Aero interface on your server anyway. We did not try the Windows 2003 version of the driver, as the display was very usable.
  • All NIC’s successfully detected
  • On-board SATA successfully detected
  • 32-cores (8 sockets) successfully detected
  • System was configured with only 16GB instead of the maximum 256GB. We do not see any issues for use with 256GB
  • iSCSI initiator successfully mounted a remote volume
  • Windows “System” control panel applet does not display processor or memory details.

Now to try the x64 and Hyper-V versions. Stay tuned.