Monday, April 23, 2007

Homegrown high-performance computing

Once the domain of monolithic, multimillion-dollar supercomputers from Cray and IBM, HPC (high-performance computing) is now firmly within reach of today's enterprise, thanks to the affordable computing power of clustered standards-based Linux and Microsoft servers running commodity Intel Xeon and AMD Opteron processors. Many early movers are in fact already capitalizing on in-house HPC, assembling and managing small-scale clusters on their own.

Yet building the hardware and software for an HPC environment remains a complex, highly specialized undertaking. As such, few organizations have heeded the call beyond university engineering and research departments and specialized vertical markets such as oil and gas exploration, bioscience, and financial research. These pioneers, however, no longer borrowing time on others' massive HPC architectures, are fast proving the potential of small-scale, do-it-yourself clustering in enterprise settings. And as the case is made for few-node clusters, expect organizations beyond these niches to begin tapping the competitive edge of in-house HPC.

The four case studies assembled here illustrate the pain and complexity of building a successful HPC environment, including the sensitive hardware and software dependencies that affect performance and reliability, as well as the painstaking work that goes into parallelizing serial apps to work successfully in a clustered environment.

Worth noting is that, although specialized high-performance, low-latency interconnects such as Myrinet, InfiniBand, and Quadrics are often touted as de facto solutions for interprocess HPC communications, three of the four organizations profiled found commodity Gigabit Ethernet adequate for their purposes -- and much less expensive. One, in fact, took every measure possible to avoid message passing and cutting-edge interconnects in order to enhance reliability.

New to the HPC market, Microsoft Windows Compute Cluster Server 2003 proved appealing to two organizations looking to integrate their HPC cluster into an existing Microsoft environment. So far, results have been positive.

Finally, one organization found that delegating much of the hardware and software configuration to a specialized HPC hardware vendor/integrator made the whole process considerably easier.

BAE Systems tests and tests some more

When it comes to delivering advanced defense and aerospace systems, the argument in favor of developing an in-house HPC cluster is overwhelming. Perhaps it's not surprising then to find that the technology and engineering services group at BAE Systems already has a fair amount of experience constructing HPC clusters from HP Alpha and Opteron-based Linux servers. Integrating previous HPC systems into the U.K.-based global defense company's enterprise, however, has proved costly.

"We've found the TCO implications of maintaining two or more disparate systems -- such as Windows, Linux -- and Unix, to be too high, particularly in terms of support people," says Jamil Appa, group leader of technology at engineering services at BAE. "We're looking to provide a technical computing environment that integrates easily with the rest of our IT environment, including systems like Active Directory."

The group is currently assessing two Microsoft Windows Compute Cluster Server 2003 clusters, both of which have been in testing for several months now. Tools built on the Windows Workflow Foundation and Windows Communication Foundation components of Microsoft .Net 3.0 have given BAE engineers an efficient workflow environment in which they can collaborate during the design process and access the relevant parts of the systems through their own customized views, with tools tailored to their tasks. One test bed is a six-node cluster of HP ProLiant dual-core, dual-processor Opteron-based servers; the other is a 12-node mix of Opteron- and Woodcrest-based servers from Supermicro.

If there's anything that BAE has learned from its testing, it's that little changes can have big performance implications.

"We're running our clusters with a whole variety of interconnects, including Gigabit Ethernet, Quadric, and a Voltaire InfiniBand switch," Appa says. "We've also been running both Microsoft and HP versions of MPI [Message Passing Interface]. We've found that all these elements have different sweet spots and behave differently depending on the application." In the long run, this testing will enable the technology and engineering services group to provide other BAE business units looking to implement HPC with their own personal HPC "shopping lists."

As for interconnects, "depending on the application, the size of your cluster (preferably small), and the types of switches you use, Gigabit Ethernet really isn't that bad," Appa says. His group has been using Gigabit switches from HP, which "for our purposes, are very good."

Appa has also tested several compilers, and he cautions not to skimp on these tools: "A US$100 compiler might make your code run 20 percent slower than a top-end compiler, so you end up having to pay for a machine that is 20 percent larger. Which is more expensive?"
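
Appa's compiler arithmetic is easy to check. Using hypothetical figures (neither number is BAE's), the trade-off looks like this:

```python
# Back-of-the-envelope compiler-vs-hardware trade-off (hypothetical figures).
cluster_cost = 100_000    # assumed cost of a cluster sized for well-optimized code
slowdown = 0.20           # the cheap compiler makes the code 20 percent slower
cheap_compiler = 100
good_compiler = 3_000     # assumed price of a top-end commercial compiler

# Code that runs 20 percent slower needs roughly 20 percent more hardware
# to deliver the same throughput.
extra_hardware = cluster_cost * slowdown

print(f"Extra hardware needed with the $100 compiler: ${extra_hardware:,.0f}")
print(f"Premium paid for the better compiler:         ${good_compiler - cheap_compiler:,.0f}")
# Under these assumptions, skimping on the compiler costs several times more overall.
```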

Each of Appa's configurations sits on three networks: one for message passing, one for accessing the file system, and one for management and submitting jobs. To access NAS, Appa uses iSCSI over Gigabit Ethernet, rather than FC (Fibre Channel), and has a high-performance parallel file system consisting of open source Lustre object storage technology. Why? "As clusters get larger and you have more cores running processes that are all reading one file on your file system, your file system really needs to scale or you'll be in trouble," Appa explains.
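
The scaling worry is easy to quantify with a rough model: aggregate read demand grows with the core count, while a single storage head serves a fixed link, which is where a striped parallel file system such as Lustre earns its keep. The figures below are hypothetical, not BAE's:

```python
# Rough model of why a single storage head stops scaling (hypothetical figures).
per_core_mb_s = 20                   # assumed sustained read demand per core, in MB/s
single_head_mb_s = 110               # roughly one Gigabit Ethernet link into one NAS head
lustre_servers = 8                   # assumed number of Lustre object storage servers
lustre_mb_s = lustre_servers * 110   # aggregate bandwidth with reads striped across them

for cores in (8, 64, 512):
    demand = cores * per_core_mb_s
    single = "OK" if demand <= single_head_mb_s else "saturated"
    striped = "OK" if demand <= lustre_mb_s else "saturated"
    print(f"{cores:4d} cores need ~{demand:6d} MB/s | single head: {single:9s} | striped: {striped}")
```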

Meanwhile, Windows Compute Cluster has simplified both cluster management and user training, freeing up staff for the more vital task of optimizing BAE's apps. Although BAE's software is already set up for HPC, Appa believes the whole process of parallelizing existing apps is reaching a turning point. "Our algorithms date back to the '80s and do not make best use of multicore technologies," he says. "We're all going to have to reconsider how we write our algorithms or we'll all suffer."

Although each endeavor to bring HPC in-house will differ based on an enterprise's clustering needs, BAE's Appa has some sage advice for anyone considering the journey.

"You can't assume that somebody will come along with a magic wand and give you the perfect HPC solution," Appa says. "You really need to try everything out, especially if you have in-house code. There's so much variation and change in HPC technology, and so much is code-dependent. You really have to understand the interaction between the hardware and software."

Luckily, those attempting to bring HPC in-house will not be alone. "The HPC community itself is quite small and very open and willing to share valuable information," Appa says.

Appa points out that Fluent has an excellent benchmarking site that demonstrates performance variations among various hardware and software combinations. In his case, the Microsoft Institute for High Performance Computing at the University of Southampton provided sound advice on what hardware worked and what didn't, particularly during the beta phase.

Virginia Tech starts from scratch

At Virginia Tech's Advanced Research Institute (ARI), constructing an HPC cluster for cancer research has been an educational experience for the electrical and computer engineering grad students involved.

With little prior HPC experience, the students built a 16-node cluster and parallelized apps they had written in MATLAB, a numerical programming environment, over the course of several months. The project taps huge amounts of data acquired from biologists and physicians to perform molecular profiling of cancer patients. The students are also working on vehicle-related data for transportation projects.

Rather than make every aspect a learning experience, when it came to choosing an HPC platform, the students and professors decided to stick with what they already knew: Microsoft Windows.

"Our students had already been running MATLAB and all their other programs on Windows," says Dr. Saifur Rahman, director of ARI. "We didn't want to have to retrain them on Linux." As was the case at BAE Systems, there were also obvious advantages to a cluster that could integrate easily with the rest of ARI's Windows infrastructure, including Active Directory.

Microsoft had already approached Virginia Tech to be an early adopter of Windows Compute Cluster Server 2003, so Dr. Rahman and his team said yes and started looking for the right hardware. They vetted several vendors, but when they found out Microsoft was performing its own testing on Hewlett-Packard servers, they decided to go with HP. "We knew we'd need help from Microsoft to fix various bugs," says Dr. Rahman, "and since all their experience was on HP servers, we felt we'd have the most success with HP."

So with help from Microsoft and HP, ARI installed 16 HP ProLiant DL145 servers with dual-core 2.0GHz AMD Opteron 270 processors and 1GB of RAM each. On the same rack, ARI installed 1TB of HP FC storage. The rack also includes one head node, as well as an HP ProLiant DL385 G1 server with two dual-core 2.4GHz AMD64 processors and 4GB of RAM.

As did BAE Systems, ARI decided to stick with Gigabit Ethernet for its cluster interconnect, mainly because it was what the team knew. "There are other interconnects that are faster, but we've found that Gigabit Ethernet is pretty robust and works fine for our purposes," Dr. Rahman says. And after some servers overheated, ARI placed the entire cluster in a 55-degree Fahrenheit chilled server room.

ARI found parallelizing the MATLAB apps to be a significant challenge requiring a number of iterations. "The students would work on parallelizing the algorithms, then run case studies to verify that the results they were getting with the clustered applications were similar to the results they got when they ran on one machine," Dr. Rahman says.

At first, the results weren't coinciding, and the students had to learn more about how to parallelize effectively and clean up what they had already coded. "We missed some important relationships at first," Dr. Rahman says. With some help from MATLAB, it took two graduate students about a month to get the app parallelization right.
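
That verify-against-the-serial-run loop is a common pattern when parallelizing numerical code. The sketch below shows the general idea in Python; it is illustrative only, since ARI's actual work was done in MATLAB with its parallel toolboxes, and the profiling function here is a placeholder:

```python
# "Split, compute in parallel, then check against the serial baseline" pattern.
# Illustrative sketch only; the profiling calculation is a stand-in.
from multiprocessing import Pool
import numpy as np

def profile_chunk(samples):
    """Placeholder for the per-sample molecular profiling calculation."""
    return np.array([s.sum() for s in samples])

def serial_run(data):
    return profile_chunk(data)

def parallel_run(data, workers=4):
    chunks = np.array_split(data, workers)   # independent slices, no shared state
    with Pool(workers) as pool:
        results = pool.map(profile_chunk, chunks)
    return np.concatenate(results)

if __name__ == "__main__":
    data = np.random.rand(1000, 50)          # fake dataset standing in for patient data
    serial = serial_run(data)
    parallel = parallel_run(data)
    # The case-study check: clustered results should match the single-machine run
    # to within numerical tolerance.
    assert np.allclose(serial, parallel), "parallel results drifted from the serial baseline"
    print("parallel results match the serial baseline")
```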

Dr. Rahman feels that the team's diverse expertise was a large factor in the project's success. One of the grad students had deep knowledge of molecular-level data quality, biomarkers, and the relevance of different data types; another offered a lot of hardware expertise; and the IT person had much experience interacting with vendors effectively. MATLAB provided help in determining which toolboxes were relevant to the task.

"When we went to MATLAB, they were just getting started with HPC," Dr. Rahman says. "I hope they will start to pay more attention, as it would be nice if they were all ready so we didn't have to spend months on this."

There were also hardware communications glitches.

"At first we had some problems controlling the servers as they talked to each other and the head node," Dr. Rahman says. "Sometimes they wouldn't respond. In other cases we wouldn't see any data coming through." Solving the problem took a lot of reconfiguring and reconnecting. "Perhaps we were giving the wrong commands at first. We're not sure," he adds. There were also problems with incorrect server and software license manager configurations.

Dr. Rahman says that managing the cluster has been relatively trouble-free with Windows Compute Cluster Server 2003. He adds that if he could do it all over again, he'd send his students to Microsoft for a longer stretch to learn more of what Microsoft itself has discovered about building clusters with HP servers. The use of HPC has enabled ARI researchers to dive much more deeply into molecular data, not only analyzing differences in relationships among disparate classes of subjects, but also revealing more subtle but important variations within each class.

Uptime counts for Merlin

Whereas most HPC implementations are the province of scientists and engineers hidden away in R&D departments, Merlin Securities' HPC solution interfaces directly with its hedge fund customers. That's why 24/7 uptime and security were key HPC design requirements for Merlin, right alongside performance.

"We had to be extremely risk-averse in designing our cluster and choosing its components," says Mike Mettke, senior database administrator at Merlin.

A small prime brokerage firm serving the hedge fund industry, Merlin must contend with several larger competitors that benefit significantly from economies of scale. Morgan Stanley, Merrill Lynch, and Bear Stearns, for example, run large mainframes that analyze millions of trades at the end of the day and return reports via batch processing the next morning. Merlin stakes its competitive edge on using its HPC cluster to deliver trading information in real time and to let customers slice and dice data multiple ways to uncover valuable insights, such as how an analyst's daily trading performance compares with that of other analysts, other market securities, and numerous market benchmarks. "We focus on helping clients explain not only what happened but why it happened," says CTO Amr Mohamed.

To do this, Merlin built its own highly parallelized analysis tools, which it runs on a high-performance Oracle RAC (Real Application Cluster) installed on a rack of Dell PowerEdge 1850 and 2850 dual-core Xeon servers. Data storage is provided by EMC CLARiiON 2Gbps and 4Gbps FC storage towers. Sitting on top of Oracle are Merlin's HPC task-scheduling software, also created in-house, and an Oracle data mart that serves as a temporary holding ground for frequently used data subsets, much like a cache. Most of the high-speed calculations run directly on the Oracle RAC, which is fronted by a series of BEA WebLogic app servers that take in requests from a set of redundant load balancers sitting behind the company's customer-facing Apache Web servers. Sets of redundant firewalls sit in front of each of the three layers.

Cluster performance is key to running complex calculations in real time, but for Merlin, performance could never come at the expense of enterprise-level reliability, scalability, and 24/7 uptime, requirements that led to several crucial design decisions.

First, tightly coupled parallel processing via message passing was simply out of the question. Instead, Merlin's architects and programmers put tremendous effort into dividing processes in an "embarrassingly parallel" fashion, with no interdependencies at all. This benefits both scalability and reliability: the high-speed, low-latency links required for interprocess communication create scalability bottlenecks, and they demand cutting-edge interconnects such as Myrinet and InfiniBand, which don't have the reliability track record of Gigabit Ethernet.
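
In an "embarrassingly parallel" decomposition, every unit of work runs start to finish without exchanging data with any other worker, so there is nothing for MPI or a low-latency fabric to do. A generic sketch of the pattern (hypothetical, not Merlin's code or data model):

```python
# Embarrassingly parallel decomposition: each task is self-contained, so workers
# never exchange messages. Hypothetical sketch, not Merlin's code or data model.
from concurrent.futures import ProcessPoolExecutor

def analyze_account(account_id, trades):
    """Compute per-account analytics from that account's trades alone.

    The key property: nothing here depends on any other account, so the task
    can run on any node with no interprocess communication at all."""
    gross = sum(t["qty"] * t["price"] for t in trades)
    return account_id, {"trades": len(trades), "gross": gross}

def run_all(trades_by_account, workers=8):
    with ProcessPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(analyze_account, acct, trades)
                   for acct, trades in trades_by_account.items()]
        return dict(f.result() for f in futures)

if __name__ == "__main__":
    demo = {
        "A1": [{"qty": 100, "price": 10.0}, {"qty": -50, "price": 10.5}],
        "A2": [{"qty": 200, "price": 42.0}],
    }
    print(run_all(demo, workers=2))
```

Because each task owns its slice of the data outright, adding nodes adds throughput almost linearly, and losing a worker costs only the tasks it was running.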

"We didn't want some new interconnect driver crashing the system," Mohamed says, adding that straight Gigabit has also helped Merlin achieve considerable cost savings.

Reliability and enterprise-grade support fueled Merlin's decision to stick with an Oracle RAC, which has high-quality fault-tolerant fail-over features; dual-processor Dell PowerEdge servers; high-end EMC CLARiiON FC storage; and F5 load balancers.

"There are lots of funky platforms for HPC out there and high-bandwidth data storage solutions that can pump data at amazing rates," Mettke says. "The problem is that you end up dealing with lots of different vendors, some of whom can't deliver the 24/7 enterprise-level support you need. That adds another element of risk."

Finally, all code was written in Java, C++, and SQL.

"I've been on the other end running code written in Assembler on thousands of nodes," Mettke says. "We want the speed, but not at the expense of system crashes in the middle of a trading day. You can claim you have the best cluster out there, but it doesn't matter if there's no show when it's showtime."

Mettke adds that the architecture of Merlin's HPC infrastructure is constantly evolving to accommodate new data and applications.

Aerion gets HPC help

For organizations looking to get a cluster up and running quickly, enlisting the help of specialized Linux HPC hardware vendors such as Linux Networx and Verari Systems can cut down development time significantly. Not only do these companies sell and configure standard hardware, but they often have the expertise to deliver turnkey configurations with apps installed, tuned, and tested. Such was the case for Aerion, a small aeronautical engineering company that tapped Linux Networx to bring the upside of in-house HPC to its business of developing business jets.

Aerion, which works on the preliminary jet design process, relies on larger aerospace partners for design completion, as well as manufacturing and service. One of the company's projects, an early-stage design for a supersonic business jet, required particularly demanding CFD (computational fluid dynamics) analysis.

"In many commercial subsonic transport projects, you can develop different parts of the jet independently, then put all the pieces together and refine the design," says Aerion research engineer Andres Garzon. "But with supersonic jets, everything is so integrated and interactive that it's really impractical to develop each element apart from the others."

At the time, Aerion had been running commercial CFD software from Fluent on two separate dual-processor 3.06GHz Xeon Linux workstations. This setup worked well for analyzing diverse configurations and components and running Euler equations, which model airflow but leave out some essential fluid properties such as viscosity. "To really be accurate, you need to run the more complex Navier-Stokes calculations, which have many more terms to solve," Garzon says. And achieving the computing performance necessary to tackle that level of complexity meant turning to HPC.
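
The jump Garzon describes shows up directly in the governing equations. In a simplified incompressible, constant-viscosity form (a textbook statement for illustration, not Aerion's specific model), the momentum equation gains a viscous term in moving from Euler to Navier-Stokes:

```latex
% Euler momentum equation (no viscosity):
\frac{\partial \mathbf{u}}{\partial t} + (\mathbf{u}\cdot\nabla)\mathbf{u}
  = -\frac{1}{\rho}\nabla p

% Navier-Stokes adds the viscous term \nu \nabla^2 \mathbf{u}:
\frac{\partial \mathbf{u}}{\partial t} + (\mathbf{u}\cdot\nabla)\mathbf{u}
  = -\frac{1}{\rho}\nabla p + \nu \nabla^2 \mathbf{u}
```

Resolving that extra term, and the thin boundary layers it gives rise to, demands much finer meshes and far more compute than an Euler run of the same geometry.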

Of course, small organizations such as Aerion don't always have the resources on hand to fly solo on HPC -- not to mention the fact that Aerion was also in the process of switching from Fluent to a series of powerful, free tools developed by NASA. So when Garzon stumbled on a Linux Networx booth at an American Institute of Aeronautics and Astronautics meeting three years ago, and the reps he spoke with offered to provide the hardware and much of the integration and testing work for the NASA apps Aerion wanted to use, he took them up on the opportunity to get HPC up and running quickly.

Working with Linux Networx, Aerion configured an 8-node Linux Networx LS-P cluster of dual-processor AMD Opteron 246-based servers with 4GB of RAM per node, plus a ninth server to act as a master node. The NASA code requires a significant amount of complex message passing among parallel processes using MPI, which usually calls for a very high-speed, low-latency interconnect such as InfiniBand or Myrinet. Because Aerion's budget was limited, Linux Networx offered to benchmark the apps with Myrinet, InfiniBand, and Gigabit Ethernet. Although performance under Myrinet and InfiniBand was superior (and roughly equivalent between the two), the overall difference was not dramatic enough to justify the expense. So Linux Networx delivered a Gigabit Ethernet configuration, saving around $10,000, Garzon estimates.

As for storage, it is all local -- rather than SAN-based -- and is managed by the master node, which mirrors the OS and file system to the compute nodes. Thus, data is stored both on the local drives and the master node.

Linux Networx recompiled the NASA code -- which was originally developed to run on SGI machines -- for the Linux cluster. It also set up appropriate flags for the system and fine-tuned the cluster so that Aerion would be operational in a few days. Management is provided by Linux Networx Clusterworx, which monitors availability on the nodes, creates the image and payload for each node, and reprovisions nodes as necessary.

In all, Garzon found the process of bringing HPC in-house with the aid of Linux Networx to be relatively trouble-free and plans to expand the system to run additional cases simultaneously and to reduce compute time on time-sensitive calculations.
