Acceptance Tests: Signed, Sealed and Delivered
August 2008
ARCCA are pleased to announce the successful completion of the acceptance test phase; "Merlin" has now entered the final stage, two months of successful running. Although it took approximately two weeks longer than anticipated, getting such a complex system through a stringent set of tests in a relatively short period of time was a great achievement. It required commitment and dedicated effort from both Bull engineers (France & UK) and ARCCA staff, with regular progress review meetings to ensure problems were quickly acknowledged and resolved. Some of the tests certainly threw up challenges, and it is greatly to the credit of those involved that these were resolved in such a timely fashion.
To underline this achievement, below we provide a summary of the acceptance procedure and a description of the stages the system had to pass prior to being signed off.
If you are considering procuring an HPC system, we are happy to discuss this important stage in more detail. Please contact us for further information.
From Installation to Production: The Key Stages
Detailed performance criteria were agreed as a contractual commitment within the SRIF-3 contract placed with Bull.
In a standard procurement, setting acceptable performance standards for such a system is a reasonably straightforward process. However, as this procurement was held on the cusp of a technology change, there were no comparable quad-core systems against which performance could be gauged. Instead, in collaboration with Bull's benchmarking team, performance extrapolations were made for a set of core codes, and Bull committed to achieving these figures once the system had been installed and deemed operational.
Experience highlighted the need to invest time and effort in properly scoping the acceptance test stages and eliminating bugs before the start of service, thereby ensuring (as far as possible) that the system is stable, robust, reliable and delivering the expected performance. To achieve this, five key stages were defined to provide strenuous tests of the entire solution, tests which the system would need to pass repeatedly before being deemed acceptable by Cardiff.
Acceptance Test Phases
| Stage | Test | Description |
| --- | --- | --- |
| 1 | Performance Validation | Demonstrate agreed performance on a core set of codes / datasets, including the impact of multiple simultaneous interactive compilations. |
| 2 | Workload Validation | Repeated, reproducible throughput times via the job scheduler with a mixed workload. Also required an uninterrupted period of cluster utilisation over 5 working days. |
| 3 | "Health Check" & Stability | Confirm the functionality of the cluster administration tools and system stability. |
| 4 | Multiple Operating Systems | Test the ability to reboot / repartition compute nodes into different Linux-based OS distributions on the fly. |
| 5 | Environmental Tests | Check the environmental aspects of the cluster (power usage & failures, threshold alarm conditions) and the high-availability (HA) configuration of all resilient components. |
Running the Acceptance Tests
It was recognised early in the process that such strenuous tests were not routinely performed by any of the vendors, so daily conference calls were initiated to keep the tests on schedule. This way, any problem that developed on the system was rapidly documented and investigated by both Bull-UK and Bull-France engineers.
Performance Validation
The first stage was to confirm that the application codes featuring in the acceptance test validated correctly on the cluster with acceptable performance (on the home NFS partition, on the /scratch partition of the Lustre filesystem, and when writing to local disk under /tmp). Early tests were promising, but a couple of codes were consistently under-performing. Investigations showed this was due in part to one of the Bull-MPI routines (Alltoall). The MPI development team in France worked on improving the performance of this collective and rewrote it to exploit quad-core technology more effectively. In parallel, the need for a second MPI implementation on the system was recognised, to cover the possibility of new codes depending strongly on MPI routines that had yet to be optimised. To rectify this oversight in the software environment, Intel MPI was also procured and installed on the system.
Throughput Testing
One of the most complex tests in the process proved to be the throughput or workload tests. A great deal of effort was invested in generating a script that contained a mix of applications, datasets and processor counts, which would consistently run for 6 hours. Numerous factors could cause this test to fail, so ensuring the starting conditions were identical for each run proved crucial. To facilitate this, it was agreed that a 512-core partition would be created on the cluster for this benchmark. If the workload test were run over the entire cluster and a single node developed a problem, the test would be on hold until this was remedied; this was deemed too great a risk to the start of service.
This test would also fail if one job in the mix hung, as all the tasks needed to complete within a set time, or if the scheduler was not sensibly load-balancing the jobs on the system.
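The pass and fail conditions for the workload test can be illustrated with a minimal sketch. This is not the script used during the acceptance tests: the `Run` record, the job labels, the time limit and the reproducibility tolerance are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Run:
    """Outcome of one pass of the mixed workload (all fields hypothetical)."""
    wallclock_s: float          # first submission to last completion, in seconds
    finished: Dict[str, bool]   # job label -> True if completed, False if it hung


def run_passes(run: Run, limit_s: float) -> bool:
    """A run passes only if no job hung and the whole mix met the time limit."""
    return all(run.finished.values()) and run.wallclock_s <= limit_s


def reproducible(runs: List[Run], tolerance: float) -> bool:
    """True if the spread of throughput times is within `tolerance` of the mean."""
    times = [r.wallclock_s for r in runs]
    mean = sum(times) / len(times)
    return (max(times) - min(times)) / mean <= tolerance


if __name__ == "__main__":
    limit_s = 6 * 3600  # assumed limit, matching the 6-hour workload
    runs = [
        Run(21000.0, {"app_a_64": True, "app_b_128": True}),
        Run(21400.0, {"app_a_64": True, "app_b_128": True}),
    ]
    print("all runs pass:", all(run_passes(r, limit_s) for r in runs))
    print("reproducible within 5%:", reproducible(runs, 0.05))
```

Treating a single hung job as a failure of the whole run mirrors the requirement that every task in the mix completes within the set time.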
Initial problems were encountered on the Lustre filesystem (but not on NFS or /tmp), with two key codes failing during input processing at the start of the job. This strongly suggested a problem with the Lustre filesystem and, due to the random nature of the failure, it was feared that the underlying cause would be extremely difficult to determine. Within a week, the problem had been identified and resolved; the root cause was not the cluster filesystem but a fault on one of the Connect-X cards (not a natural conclusion from the original error conditions).
Stability & Environmental Tests
Due to the system administration emphasis of these two stages, they tended to be performed in tandem. These tests did not start in earnest until the latter stages of the acceptance process, when most of the performance problems had been diagnosed. Naturally, the basic system administration functionality had been demonstrated during the installation and commissioning of the system. However, these tests did reveal a number of key characteristics of the system which were unknown at the start of the process, and uncovered a number of High-Availability (HA) failover routines which did not initially work as advertised.
The first test to reveal the importance of a thoroughly documented procedure was the clean shutdown and restart of the system. The order in which processes are stopped and filesystems unmounted is critical, as the Lustre parallel filesystem is extremely sensitive and does not remount if it has not been stopped cleanly. A great deal of time was dedicated by both ARCCA and Bull-UK (on site) to performing these tests.
Since the infrastructure housing the Bull kit came from a third party, and this was the first time Bull had deployed it, a lot of time was invested in ensuring the Bull management software could collect and interpret information sent by the environmental monitoring software from APC (ISX). Wrappers and scripts were implemented to interpret signals sent from this software so that, in the event of an emergency, the cluster would shut down cleanly and quickly (quickly because, in the event of a generator failure, the UPS has a relatively short run time at maximum load).
As an interim measure whilst the ISX management software was being installed and configured, a shutdown script was developed based on interrogating internal (IPMI-based) monitoring tools within the Bull system, so that the system would power down if a set of alarm criteria was exceeded. This script was inadvertently tested when external engineers accidentally caused the chillers to shut down; the Bull system very quickly reached its maximum operating temperature threshold, at which point it shut down.
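The decision logic of a threshold-driven shutdown script like the interim one described above can be sketched as follows. This is an illustration, not the script developed for Merlin: the sensor names, the threshold values and the rule of requiring several consecutive breaches are assumptions, and the collection of readings from the IPMI tools is deliberately left out.

```python
from typing import Dict, List

# Hypothetical alarm thresholds in degrees Celsius; real limits would come
# from the hardware vendor and the machine room design.
THRESHOLDS: Dict[str, float] = {
    "inlet_temp_c": 35.0,
    "cpu_temp_c": 85.0,
}


def breached(readings: Dict[str, float],
             thresholds: Dict[str, float] = THRESHOLDS) -> List[str]:
    """Return the sorted names of sensors whose reading exceeds its threshold.

    A sensor missing from `readings` is treated as not breaching.
    """
    return sorted(
        name for name, limit in thresholds.items()
        if readings.get(name, float("-inf")) > limit
    )


def should_shut_down(history: List[Dict[str, float]],
                     thresholds: Dict[str, float] = THRESHOLDS,
                     consecutive: int = 3) -> bool:
    """Decide whether to power down, given polls ordered oldest to newest.

    Assumed policy: shut down only if the same sensor has breached its
    threshold in each of the last `consecutive` polls.
    """
    if len(history) < consecutive:
        return False
    recent = history[-consecutive:]
    persistent = set(breached(recent[0], thresholds))
    for readings in recent[1:]:
        persistent &= set(breached(readings, thresholds))
    return bool(persistent)


if __name__ == "__main__":
    normal = {"inlet_temp_c": 22.0, "cpu_temp_c": 60.0}
    hot = {"inlet_temp_c": 40.0, "cpu_temp_c": 70.0}
    print(should_shut_down([normal, hot, hot, hot]))
```

Requiring consecutive breaches is one way to keep a single spurious reading from powering down the cluster; given the short UPS run time mentioned above, the polling interval and count would need to be chosen so that a real failure is still acted on quickly.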
A number of minor configuration changes were needed to ensure that services migrated correctly from the primary master node to the secondary server. In particular, work was required on both the Lustre and NFS filesystems, as well as on the PBS (job scheduler) failover using the FLEXlm license server.
All components of the cluster which were designed to be resilient (storage, disks, dual power, fans etc.) were physically tested to ensure the correct steps were undertaken in the event of a failure.
