LINUX CLUSTER MANAGEMENT
PSSC Labs continues to develop the most full-featured Beowulf cluster management software in the industry. CBeST is a proprietary cluster management package composed of kernel optimizations, custom scripts and open source tools that are elegantly integrated, optimized and customized for your specific cluster hardware and application. CBeST includes message passing interfaces, a batch scheduler, node monitoring tools, and node repair and replication tools. CBeST also includes a variety of proprietary scripts developed by PSSC Labs to ease cluster administration. All CBeST components must pass PSSC Labs' thorough testing and approval procedures before any equipment is shipped. CBeST includes unlimited lifetime support at NO additional cost. CBeST is the intellectual property of PSSC Labs and is available on PSSC Labs PowerWulf Clusters. Below are the components included with CBeST.
. Operating System
. Kernel Optimization
. Custom Scripts
. Message Passing
. Portable Batch Schedulers
. Network File Systems
. C & Fortran Compilers
. Hardware & Temperature Monitoring Utility
. Node Imaging & Replication Utility
. Remote Power Management Utility
. Security Settings
. Recovery Utility
. CBeST User Manual
. Unlimited Lifetime Support
COMPLETE POWERWULF RECOVERY (CPR)
Complete PowerWulf Recovery (CPR) is an easy-to-use software/hardware utility that allows PowerWulf cluster users to quickly restore head node and slave node software images to their original factory-installed state. This recovery includes all components of PSSC Labs' Complete Beowulf Software Toolkit (CBeST), including the operating system, custom scripts, message passing software, batch scheduling system and cluster management/monitoring utilities. CPR comes with documented instructions and takes less than 15 minutes to run.
PSSC Labs supports several operating systems on its industry-recognized PowerWulf Clusters, including Fedora, Red Hat Enterprise Linux and Microsoft Windows Cluster.
The Fedora Project is sponsored by Red Hat and developed by the open source community. Fedora serves as a testing ground for future releases of commercially available Red Hat products. PSSC Labs supports Fedora on all its Beowulf Linux Clusters, Workstations, Servers and Storage devices. In most instances, Fedora is a sufficient alternative to the commercially available operating systems. To date, PSSC Labs has shipped over 300 clusters based on Fedora.
Red Hat Enterprise Linux Server is available for PSSC Labs PowerWulf Clusters, Workstations, Servers and Storage devices. Red Hat offers a wide variety of support options to satisfy everyone from university users to the most demanding corporate environments.
Most Linux Supercomputer manufacturers focus on the nuts and bolts of a Linux Supercomputer. PSSC Labs pays careful attention to your supercomputer's hardware stability and performance. Once the hardware is fully tested and benchmarked, PSSC Labs Linux Experts go to work installing and optimizing your supercomputer's Linux kernel to match your specific hardware and software needs. Our Linux Experts carefully customize all aspects of the Linux kernel, turning kernel options on or off to best match your hardware. This prevents unnecessary system resource drain. PSSC Labs Cluster Experts use ACPI (Advanced Configuration and Power Interface) to enhance performance through better resource management at the motherboard level. Hardware drivers are updated on your Linux Supercomputer for maximum performance. PSSC Labs Linux Experts also update and patch file system drivers and userland utilities. One important note: this optimized kernel remains completely open source, so you can make any necessary adjustments.
Kernel optimization is a time-consuming process. However, through this process PSSC Labs Linux Experts can improve cluster performance by as much as 15%. Some Linux builders skip this important step and simply install a prepackaged version of Linux. PSSC Labs takes this extra time to deliver the most complete, turn-key high performance computing solution, not just a bunch of computers.
PSSC Labs includes several custom scripts to ease cluster maintenance. Although the scripts are custom, PSSC Labs can provide source information if required. Below is a sample of the scripts included with each PowerWulf Cluster.
1) A node synchronization cron script allows for remote file copy across the cluster, keeping all nodes updated to the latest configuration.
2) Parallel command scripts use the Ganglia Execution Environment (gexec), the Parallel Distributed Shell (pdsh) and ssh/rsh/rlogin commands to launch and execute administrative commands across cluster nodes.
3) Health monitoring scripts are installed to work with Ganglia to record and graphically represent cluster metrics dating back up to one year. These metrics include processor, motherboard and hard drive temperatures. Processor and chassis fan status can also be monitored through Ganglia. You can also add your own scripts to track additional metrics.
4) Remote power management scripts are installed for easy power on / power off and node reboot.
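As a rough illustration of the parallel command approach in item 2, the following Python sketch builds a pdsh command line for a set of nodes. The node names (n01, n02) and the helper name build_pdsh_command are hypothetical, chosen only for this example; the actual CBeST scripts are not shown here and may work differently.

```python
import shlex


def build_pdsh_command(nodes, remote_command):
    """Return the argv list that runs remote_command on nodes via pdsh.

    Relies on pdsh's -w option, which takes a comma-separated host list.
    """
    if not nodes:
        raise ValueError("at least one node is required")
    return ["pdsh", "-w", ",".join(nodes), remote_command]


if __name__ == "__main__":
    # Hypothetical node names; substitute the names used on your cluster.
    argv = build_pdsh_command(["n01", "n02"], "uptime")
    # Print the command instead of running it, so the sketch is safe to try.
    print(shlex.join(argv))
```

To actually execute the command, the argv list could be passed to subprocess.run on a head node where pdsh is installed.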
MPICH (1 & 2) is a portable implementation of the Message Passing Interface (MPI) standard, developed by Argonne National Laboratory. It is designed to be highly portable and scalable. MPICH is often run over TCP/IP over Ethernet.
OpenMP (Open Multi-Processing) is an application programming interface (API) that supports multi-platform shared memory multiprocessing programming in C/C++ and Fortran on many architectures, including Unix and Microsoft Windows platforms. It consists of a set of compiler directives, library routines, and environment variables that influence run-time behavior.
Parallel Virtual Machine (PVM) is one of the original parallel computing tools, developed by the University of Tennessee, Emory University and Oak Ridge National Laboratory. PVM is now largely replaced by MPI.
LAM/MPI is a parallel programming environment and development system for a network of systems. LAM/MPI allows a computer network to act as a parallel supercomputer.
TORQUE (Tera-scale Open-source Resource and QUEue manager) is a resource manager providing control over batch jobs and distributed compute nodes. It is a community effort based on the original OpenPBS project and has incorporated significant advances in the areas of scalability, fault tolerance,
MOAB is an intelligent management middleware that provides simple Web-based job management, graphical cluster administration and management reporting tools. Organizations will benefit from the ability to provide guaranteed service levels to users and organizations, higher resource utilization rates, and the ability to get more jobs processed with the same resources, resulting in an improved ROI.
Sun's Grid Engine (SGE) coordinates the resources of multiple computers in the solution of single problems. Jobs submitted to the Grid's master host are separated into pieces and distributed to the various slave hosts. This can result in significant decreases in the time required for a job to run. SGE is available for download as an open source package.
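To give a feel for how work is handed to a resource manager such as TORQUE, the Python sketch below generates a minimal PBS-style job script. The #PBS -N and #PBS -l directives and the PBS_O_WORKDIR variable are standard in PBS-derived systems; the job name, resource values and the ./my_app program are hypothetical placeholders, and your site may require different resource requests.

```python
def make_pbs_script(job_name, nodes, ppn, walltime, command):
    """Build the text of a minimal PBS/TORQUE job script."""
    lines = [
        "#!/bin/sh",
        f"#PBS -N {job_name}",                # job name shown in the queue
        f"#PBS -l nodes={nodes}:ppn={ppn}",   # nodes and processors per node
        f"#PBS -l walltime={walltime}",       # maximum run time (HH:MM:SS)
        "cd $PBS_O_WORKDIR",                  # directory the job was submitted from
        command,
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical job: 2 nodes x 4 processors running an MPI program.
    script = make_pbs_script("demo", 2, 4, "01:00:00", "mpirun -np 8 ./my_app")
    print(script, end="")
```

A script produced this way would typically be saved to a file and submitted with qsub.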
ext3 - The ext3 filesystem is a journaling filesystem that is 100% compatible with all of the utilities created for creating, managing, and fine-tuning the ext2 filesystem, which is the default filesystem used by Linux systems.
Global File System (GFS) - Red Hat GFS allows Red Hat Enterprise Linux servers to simultaneously read and write to a single shared file system on the SAN, achieving high performance and reducing the complexity and overhead of managing redundant data copies. Red Hat GFS has no single point of failure, is incrementally scalable from one to hundreds of Red Hat Enterprise Linux servers, and works with all standard Linux applications.
Network File System (NFS) - Distributed file system that allows NFS servers to give access to their local file system to NFS clients over a network using TCP/IP.
Parallel Virtual File System (PVFS) - PVFS is a parallel file system. It allows applications, both serial and parallel, to store and retrieve file data which is distributed across a set of I/O servers. This is done through traditional file I/O semantics, which is to say that you can open, close, read, write, and seek in PVFS files just as in files stored in your home directory. The primary goal of PVFS is to provide a high performance "global scratch space" for clusters of workstations running parallel applications.
Open Source - Linux distributions include open source Fortran and C compilers. All PSSC Labs clusters, workstations and servers ship with these compilers.
Absoft - Absoft's latest Linux Fortran Compiler (Fortran 95 v9.0 for 64-bit x64 processors from AMD and Intel) combines superior performance, solid reliability and the industry's most complete list of tools and libraries into a single package.
Intel - Gain optimal performance on Intel® processors from your Linux* applications using Intel® C++ and Fortran Compiler for Linux. PSSC Labs Cluster Technicians have also installed the Intel compilers on the AMD platform.
Lahey - LF95 for Linux includes Full Fortran 95/90/77, Automatic Parallelization, OpenMP 2.0 compatibility, New global compile-time diagnostics, File I/O speed improvements, Thread-safe BLAS and LAPACK, Improved runtime diagnostics, Winteracter Starter Kit, Thread-safe SSL2 math library, and more.
Pathscale - PathScale's goal is to make it easier to develop and deploy 64-bit applications into clustered environments. PathScale has developed the industry's highest-performance C, C++, and Fortran 9X compilers for 64-bit Linux-based computer systems.
Portland Group - Optimizing Fortran, C and C++ Compilers for 32-bit x86, 64-bit AMD64 and 64-bit Intel* EM64T processor-based Linux* and Windows* computer systems.
**Above descriptions supplied by the vendors' websites.
Ganglia allows you to quickly and easily view all important operating conditions of your Linux Supercomputer through a GUI. You can remotely access details regarding CPU usage, CPU temperatures, chassis and CPU fan speeds, hard drive temperatures, memory utilization, hard drive swap space and many more metrics. Ganglia even allows you to define and view your own set of metrics.
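Custom metrics are commonly published to Ganglia with its gmetric command-line tool. The Python sketch below only assembles such a command; the metric name rack_inlet_temp and its value are hypothetical, and the exact way CBeST feeds metrics to Ganglia is not documented here.

```python
def build_gmetric_command(name, value, metric_type="float", units=""):
    """Return the argv list for publishing one metric with gmetric."""
    argv = [
        "gmetric",
        f"--name={name}",
        f"--value={value}",
        f"--type={metric_type}",
    ]
    if units:
        argv.append(f"--units={units}")
    return argv


if __name__ == "__main__":
    # Hypothetical metric; print the command rather than executing it.
    print(" ".join(build_gmetric_command("rack_inlet_temp", 24.5, units="Celsius")))
```

Running the resulting command periodically (for example from cron) on a node with Ganglia's gmond running would make the metric appear alongside the built-in ones.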
Temperature Monitoring
Maintaining a good computing environment is key to the success and longevity of your Linux Supercomputer. Application runs can require thousands of compute hours, so your Linux Supercomputer needs to be as reliable as possible. CBeST includes LM_Sensors to monitor CPU temperatures and case fans. PSSC Labs Linux Experts configure LM_Sensors to send a warning prompt if a CPU overheats.
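The kind of check described above can be sketched in Python as follows. The sample text imitates typical output of the sensors command, but the exact format varies with hardware and LM_Sensors version; the 80 degree limit and the function name overheated are assumptions for illustration, not the thresholds or code PSSC Labs configures.

```python
import re

# Matches lines such as "Core 0:   +45.0°C  (high = +80.0°C, ...)".
# Only the first temperature on each line (the current reading) is captured.
TEMP_LINE = re.compile(
    r"^(?P<label>[^:\n]+):\s+(?P<temp>[+-]?\d+(?:\.\d+)?)\s*°C",
    re.MULTILINE,
)


def overheated(sensors_text, limit_c):
    """Return (label, temperature) pairs at or above limit_c degrees Celsius."""
    hot = []
    for match in TEMP_LINE.finditer(sensors_text):
        temp = float(match.group("temp"))
        if temp >= limit_c:
            hot.append((match.group("label").strip(), temp))
    return hot


# Hypothetical sample output; real output depends on the installed sensors.
SAMPLE = """\
Core 0:       +45.0°C  (high = +80.0°C, crit = +100.0°C)
Core 1:       +83.5°C  (high = +80.0°C, crit = +100.0°C)
fan1:        1200 RPM
"""

if __name__ == "__main__":
    for label, temp in overheated(SAMPLE, 80.0):
        print(f"WARNING: {label} is at {temp:.1f} C")
```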
Linux Supercomputer administrators face many issues, including maintaining a consistent OS kernel on all nodes. In addition, Linux administrators may face the daunting task of repairing a corrupted file system, replacing a failed hard drive or adding new nodes to an existing Linux Supercomputer. CBeST eases this process with the use of SystemImager. SystemImager enables you to manually create an entire slave node image on the head node and then push this image to every slave node of your supercomputer. You can update the slave node systems by syncing them to an updated image on the head node. These updates are extremely fast because only the portions that have changed are pushed to the slave nodes. Complete details on using SystemImager are included in the CBeST user manual.
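The idea of pushing only what has changed can be illustrated with a small Python sketch that compares checksum manifests of the head node image and a slave node. This is a conceptual illustration only: the function names and file contents are hypothetical, and it is not a description of how SystemImager itself detects or transfers changes.

```python
import hashlib


def manifest(files):
    """Map each path to the SHA-256 digest of its content (bytes)."""
    return {path: hashlib.sha256(data).hexdigest() for path, data in files.items()}


def paths_to_push(image_manifest, node_manifest):
    """Return paths that are missing on the node or differ from the image."""
    return sorted(
        path
        for path, digest in image_manifest.items()
        if node_manifest.get(path) != digest
    )


if __name__ == "__main__":
    # Hypothetical file contents standing in for an image and a node.
    image = {"/etc/hosts": b"a", "/etc/motd": b"new", "/etc/fstab": b"x"}
    node = {"/etc/hosts": b"a", "/etc/motd": b"old"}
    for path in paths_to_push(manifest(image), manifest(node)):
        print("push", path)
```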
Remote power management scripts are installed for easy power on / power off and node reboot. These scripts include:
node-wakeup
This script uses WOL (Wake-on-LAN) to remotely turn on nodes, using the MAC addresses found in /etc/dhcpd.conf. Note that WOL is not a reliable service. The script uses Ganglia to build a list of active nodes.
Arguments: all or short node name.
Examples:
:: node-wakeup all
:: node-wakeup n12
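For readers curious about the mechanism, the Python sketch below shows the general Wake-on-LAN technique: read MAC addresses from "hardware ethernet" statements in dhcpd.conf-style text, then build the standard magic packet (six 0xFF bytes followed by the MAC address repeated 16 times). It is not the node-wakeup script itself; the function names, the sample host entry, and the broadcast address and UDP port defaults are assumptions for illustration.

```python
import re
import socket


def parse_macs(dhcpd_conf_text):
    """Extract MAC addresses from 'hardware ethernet' statements."""
    pattern = r"hardware\s+ethernet\s+([0-9A-Fa-f:]{17})\s*;"
    return re.findall(pattern, dhcpd_conf_text)


def magic_packet(mac):
    """Build a WOL magic packet: 6 bytes of 0xFF, then the MAC 16 times."""
    mac_bytes = bytes.fromhex(mac.replace(":", ""))
    if len(mac_bytes) != 6:
        raise ValueError(f"not a 6-byte MAC address: {mac!r}")
    return b"\xff" * 6 + mac_bytes * 16


def send_wol(mac, broadcast="255.255.255.255", port=9):
    """Broadcast the magic packet over UDP (address and port are assumptions)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(magic_packet(mac), (broadcast, port))


if __name__ == "__main__":
    # Hypothetical dhcpd.conf entry; nothing is sent on the network here.
    sample = "host n12 {\n  hardware ethernet 00:11:22:33:44:55;\n}\n"
    for mac in parse_macs(sample):
        print(mac, len(magic_packet(mac)), "bytes")
```

Because the packet is sent over UDP with no acknowledgement, the sender cannot tell whether a node actually woke up, which is consistent with the caution above about WOL reliability.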
node-reboot
This script reboots all the currently running compute nodes. It uses gsh or rsh/ssh for remote command execution.
Arguments: all or short node name.
Examples:
:: node-reboot all
:: node-reboot n12
node-poweroff
This script halts (and powers off when supported) all the running compute nodes. It uses gsh or rsh/ssh for remote command execution.
Arguments: all or short node name.
Examples:
:: node-poweroff all
:: node-poweroff n12
Port mapper and ipchains / iptables are configured to help keep your supercomputer secure. Cron scripts are included to keep user accounts and configuration files synchronized. Examples for allowing client machines access to your cluster through NFS resources are provided in your user manual.
The CBeST User Manual has been called "exactly what I want to see" by system administrators. Detailed information on CBeST tools, installation, customization, operation and troubleshooting is included with every PSSC Labs Linux Supercomputer.
Your CBeST user manual is your ultimate supercomputer resource. From the minute your PSSC Labs supercomputer arrives at your door, your user manual will guide you through everything from the initial installation of the supercomputer to troubleshooting MPI and PBS jobs. Developed by the same PSSC Labs Linux Experts who installed the software on your Linux Supercomputer, the user manual is designed to assist and educate supercomputer administrators at all levels. If you are a research scientist focusing on your own work rather than administering a supercomputer, the CBeST documentation is the perfect first resource guide. If you are a full-time system administrator, this manual will walk you through all aspects of the CBeST installation and give you plenty of options to make your own changes.
All PSSC Labs Linux Supercomputers include complete email and phone support for the lifetime of your Linux Supercomputer. This covers all questions related to CBeST, including MPI and OpenPBS. There is never a fee for this support, no matter how many times you contact PSSC Labs. The experience PSSC Labs has gained through delivering 300 Linux Supercomputers enables our Linux Experts to support any issues you may have. We can answer your technical questions regarding the Linux operating system, message passing software, security settings and performance. PSSC Labs prides itself on providing the best level of support possible. Give us a test and contact a PSSC Labs Linux Expert today.