The TANGO project aims to deliver the best way to run applications with low power consumption. TANGO helps define the right hardware (CPU, GPU, FPGA, …) according to the application's needs. The Device Supervisor (defined in the global TANGO architecture) is in charge of managing the available resources, from their allocation to their usage. One of the key parameters to monitor is energy consumption.
Energy consumption has gradually become a very important parameter in High Performance Computing platforms. The Device Supervisor (DS) is the HPC middleware responsible for distributing computing power to applications, and it has knowledge of both the underlying resources and the jobs' needs. It is therefore the best candidate for monitoring and controlling the energy consumption of computations according to the job specifications. Integrating energy measurement mechanisms into the DS, and treating energy consumption as a new accounting characteristic, has become essential now that energy is a bottleneck to scalability. Since dedicated power meters would be too expensive, the DS can instead exploit existing measurement interfaces such as IPMI and RAPL to track energy consumption and enrich the monitoring of executions with energy data.
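As a rough illustration of how a counter-based interface such as RAPL can be turned into energy figures, the sketch below computes the energy consumed between two readings of a cumulative microjoule counter, allowing for the counter wrapping around. The function names and the example values are hypothetical; this is not the code used in the Device Supervisor, only a minimal sketch assuming a counter that counts up to a known maximum and then restarts at zero.

```python
def energy_delta_joules(prev_uj: int, curr_uj: int, max_range_uj: int) -> float:
    """Energy in joules between two cumulative counter readings (microjoules).

    Assumption: at most one wraparound happened between the two readings.
    """
    if curr_uj >= prev_uj:
        delta_uj = curr_uj - prev_uj
    else:
        # The counter wrapped: it reached max_range_uj and restarted at 0.
        delta_uj = (max_range_uj - prev_uj) + curr_uj
    return delta_uj / 1_000_000


def average_power_watts(prev_uj: int, curr_uj: int, max_range_uj: int,
                        interval_s: float) -> float:
    """Average power in watts over a sampling interval of interval_s seconds."""
    return energy_delta_joules(prev_uj, curr_uj, max_range_uj) / interval_s
```

The wraparound branch matters in practice because the longer the interval between two readings, the more likely the counter is to have restarted in between.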
The Device Supervisor is based on SLURM, an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. It was initially developed as a collaborative effort, primarily by Lawrence Livermore National Laboratory, SchedMD, Linux NetworX, Hewlett-Packard, and Groupe Bull. It is now widely deployed and is used as the workload manager on about 60% of the TOP500 supercomputers. As of June 2016, SLURM is used on 5 of the 10 most powerful clusters in the TOP500. SLURM is scalable by design: it can theoretically scale up to 65,536 nodes and handle up to 120,000 jobs per hour, which makes it well suited to demanding workloads that can take advantage of very large clusters. To follow power and energy consumption, SLURM provides a powerful monitoring framework. While a job executes, the data collection plugins are called periodically on each node. The data are stored in a per-node HDF5 file on a shared parallel file system.
Energy consumption is a single global value for a job, so it can naturally be stored in the database as a new job characteristic. Power consumption, however, is instantaneous, and since we want to store it for application profiling purposes, a highly scalable storage model is needed. Furthermore, since the application profile will also contain statistics collected from various other resources (network, filesystem, etc.), a mechanism with an extensible format is ideal for our needs. HDF5 is widely used in HPC for structured data. It has a versatile data model that can represent very complex data objects and a wide variety of metadata. Its file format is completely portable, places no limit on the number or size of data objects stored, and is open source.
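The relationship between the two quantities can be made concrete: the energy of a job is the integral of its instantaneous power over time. The following minimal sketch, with a hypothetical function name and assuming equally spaced samples, approximates that integral with the trapezoidal rule.

```python
from typing import Sequence


def energy_from_power(samples_w: Sequence[float], interval_s: float) -> float:
    """Approximate energy (joules) from equally spaced power samples (watts).

    Uses the trapezoidal rule; fewer than two samples give no interval
    to integrate over, so the result is 0.
    """
    if len(samples_w) < 2:
        return 0.0
    total = 0.0
    for first, second in zip(samples_w, samples_w[1:]):
        total += 0.5 * (first + second) * interval_s
    return total
```

A job-level energy value collapses the whole series into one number, which is why it fits in a database row, whereas the series itself has to live in a format built for large arrays.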
Figure 1 shows an overview of SLURM's daemons, processes and child threads as they are deployed on the different components of the cluster. The figure uses the real names of the daemons, processes and plugins (as they appear in the code) so that they can be mapped directly to the code.
The sampling frequency is an important profiling parameter, and the implementation lets users specify the frequency they need for each type of profiling. Currently the shortest possible sampling interval is 1 second, which according to other studies is too long for RAPL, where samples should be taken roughly every 1 ms to obtain the highest possible precision. Nevertheless, we argue that a thread polling 1000 samples per second would probably introduce a large overhead, and we wanted to establish the overall overhead and precision at 1 sample per second before trying to optimize the measurements.
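To give a feel for the trade-off discussed above, the sketch below counts how many samples a job would produce at a given rate, and builds the `srun` options that request profiling with per-type intervals. The helper names are hypothetical. The `--profile` and `--acctg-freq` options exist in SLURM, but the sketch only handles the `energy` and `task` types and should be checked against the documentation of the installed SLURM version.

```python
def samples_per_job(duration_s: int, rate_hz: int, nodes: int) -> int:
    """Total number of samples collected for one job across all its nodes."""
    return duration_s * rate_hz * nodes


def srun_profile_args(intervals_s: dict) -> list:
    """Build srun options for profiling with per-type intervals in seconds.

    Assumption: only the 'energy' and 'task' types are handled here.
    """
    for kind, seconds in intervals_s.items():
        if kind not in ("energy", "task"):
            raise ValueError(f"unsupported profiling type in this sketch: {kind}")
        if seconds < 1:
            raise ValueError(f"{kind}: interval must be at least 1 second")
    profile = ",".join(intervals_s)
    freq = ",".join(f"{kind}={seconds}" for kind, seconds in intervals_s.items())
    return [f"--profile={profile}", f"--acctg-freq={freq}"]
```

For a one-hour job on two nodes, one sample per second yields 7,200 samples, whereas 1000 samples per second would yield 7.2 million, which illustrates why the overhead at the lower rate was evaluated first.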
During the execution of a job, each profile thread creates one HDF5 file per node. At the end of the job, a merging phase combines all of these HDF5 files into a single binary file, which reduces the actual profile size per job. Profiling takes place only while the job is running, which is a big advantage over other profiling tools such as collectl, where profiling is continuous and produces an enormous volume of data that is not even correlated with jobs.
Figure 2 shows the data collected for two nodes.
These data can be exported so that they can be used in a more graphical way.
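As one possible way to prepare such data for a plotting tool, the sketch below flattens per-node power samples into CSV text. The input layout (a mapping from node name to a list of elapsed-time and power pairs), the column names and the function name are all assumptions made for illustration; they do not reflect the actual layout of the SLURM profile file.

```python
import csv
import io


def profile_to_csv(samples: dict) -> str:
    """Flatten {node: [(elapsed_s, watts), ...]} into CSV text, one row per sample."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["node", "elapsed_s", "power_w"])
    for node in sorted(samples):  # stable node order in the output
        for elapsed_s, watts in samples[node]:
            writer.writerow([node, elapsed_s, watts])
    return buffer.getvalue()
```

A long format with one row per sample and the node as a column is easy to load into most spreadsheet and plotting tools, and it extends to more nodes without changing the header.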
Thanks to the profiling features of SLURM, TANGO users will be able to understand how their applications consume energy on heterogeneous hardware. The next step will be to co-schedule the different parts of an application on the hardware best suited to each of them.