Productivity when programming parallel applications for heterogeneous platforms

Recent trends in computing aim to improve performance by incorporating different types of computing devices, such as GPUs, FPGAs and specialized CPU architectures, each suited to accelerating particular kinds of algorithms or to reducing energy consumption. Moreover, with the rise of Cloud and Fog computing, these devices can be geographically distributed and connected over the Internet by means of different types of networks.

Complex to program
Programming applications that take advantage of distributed heterogeneous computing systems is becoming a very complex and hard task, mainly due to the following issues that developers have to take into consideration:

Parallelization
Applications have to be split into different software bits which can be executed in parallel.

Remote execution
Besides identifying and implementing the parallel parts, developers have to execute the different software bits manually or implement a mechanism to spawn the processes automatically.

Data Management
Another important consequence of distributing applications is that memory is not shared, so developers have to implement the communication mechanisms to transfer data between the different computing locations. The same happens when developers use the accelerators available in a computing node, where data have to be transferred from the main node memory to the device memory. Moreover, using different architectures requires managing data serialization in order to translate data formats from one architecture to the other.

Scheduling, load balancing and resource management
The different computing bits can be executed on different nodes and computing devices, but the execution can be more efficient on one resource than on another. Programmers therefore have to use scheduling techniques to select the best resources to execute the different application bits.

Integration of legacy code, kernels and libraries
Sometimes users need to reuse code which is already available in existing libraries, or kernels implemented in OpenCL or CUDA for accelerators. In these cases, developers have to implement the glue code to integrate calls to this legacy software into their application.

Dynamicity
Finally, applications can have a dynamic load: depending on the input parameters or the application phase, they can require more or fewer resources. Therefore, another desired feature is the adaptation of the infrastructure to the application load and vice versa. These features have to be implemented by the user by exchanging messages with the resource provider's API.

Current solutions
In the state of the art, we can find solutions which normally solve only one part of these issues. There are remote invocation libraries and frameworks which simplify the remote execution of processes or methods, such as SSH, RMI or Web Services. There are job schedulers and resource managers, such as Slurm or Hadoop YARN, which allow users to schedule and execute jobs on different resources. At a higher level, there are programming models such as OpenMP or PGAS-based models, which simplify application parallelization in shared memory, and MPI, MapReduce or Spark for distributed-memory platforms, where an API for managing data transfers is either provided by the model (as in the MPI case) or automatically managed by the runtime.

What does the TANGO Programming Model offer for programming productivity?
The TANGO programming model, a combination of the COMPSs and OmpSs programming models, simplifies the programming of applications by transparently managing all of the aforementioned issues. We briefly describe these features in the following paragraphs.

Inherent Parallelism
The TANGO programming model offers a simple way of defining application bits as tasks, by means of pragmas, decorators or interface definition languages. The developers indicate the parts of the code which are tasks and the direction of the data used by each task (IN, OUT or INOUT). This information is used to detect the inherent parallelism of the application and to infer which tasks can run in parallel because there are no data dependencies between them. As commented in a previous blog post, we can define two levels of parallelism: coarse-grain tasks and fine-grain tasks. So, an application is composed of a workflow of coarse-grain tasks, and each coarse-grain task can be composed of a workflow of fine-grain tasks. The sketch after this paragraph illustrates the idea.
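As an illustration, here is a minimal sketch of task definition written against the PyCOMPSs decorator API (the Python binding of COMPSs, one of the two models combined in TANGO); the function names and data are purely illustrative:

    from pycompss.api.task import task
    from pycompss.api.parameter import INOUT
    from pycompss.api.api import compss_wait_on

    @task(returns=list)
    def generate_block(size):
        # Coarse-grain task: produces a block of data.
        return [float(i) for i in range(size)]

    @task(block=INOUT)
    def scale_block(block, factor):
        # Tasks that update different blocks share no data, so the runtime
        # sees them as dependency-free and can run them in parallel.
        for i in range(len(block)):
            block[i] *= factor

    if __name__ == "__main__":
        blocks = [generate_block(1024) for _ in range(4)]
        for b in blocks:
            scale_block(b, 2.0)          # runs as soon as its block exists
        blocks = compss_wait_on(blocks)  # synchronize and retrieve results

The direction annotation (INOUT on block) is what lets the runtime build the task dependency graph without any explicit synchronization in the user code.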

Transparent Remote Execution
The TANGO programming model is combined with a pair of runtimes (a platform-level and a node-level runtime) which manage the execution of the tasks in the distributed heterogeneous system. Each coarse-grain task is scheduled and executed on one or a set of computing nodes available in the computing platform, including the transfer of the required data; this is the work done by the platform-level runtime. Inside each computing node, the node-level runtime is in charge of doing the same with the fine-grain tasks and the available devices.
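To guide this placement, a coarse-grain task can declare the resources it needs; the sketch below assumes a PyCOMPSs-style @constraint decorator, and the task body is illustrative:

    from pycompss.api.task import task
    from pycompss.api.constraint import constraint

    @constraint(computing_units="8")  # ask for 8 cores on the chosen node
    @task(returns=float)
    def simulate_chunk(chunk_id):
        # The body may itself be parallelized at node level, e.g. with
        # OmpSs fine-grain tasks in an equivalent C/C++ implementation.
        return float(chunk_id) * 0.5

The platform-level runtime uses the constraint to pick a suitable node, while the node-level runtime exploits the reserved cores for the fine-grain tasks inside.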

Simplified Integration of legacy software, kernels and libraries
The TANGO programming model offers special annotations to integrate calls to binaries, MPI applications and CUDA or OpenCL kernels as tasks, and to call them as standard functions in the main application. The runtime detects data dependencies, schedules them as normal tasks, and transparently manages the execution of this legacy software without requiring the developer to write any glue code, as the sketch below shows.
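For instance, wrapping a pre-compiled executable as a task could look like the following sketch, assuming a PyCOMPSs-style @binary decorator; my_solver is a hypothetical legacy executable available on the worker nodes:

    from pycompss.api.binary import binary
    from pycompss.api.task import task
    from pycompss.api.parameter import FILE_IN, FILE_OUT

    @binary(binary="my_solver")  # hypothetical legacy executable
    @task(input_file=FILE_IN, output_file=FILE_OUT)
    def run_solver(input_file, output_file):
        # Empty body: the runtime invokes the binary, passing the
        # parameters as command-line arguments.
        pass

Calling run_solver(...) from the main code then behaves like any other task: the runtime tracks the file dependencies and stages the files in and out of the node where the binary runs.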

Self-adaptation
The TANGO programming model allows users to create different task versions, where a task can be implemented in different ways: as a sequential algorithm, parallelized with fine-grain tasks on multi-core CPUs, or as OpenCL or CUDA kernels for accelerators. During the execution, the runtime can replace one implementation with another depending on the available computing platform. For instance, it can use OpenCL or CUDA kernels when GPUs are available and fall back to CPU-only versions otherwise. In this way, the TANGO Programming Model allows developers to transparently adapt the application execution to the available infrastructure, as sketched below.
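A minimal sketch of this versioning mechanism, assuming PyCOMPSs-style @implement and @constraint decorators (the module name "app" and the exact constraint spelling for requesting a GPU are illustrative):

    from pycompss.api.task import task
    from pycompss.api.implement import implement
    from pycompss.api.constraint import constraint

    @task(returns=list)
    def blur(image):
        # Default version: plain sequential CPU algorithm.
        return [0.5 * p for p in image]

    # Alternative version of the same task, selected when a GPU is available.
    @implement(source_class="app", method="blur")
    @constraint(processors=[{"processorType": "GPU", "computingUnits": "1"}])
    @task(returns=list)
    def blur_gpu(image):
        # Would typically launch a CUDA or OpenCL kernel instead.
        return [0.5 * p for p in image]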
Moreover, the runtime is aware of the number of dependency-free tasks and of the currently available resources, so it can detect when there are pending tasks or idle resources. In these situations, the runtime transparently contacts other TANGO components to add or remove resources for the application execution. In this way, the TANGO toolbox allows the infrastructure to self-adapt to the application execution.

Clone it from our GitHub repository and try it!