Overview of the TANGO Programming Model

The current computing ecosystem is becoming increasingly heterogeneous. On the one hand, trends in computer architecture focus on combining different computing devices (CPUs, GPUs and FPGAs) and memories in a single chip or computing node, with the aim of providing devices better suited to different types of algorithms and applications. On the other hand, supercomputers, which have traditionally been composed of a large number of homogeneous nodes, are starting to include heterogeneous nodes with different cores, accelerators and memory capacities in order to achieve better performance with lower energy consumption. As a consequence of this heterogeneity, machines now offer powerful computing devices, but users must know how to use them well, because executing the same algorithm on one device or another can produce different results in terms of performance and energy consumption. Therefore, selecting the proper device for each part of an application is a key factor in achieving efficient execution.

Moreover, programming these heterogeneous platforms is not an easy task. For each accelerator, the developer has to add code to manage data transfers between device memories, spawn processes on these devices, and so on. For that reason, in the TANGO project we propose a programming model that facilitates the development of applications for next-generation distributed heterogeneous parallel architectures.

This programming model combines the programming models and runtimes of the StarSs family, developed at the Barcelona Supercomputing Center (BSC). StarSs is a family of task-based programming models in which developers mark parts of the application as tasks and indicate the direction (input, output, or both) of the data each task accesses. Based on these annotations, the programming model runtime analyzes the data dependencies between the defined tasks, detects the inherent parallelism, schedules the tasks on the available computing resources, manages the required data transfers and executes the tasks. The StarSs family currently consists of two frameworks: COMP Superscalar (COMPSs), which provides the programming model and runtime implementation for distributed platforms such as clusters, grids and clouds, and Omp Superscalar (OmpSs), which provides the programming model and runtime implementation for shared-memory environments such as multicore architectures and accelerators (such as GPUs and FPGAs).
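To make the dependency analysis more concrete, the following is a minimal Python sketch of how a StarSs-style runtime could derive task dependencies from direction annotations. It is not the COMPSs or OmpSs API: the names `Direction`, `Task`, `build_dependency_graph` and `parallel_levels` are hypothetical, and the sketch only models the idea that each task declares whether it reads (`IN`), writes (`OUT`) or reads and writes (`INOUT`) each piece of data, from which read-after-write, write-after-write and write-after-read dependencies follow.

```python
# Hypothetical sketch of StarSs-style dependency detection; not the COMPSs/OmpSs API.
from dataclasses import dataclass
from enum import Enum


class Direction(Enum):
    IN = "in"
    OUT = "out"
    INOUT = "inout"


@dataclass
class Task:
    name: str
    params: dict  # data name -> Direction


def build_dependency_graph(tasks):
    """Map each task index (program order) to the indices of earlier tasks it must wait for."""
    last_writer = {}          # data name -> index of the last task that wrote it
    readers_since_write = {}  # data name -> indices of tasks that read it after that write
    deps = {}
    for i, task in enumerate(tasks):
        deps[i] = set()
        for data, direction in task.params.items():
            # Read-after-write or write-after-write: wait for the last writer.
            if data in last_writer:
                deps[i].add(last_writer[data])
            if direction is Direction.IN:
                readers_since_write.setdefault(data, set()).add(i)
            else:
                # Write-after-read: a writer must wait for the earlier readers.
                deps[i] |= readers_since_write.get(data, set())
                last_writer[data] = i
                readers_since_write[data] = set()
    return deps


def parallel_levels(deps):
    """Group tasks into levels; tasks in the same level are mutually independent."""
    level = {}
    for i in sorted(deps):  # dependencies always point to earlier tasks
        level[i] = 1 + max((level[d] for d in deps[i]), default=-1)
    groups = {}
    for i, lv in level.items():
        groups.setdefault(lv, []).append(i)
    return [groups[lv] for lv in sorted(groups)]


tasks = [
    Task("init_a", {"a": Direction.OUT}),
    Task("init_b", {"b": Direction.OUT}),
    Task("scale_a", {"a": Direction.INOUT}),
    Task("add", {"a": Direction.IN, "b": Direction.IN, "c": Direction.OUT}),
    Task("reset_b", {"b": Direction.OUT}),
]

deps = build_dependency_graph(tasks)
print(deps)                   # {0: set(), 1: set(), 2: {0}, 3: {1, 2}, 4: {1, 3}}
print(parallel_levels(deps))  # [[0, 1], [2], [3], [4]]
```

In this example `scale_a` waits for `init_a` (read after write), `add` waits for both of its producers, and `reset_b` must wait for `add` to finish reading `b` (write after read), while `init_a` and `init_b` are independent and could run in parallel. A real runtime would dispatch each level (or, more precisely, each task whose dependencies are satisfied) to the available resources.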

In the case of TANGO, we propose to combine COMPSs and OmpSs hierarchically: an application is composed of a workflow of coarse-grain tasks developed with COMPSs, and each of these coarse-grain tasks can in turn be implemented as a workflow of fine-grain tasks developed with OmpSs. At runtime, coarse-grain tasks will be managed by the COMPSs runtime, which optimizes the execution at the platform level by distributing tasks across the compute nodes according to the task requirements and the heterogeneity of the cluster. Fine-grain tasks, in turn, will be managed by OmpSs, which optimizes their execution at the node level by scheduling them on the different devices available in the assigned node. Thanks to the versioning capabilities of these programming models, developers will be able to define different versions of a task for different computing devices (CPUs, GPUs, FPGAs) or combinations of them. As a result, the same application will be able to adapt to the different capabilities of the heterogeneous platform without modification. During execution, the programming model runtime will be in charge of optimizing the execution on the available resources in a coordinated way.
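The two-level placement and task versioning described above can be illustrated with a minimal Python sketch, under explicit assumptions. The names `Node`, `VersionedTask`, `place_coarse_task` and `PREFERENCE` are hypothetical, and the sketch does not reproduce the actual COMPSs or OmpSs versioning syntax. The fixed device preference order stands in for the performance and energy criteria a real runtime would apply, and the "gpu" version is a plain-Python stand-in that computes the same result as the CPU version.

```python
# Hypothetical sketch of two-level scheduling with task versions; not the COMPSs/OmpSs API.
from dataclasses import dataclass, field

# Assumed fixed preference order; a real runtime would use performance/energy models.
PREFERENCE = ("fpga", "gpu", "cpu")


@dataclass
class Node:
    name: str
    devices: frozenset  # device types available on this node, e.g. {"cpu", "gpu"}


@dataclass
class VersionedTask:
    name: str
    versions: dict = field(default_factory=dict)  # device type -> implementation

    def version(self, device):
        """Decorator registering an implementation of this task for one device type."""
        def register(fn):
            self.versions[device] = fn
            return fn
        return register

    def run_on(self, node, *args):
        """Node level: run the most preferred version the node can execute."""
        for device in PREFERENCE:
            if device in self.versions and device in node.devices:
                return device, self.versions[device](*args)
        raise RuntimeError(f"no version of {self.name} for devices {sorted(node.devices)}")


def place_coarse_task(requirements, nodes):
    """Platform level: pick the first node offering every required device type."""
    for node in nodes:
        if requirements <= node.devices:
            return node
    raise RuntimeError(f"no node satisfies {sorted(requirements)}")


saxpy = VersionedTask("saxpy")


@saxpy.version("cpu")
def saxpy_cpu(a, xs, ys):
    return [a * x + y for x, y in zip(xs, ys)]


@saxpy.version("gpu")
def saxpy_gpu(a, xs, ys):
    # Stand-in only: a real GPU version would launch a device kernel.
    return [a * x + y for x, y in zip(xs, ys)]


cluster = [Node("n1", frozenset({"cpu"})), Node("n2", frozenset({"cpu", "gpu"}))]

cpu_node = place_coarse_task(frozenset({"cpu"}), cluster)
print(cpu_node.name, saxpy.run_on(cpu_node, 2.0, [1.0, 2.0], [10.0, 20.0]))  # n1 ('cpu', [12.0, 24.0])
gpu_node = place_coarse_task(frozenset({"gpu"}), cluster)
print(gpu_node.name, saxpy.run_on(gpu_node, 2.0, [1.0, 2.0], [10.0, 20.0]))  # n2 ('gpu', [12.0, 24.0])
```

The application calls `saxpy.run_on` in the same way on both nodes; which version runs is decided by the (sketched) runtime from the devices available on the node chosen at the platform level, which mirrors the property that the application adapts to the platform without being modified.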