Designing and developing software that executes efficiently on a distributed environment of fairly standard homogeneous processing nodes is already a difficult exercise. This complexity explodes when targeting a heterogeneous environment composed not only of distributed multi-core CPU nodes but also of accelerators such as many-core CPUs, GPUs and FPGAs. In the current era of heterogeneous hardware, software development teams face the daunting task of designing software capable of exploiting the underlying heterogeneous hardware devices to the fullest, with the goal of achieving the best time and energy performance.
The algorithmic decomposition chosen to solve the problem at hand and the selected granularity of computing tasks determine how efficiently software executes on a given piece of hardware, and hence affect time and energy performance. For instance, many algorithms exist for matrix operations, data sorting, or finding the shortest path in a graph. Developers already take into account data properties such as matrix sizes or the degree of graph connectedness to select an algorithm with optimal time and energy performance. Nowadays, they must also consider the capabilities offered by the hardware in terms of parallel processing and data throughput. Such hardware capabilities influence design decisions on algorithmic decomposition and task granularity, for the sake of efficient performance. For instance, the time and energy performance of matrix multiplication on a GPU or FPGA is directly influenced by the matrix sizes as well as the level of parallelism possible on each kind of processing node, its clock speed, its memory capacity, and its data transfer latencies, both internally within the chip and externally through its I/O interfaces. In other words, the most appropriate algorithmic decomposition and task granularity are jointly determined by data properties and the capabilities of the underlying heterogeneous hardware available.
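To make the granularity idea concrete, the sketch below contrasts a naive matrix multiplication with a tiled decomposition in plain C. The tile size is the granularity knob: on a GPU a tile would map naturally to a thread block, while on a CPU it governs cache reuse. The sizes and the `TILE` value are illustrative, not recommendations for any particular device.

```c
#include <string.h>

#define N 64          /* matrix dimension, illustrative */
#define TILE 16       /* tile size: the task-granularity knob to tune per device */

/* Naive triple loop: one monolithic task, baseline decomposition. */
static void matmul_naive(double A[N][N], double B[N][N], double C[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double acc = 0.0;
            for (int k = 0; k < N; k++)
                acc += A[i][k] * B[k][j];
            C[i][j] = acc;
        }
}

/* Tiled decomposition: each TILE x TILE block of C is a candidate task.
 * Same arithmetic, different granularity; the best TILE depends on the
 * target hardware's parallelism, cache/memory sizes and latencies. */
static void matmul_tiled(double A[N][N], double B[N][N], double C[N][N]) {
    memset(C, 0, sizeof(double) * N * N);
    for (int ii = 0; ii < N; ii += TILE)
        for (int jj = 0; jj < N; jj += TILE)
            for (int kk = 0; kk < N; kk += TILE)
                for (int i = ii; i < ii + TILE; i++)
                    for (int j = jj; j < jj + TILE; j++)
                        for (int k = kk; k < kk + TILE; k++)
                            C[i][j] += A[i][k] * B[k][j];
}
```

Both variants compute the same result; only the shape and size of the work units change, which is precisely what differs across CPU, GPU and FPGA targets.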
In addition to designing software for today’s operational conditions, developers must also strike the right balance between achieving optimal performance now and keeping the design and implementation flexible and evolvable for tomorrow. The most efficient algorithmic decomposition and task granularity for today’s heterogeneous hardware and dataset properties will change in the future. In the worst case, evolution in hardware or data properties impacts the software design and architecture, forcing developers to adapt the application code drastically: another algorithm must be implemented in order to better exploit the new hardware or the new kind of data. In less radical situations, the overall software architecture and algorithms can remain unaltered; only the task granularity must be changed, for instance to process larger quantities of data at once. New technologies and programming models such as OpenCL or OmpSs/COMPSs can accommodate changes in task granularity without much effort, hence keeping the software implementation fairly evolvable. However, it remains the developers’ job to identify the appropriate task granularity for achieving improved time and energy performance, and to provide this granularity information to the underlying technology or programming-model tools.
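As a small illustration of passing granularity information to a programming model, the OpenMP snippet below exposes the chunk size of a parallel loop as a parameter: the loop body never changes, only the size of the work unit each thread grabs. This is a generic OpenMP example, not TANGO-specific; compile with `-fopenmp` (without it the pragma is simply ignored and the loop runs serially).

```c
/* Dot product where the task granularity (chunk) is handed to the
 * OpenMP runtime via the schedule clause. Tuning chunk changes how
 * work is carved up without touching the algorithm itself. */
double dot(const double *a, const double *b, int n, int chunk) {
    double sum = 0.0;
    #pragma omp parallel for schedule(dynamic, chunk) reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```

Re-tuning for new hardware then amounts to changing one parameter rather than re-architecting the code — the kind of evolvability the paragraph above describes.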
One of the goals of the TANGO project is to provide design-time tooling that helps developers make insightful design decisions so that their software can exploit the underlying hardware irrespective of the programming technologies and programming models chosen.
The initial approach to guide design decisions, proposed in the first year of TANGO, relies on rapid prototyping of the various simple software building blocks needed in a given application. The first step for developers consists of developing a set of simple prototypes for selected building blocks. Each prototype implements a particular algorithmic decomposition and task granularity for one of these simple software building blocks. For instance, a C or CUDA implementation of matrix multiplication targets CPUs or NVIDIA GPUs, respectively.
Developers can usually find alternative implementations of simple software building blocks that target processors with a fixed instruction set, such as multi-core CPUs, many-core processors and GPUs. These implementations rely on programming technologies such as MPI, OpenMP, CUDA or OpenCL. In contrast, the use of FPGAs and other reconfigurable hardware has so far remained more complex and restricted to far fewer experts. To address this issue, the TANGO development-time tooling proposes a tool, named Poroto, to ease porting segments of standard higher-level code to FPGA. While OpenCL has recently added FPGA synthesis to its compilation tool chain, in many cases developers only have implementations of simple building blocks in C (or other programming languages). In such cases, an initial prototype implementation may be easier with Poroto than re-writing the existing C code in OpenCL. By annotating portions of C code with Poroto pragmas, the tool generates the associated FPGA kernels and interfaces them with the code running on the CPU of the host machine through PCI. The main processing program remains in C and is augmented with the necessary code, encapsulated in a C wrapper file, that handles the data transfer and control of the offloaded FPGA computations. The C portion to be offloaded to the FPGA is transformed into an equivalent VHDL program by leveraging open-source C-to-VHDL compilers such as ROCCC. The VHDL can then be passed to the lower-level synthesis toolchain of the particular FPGA vendor, such as Xilinx or Intel/Altera, to generate the bitstream for a specific FPGA target. Concerning data transfers to and from the FPGA, Poroto currently relies on a proprietary technology. However, an on-going TANGO effort consists of replacing this proprietary technology with RIFFA, an open-source framework, to achieve similar data transfer operations.
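To give a feel for the annotation style, the sketch below marks a simple C kernel for offload. The pragma name and clause shown are hypothetical, chosen only to convey the idea; consult the Poroto documentation for the tool's actual directives. The host code stays in C, and the tool would replace the annotated function's execution with wrapper code driving the FPGA over PCI.

```c
/* NOTE: "#pragma poroto offload" below is an illustrative, made-up
 * directive, not Poroto's documented syntax. A kernel like this, a
 * simple loop over arrays, is a good FPGA candidate: the tool would
 * extract it, compile it to VHDL (e.g. via ROCCC), and generate the
 * CPU-side wrapper handling PCI data transfers. */
#pragma poroto offload
void vector_scale(const float *in, float *out, float factor, int n) {
    for (int i = 0; i < n; i++)
        out[i] = in[i] * factor;
}
```

Unknown pragmas are ignored by standard C compilers, so the annotated code still builds and runs unchanged on a CPU, which is convenient for keeping a single source for both targets.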
Once the various prototypes of the different simple software building blocks have been implemented, compiled and deployed on the different targeted heterogeneous hardware, it becomes possible to benchmark each prototype variant with different representative datasets. The benchmarking exercise is not restricted to FPGA implementations: the initial code of a simple software block can be executed on multi-core and many-core CPUs, and if code also exists for GPUs, it can be included in the benchmarking exercise as well.
Once the time and energy benchmarks for the different prototype implementations of the various simple blocks have been collected from executions on the different heterogeneous hardware targeted, developers must then identify an optimal way to place a combination of prototype implementations on the various hardware devices available in the initial operational environment, so as to achieve efficient time and energy performance. This optimisation problem, trading off time against energy, is not simple to solve even for the prototype implementations of a single simple software block. When considering prototype implementations of several simple blocks competing for a fixed set of heterogeneous hardware resources, it becomes very useful to automate this optimisation exercise.
In TANGO, the development-time tooling relies on OscaR, an open-source optimisation engine originating from operational research, to search for optimal ways to map the implementations of the different software blocks onto the different heterogeneous hardware nodes available. During the first year of TANGO, an initial optimisation tool named Placer, based on constraint programming and search, was proposed to find a placement of software blocks that optimises energy performance. Placer executes fairly fast, given that current placement problems are rather small. Developers can therefore quickly run Placer several times for different subsets of hardware resources if desired. This makes it possible to explore the potential of adding a particular type of heterogeneous hardware resource to the operational environment versus leaving it free for other applications to use.
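The toy example below conveys the shape of the placement problem Placer solves. It is a deliberately simplified stand-in: a brute-force search over a tiny instance with made-up benchmark figures, whereas Placer uses constraint programming on OscaR to scale beyond what enumeration allows. Each software block is assigned to one node (CPU, GPU or FPGA) so that total energy is minimised while total time respects a deadline.

```c
#define NBLOCKS 3
#define NNODES  3   /* 0 = CPU, 1 = GPU, 2 = FPGA */

/* Illustrative benchmark results: energy[b][n] and time_ms[b][n] are
 * the measured cost of running block b on node n. Figures are made up. */
static const double energy[NBLOCKS][NNODES] = {
    { 5.0, 2.0, 1.0 },
    { 4.0, 1.5, 2.5 },
    { 3.0, 6.0, 2.0 },
};
static const double time_ms[NBLOCKS][NNODES] = {
    { 10.0, 3.0, 8.0 },
    {  8.0, 2.0, 9.0 },
    {  6.0, 1.0, 7.0 },
};

/* Exhaustive search: returns the minimal total energy over all
 * assignments whose summed time meets the deadline, filling best[]
 * with the chosen node per block; returns -1.0 if none is feasible. */
double place(double deadline, int best[NBLOCKS]) {
    double best_e = -1.0;
    for (int a0 = 0; a0 < NNODES; a0++)
      for (int a1 = 0; a1 < NNODES; a1++)
        for (int a2 = 0; a2 < NNODES; a2++) {
            int a[NBLOCKS] = { a0, a1, a2 };
            double e = 0.0, t = 0.0;
            for (int b = 0; b < NBLOCKS; b++) {
                e += energy[b][a[b]];
                t += time_ms[b][a[b]];
            }
            if (t <= deadline && (best_e < 0.0 || e < best_e)) {
                best_e = e;
                for (int b = 0; b < NBLOCKS; b++) best[b] = a[b];
            }
        }
    return best_e;
}
```

With a loose deadline each block simply takes its cheapest node; tightening the deadline forces blocks onto faster but more energy-hungry nodes, which is exactly the time/energy tension the tooling automates.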
This blog post gave a first glance at the design-time challenges faced by developers aiming at optimised implementations of their software on heterogeneous hardware infrastructures. It briefly introduced two development-time tools from the TANGO project, namely Poroto and Placer. Both will be described in more detail in future blog posts.