Monitoring heterogeneous processors with collectd - how to create a custom plugin in C

In this blog post we will walk through the process of creating a collectd plugin in C for monitoring NVIDIA GPUs.

In a previous blog post we introduced the monitoring objectives of the TANGO solution. One of these objectives is to monitor different kinds of processors, including NVIDIA GPUs, in order to obtain energy measurements and other metrics from them. For this purpose, we decided to rely on collectd and integrate it into the stack of tools responsible for the monitoring tasks. Collectd is a UNIX daemon, written in C, which periodically collects system and application performance metrics. One of its main advantages is its modular design, which allows custom plugins to extend its monitoring capabilities.

Why this blog post? The main reason is that I ran into some difficulties when I started working on the TANGO Monitoring Infrastructure. As mentioned above, one of our main tasks was to write, in C, the tools that monitor these heterogeneous devices or processors, and to integrate them into collectd. On the one hand, I was not used to programming in C (I’m more of a Java guy), and a long time had passed since I last used the language. On the other hand, it was not easy to find articles, documentation or tutorials that could help us with this specific task.

Since creating a plugin for NVIDIA devices is much like creating any other custom collectd plugin, I hope this blog entry helps other programmers extend the capabilities of a monitoring system like the one used in the TANGO solution.

Why C? First, device drivers are often written in C. This is because C is "closer" to the hardware than most other programming languages, which makes it easier to manage those devices efficiently. Second, collectd itself is written in C. Thus, although collectd can use plugins written in other programming languages, we thought that C was the best and most coherent option.

Where to start? First and foremost, we need a computer with an NVIDIA GPU installed and running :)

Second, we need to install the NVIDIA Management Library (NVML) in order to be able to monitor and manage an NVIDIA device. The NVML API Reference documentation [1] describes not only the NVIDIA devices supported by this library, but also all the available functions (device queries, initialization, error handling, etc.).

Finally, before programming and installing the plugin, we need to install collectd. There is more than one way to deploy and install it. I prefer cloning the collectd GitHub repository, because its sources can then be used to compile new custom plugins, like the one shown in this tutorial.

How to create a collectd plugin for NVIDIA devices?
[The source code shown in this tutorial can be found on GitHub as part of the TANGO repository.]

Once we have the NVIDIA library and collectd deployed in our system, we can start coding the plugin.

First, we start with the skeleton of the plugin (see attached image). This skeleton is composed of four functions: initialization (my_init), read (my_read), shutdown (my_shutdown), and finally a function that registers the previous three in collectd. Collectd supports more callbacks, such as configuration, write, log and flush. The collectd wiki [2] gives an overview of the plugin architecture with all the supported functions.
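To make the registration idea concrete without requiring the collectd sources, here is a minimal, self-contained sketch of the pattern. Note that toy_host, toy_module_register and toy_host_run are hypothetical names invented for this illustration; they are not part of collectd. In a real plugin, the registration function is module_register(), which hands the callbacks to collectd through plugin_register_init(), plugin_register_read() and plugin_register_shutdown().

```c
#include <stddef.h>

/* Callback type shared by init, read and shutdown: no arguments,
 * returns 0 on success and non-zero on failure. */
typedef int (*plugin_cb)(void);

/* Hypothetical stand-in for the callbacks a host daemon keeps for a plugin.
 * This is NOT collectd's internal data structure. */
struct toy_host {
  plugin_cb init;
  plugin_cb read;
  plugin_cb shutdown;
};

/* Counters so the example can show how often each callback ran. */
int init_calls = 0;
int read_calls = 0;
int shutdown_calls = 0;

/* The three callbacks of the skeleton. A real plugin would call nvmlInit(),
 * the NVML query functions and nvmlShutdown() in these places. */
int my_init(void) { init_calls++; return 0; }
int my_read(void) { read_calls++; return 0; }
int my_shutdown(void) { shutdown_calls++; return 0; }

/* Registration function: hands the callbacks over to the host. */
void toy_module_register(struct toy_host *host) {
  host->init = my_init;
  host->read = my_read;
  host->shutdown = my_shutdown;
}

/* Drives the lifecycle: init once, read once per interval, shutdown once.
 * Simplification: this toy host stops at the first failed read; collectd's
 * own failure handling differs. Returns 0 on success, or the first non-zero
 * status reported by a callback. */
int toy_host_run(const struct toy_host *host, int intervals) {
  if (host == NULL || host->init == NULL || host->read == NULL ||
      host->shutdown == NULL)
    return -1;

  int status = host->init();
  if (status != 0)
    return status; /* after a failed init, read is never called */

  for (int i = 0; i < intervals && status == 0; i++)
    status = host->read();

  int shutdown_status = host->shutdown();
  return status != 0 ? status : shutdown_status;
}
```

Calling toy_module_register() on an empty toy_host and then toy_host_run() with three intervals runs my_init once, my_read three times and my_shutdown once, which is the same order in which collectd invokes the callbacks described next.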

The ‘initialization’ function is called only once (at the beginning), and we will use it to initialize the NVML library and other resources needed by the program.

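static int my_init(void) {
  /* Initialize NVML once, before any device query is made. */
  nvmlReturn_t result = nvmlInit();
  if (NVML_SUCCESS != result) {
    /* plugin_log accepts a printf-style format, so the NVML error text
       can go straight into the collectd log. */
    plugin_log(LOG_ERR, "nvidia_plugin: failed to initialize NVML: %s",
               nvmlErrorString(result));
    return -1; /* any non-zero value reports the failure to collectd */
  }
  return 0;
}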

The ‘shutdown’ function is also called once (at the end), and it is used to shut down the NVML library and free all the resources created by this plugin.

static int my_shutdown(void) {
  /* Release the NVML resources acquired in my_init. */
  nvmlReturn_t result = nvmlShutdown();
  if (NVML_SUCCESS != result) {
    plugin_log(LOG_WARNING, "nvidia_plugin: failed to shut down NVML: %s",
               nvmlErrorString(result));
    return -1;
  }
  return 0;
}

Finally, the ‘read’ function is responsible for calling the NVML functions that measure the metrics used by the monitoring system: temperature, clock frequency, power, etc. These values are then submitted to collectd.

The content of this function can be seen at the following URL: https://github.com/TANGO-Project/monitor-infrastructure/blob/master/Collectd/nvidia_plugin/nvidia_plugin.c
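As a rough illustration of the kind of work a read function does between querying NVML and submitting to collectd, the sketch below converts one round of raw readings into double-precision gauge values. It is not the code from the repository. The struct gpu_sample, the gpu_metric enum and gpu_sample_to_gauges are hypothetical names introduced here, and reporting power in watts is my assumption. What I do rely on is that NVML reports temperature in degrees Celsius (nvmlDeviceGetTemperature), power in milliwatts (nvmlDeviceGetPowerUsage) and clocks in MHz (nvmlDeviceGetClockInfo), all as unsigned integers.

```c
#include <stddef.h>

/* Hypothetical container for one round of raw NVML readings. */
struct gpu_sample {
  unsigned int temperature_c; /* degrees Celsius, as NVML reports it */
  unsigned int power_mw;      /* milliwatts, as NVML reports it */
  unsigned int sm_clock_mhz;  /* MHz, as NVML reports it */
};

/* Positions of the converted metrics in the output array. */
enum gpu_metric {
  GPU_TEMPERATURE_C = 0,
  GPU_POWER_W = 1,
  GPU_SM_CLOCK_MHZ = 2,
  GPU_METRIC_COUNT = 3
};

/* Converts the raw readings into doubles, the representation collectd uses
 * for gauge values. Power is scaled from milliwatts to watts (an assumption
 * of this sketch). Returns 0 on success, -1 on invalid arguments. */
int gpu_sample_to_gauges(const struct gpu_sample *sample, double *out,
                         size_t out_len) {
  if (sample == NULL || out == NULL || out_len < (size_t)GPU_METRIC_COUNT)
    return -1;

  out[GPU_TEMPERATURE_C] = (double)sample->temperature_c;
  out[GPU_POWER_W] = (double)sample->power_mw / 1000.0;
  out[GPU_SM_CLOCK_MHZ] = (double)sample->sm_clock_mhz;
  return 0;
}
```

In a real plugin, each of these values would be copied into a collectd value_t and handed over with plugin_dispatch_values(). I left that part out so that the sketch compiles without the collectd and NVML headers.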

Once the plugin (C file) is ready, the next step is to compile it. For this I use the collectd source files and link against the NVIDIA library. In the following command, COLLECTD_SOURCES and NVIDIA_LIBS are placeholders for the directories that contain the collectd sources and the NVIDIA library on your system:

gcc -DHAVE_CONFIG_H -Wall -Werror -g -O2 -shared -fPIC -I/COLLECTD_SOURCES/ -I/COLLECTD_SOURCES/daemon/ -Wl,--no-as-needed -lnvidia-ml -LNVIDIA_LIBS/ -ldl -o nvidia_plugin.so nvidia_plugin.c

The resulting file, nvidia_plugin.so, has to be copied into the collectd plugins folder (/opt/collectd/lib/). It also has to be added and “enabled” in the collectd configuration file (collectd.conf):

...
LoadPlugin nvidia_plugin
...

After following all these steps and restarting collectd, you should be able to monitor an NVIDIA device :)

I hope you found this tutorial helpful. Besides the complete source code shown in this blog post, the TANGO Monitoring Infrastructure repository contains more information that can be useful (e.g. how to add new types to the collectd types collection, or how to connect collectd to other monitoring or visualization tools).

REFERENCES

[1] NVML - Reference Manual https://docs.nvidia.com/deploy/pdf/NVML_API_Reference_Guide.pdf

[2] Collectd Wiki https://collectd.org/wiki/index.php/Plugin_architecture