Profiling Tools

Most known is the GCC profiling tool gprof: one needs to compile the program with option -pg; running the program generates a file gmon.out, which can be transformed into human-readable form with gprof. One disadvantage is the required re-compilation step to prepare the executable, which has to be statically linked. The method used here is compiler-generated instrumentation, which measures call arcs happening among functions and corresponding call counts, in conjunction with TBS, which gives a histogram of time distribution over the code. Using both pieces of information, it is possible to heuristically calculate inclusive time of functions, i.e. time spent in a function together with all functions called from it.

For exact measurement of events happening, libraries exist with functions able to read out hardware performance counters. Most known here is the PerfCtr patch for Linux®, and the architecture independent libraries PAPI and PCL. Still, exact measurement needs instrumentation of code, as stated above. Either one uses the libraries itself or uses automatic instrumentation systems like ADAPTOR (for FORTRAN source instrumentation) or DynaProf (code injection via DynInst).

OProfile is a system-wide profiling tool for Linux® using Sampling.

In many aspects, a comfortable way of Profiling is using Cachegrind or Callgrind, which are simulators using the runtime instrumentation framework Valgrind. Because there is no need to access hardware counters (often difficult with today's Linux® installations), and binaries to be profiled can be left unmodified, it is a good alternative to other profiling tools. The disadvantage of simulation - slowdown - can be reduced by doing the simulation on only the interesting program parts, and perhaps only on a few iterations of a loop. Without measurement/simulation instrumentation, Valgrind's usage only has a slowdown factor in the range of 3 to 5. Also, when only the call graph and call counts are of interest, the cache simulator can be switched off.

Cache simulation is the first step in approximating real times, since runtime is very sensitive to the exploitation of so-called caches, small and fast buffers which accelerate repeated accesses to the same main memory cells, on modern systems. Cachegrind does cache simulation by catching memory accesses. The data produced includes the number of instruction/data memory accesses and first- and second-level cache misses, and relates it to source lines and functions of the run program. By combining these miss counts, using miss latencies from typical processors, an estimation of spent time can be given.

Callgrind is an extension of Cachegrind that builds up the call graph of a program on-the-fly, i.e. how the functions call each other and how many events happen while running a function. Also, the profile data to be collected can separated by threads and call chain contexts. It can provide profiling data on an instruction level to allow for annotation of disassembled code.