The Oprofile Profiler

By Jeff Tranter Wednesday, March 2, 2016

As a developer, there are a number of useful tools that you always want to have in your arsenal for testing and debugging. This includes debuggers, memory checkers and performance profilers.

In this blog post I would like to present some information about a profiler tool that I feel deserves more recognition: oprofile.

What It Does

Oprofile ⁽¹⁾ is an open source tool for statistical performance profiling. It runs on Linux systems and uses the hardware performance counters of the CPU to perform profiling of code with low overhead. It allows all code to be profiled including applications, interrupt handlers, kernel code and shared libraries. It produces profile data at run-time, which is then post-processed into human readable reports.

Older versions of the tool required a special kernel driver and daemon, but with the Linux kernel version 2.6 or later, it uses the standard Linux kernel performance events subsystem. Because of this, unlike most profiling tools, no special recompilation or linking is needed and it can profile any process running on the system.

Oprofile works on a wide range of CPUs including Intel x86 (32 and 64-bit), DEC Alpha, MIPS, ARM, SPARC64, ppc64 and AVR32. Because it has low overhead, it is suitable for use on embedded Linux systems. On x86 and most ARM platforms, oprofile can also produce call graph profiling data.

The main limitation is that it only runs on Linux systems and requires a 2.6 or later kernel (unless you use an older version of oprofile).

Some Examples

Let's look at a simple example of profiling a C++ program.

On many Linux distributions oprofile is available as a standard package so you don't need to build it yourself. With recent versions of Ubuntu, for example, you can install it by running "sudo apt-get install oprofile".

Let's use the following C++ program as an example ⁽²⁾. It doesn't do anything particularly meaningful, but the calculations will take some time to run so that we can see where the time is spent. You can adjust the number of loop iterations if it runs too slowly or too quickly on your system.

#include <iostream>

int main()
{
    long l = 0;
    long long ll = 0;
    float f = 0;
    double d = 0;

    for (long i = 0; i < 1000000000; i++) {
        l += 2*i*i + 3*i + 4;
        ll += 3*i*i + 4*i + 5;
        f += 4*i*i + 5*i + 6;
        d += 5*i*i + 6*i + 7;
    }

    std::cout << "l = " << l << std::endl;
    std::cout << "ll = " << ll << std::endl;
    std::cout << "f = " << f << std::endl;
    std::cout << "d = " << d << std::endl;
}

You compile your code as you normally would and then run it under the operf program. When run, it will store profiling data, by default in an oprofile_data sub-directory.

On my system, I see the following output when I first compile and then run it, where it takes about seven seconds to execute (the commands I typed are in bold):

% g++ -g -Wall -o example1 example1.cpp

% operf ./example1

operf: Profiler started
l = 389932800
ll = -33545635511552
f = 6.95627e+11
d = -3.77758e+12

Profiling done.

The opreport program can then be used to read the profile data that was collected and display it in a meaningful way. Here is the output on my system when run with the default options:

% opreport --symbols
Using /home/tranter/git/jtranter/blog/oprofile/oprofile_data/samples/ for samples directory.
warning: /no-vmlinux could not be found.
CPU: Intel Ivy Bridge microarchitecture, speed 2.901e+06 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        image name               symbol name
183018   99.9727  example1                 main
33        0.0180  no-vmlinux               /no-vmlinux
5         0.0027  ld-2.19.so               _dl_lookup_symbol_x
4         0.0022  ld-2.19.so               do_lookup_x
2         0.0011  ld-2.19.so               check_match.9376
1        5.5e-04  ld-2.19.so               _dl_map_object_from_fd
1        5.5e-04  ld-2.19.so               _dl_name_match_p
1        5.5e-04  ld-2.19.so               _dl_relocate_object
1        5.5e-04  ld-2.19.so               strcmp
1        5.5e-04  libgcc_s.so.1            /lib/i386-linux-gnu/libgcc_s.so.1
1        5.5e-04  libstdc++.so.6.0.19      /usr/lib/i386-linux-gnu/libstdc++.so.6.0.19

From this we can observe that over 99% of the time was spent in the application binary itself and negligible time in the shared libraries it uses, the Linux kernel (/no-vmlinux), or the dynamic linker/loader (ld.so). Within the application, the time was spent in the function main.

To further drill down into where the time was spent in the code, we can annotate the source code with timing information. For that to work we need some symbolic debugging information. Earlier we compiled with the -g option so the binary should have that information and we can run the opannotate tool to annotate the code:

% opannotate --source --output-dir=$PWD
Using /home/tranter/git/jtranter/blog/oprofile/oprofile_data/samples/ for session-dir
warning: /no-vmlinux could not be found.
opannotate (warning): unable to open for reading: /build/buildd/eglibc-2.19/elf/dl-lookup.c
opannotate (warning): unable to open for reading: /build/buildd/eglibc-2.19/string/../sysdeps/i386/i686/multiarch/../strcmp.S
opannotate (warning): unable to open for reading: /build/buildd/eglibc-2.19/elf/get-dynamic-info.h
opannotate (warning): unable to open for reading: /build/buildd/eglibc-2.19/elf/dl-misc.c
opannotate (warning): unable to open for reading: /build/buildd/eglibc-2.19/elf/../sysdeps/i386/dl-machine.h

If we had source code available for the C run-time library, it could have annotated that for us, but this was not available, hence the above warnings.

We should now have an annotated version of the source file for the application itself. On my system it looked like the following:

               :#include <iostream>
               :
               :int main()
               :{ /* main total: 186787 99.9615 */
               :    long l = 0;
               :    long long ll = 0;
               :    float f = 0;
               :    double d = 0;
               :
  7720  4.1315 :    for (long i = 0; i < 1000000000; i++) {
 20853 11.1598 :        l += 2*i*i + 3*i + 4;
 31944 17.0952 :        ll += 3*i*i + 4*i + 5;
 78718 42.1270 :        f += 4*i*i + 5*i + 6;
 47552 25.4481 :        d += 5*i*i + 6*i + 7;
               :    }
               :
               :    std::cout << "l = " << l << std::endl;
               :    std::cout << "ll = " << ll << std::endl;
               :    std::cout << "f = " << f << std::endl;
               :    std::cout << "d = " << d << std::endl;
               :}
/* 
 * Total samples for file : "/home/tranter/git/jtranter/blog/oprofile/example1.cpp"
 * 
 * 186787 99.9615
 */

/* 
 * Command line: opannotate --source --output-dir=/home/tranter/git/jtranter/blog/oprofile 
 * 
 * Interpretation of command line:
 * Output annotated source file with samples
 * Output all files
 * 
 * CPU: Intel Ivy Bridge microarchitecture, speed 2.901e+06 MHz (estimated)
 * Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
 */

From the above we can see the number of "hits" or times that the statistical profiler found the code executing the line of code in question, as well as the percentage of overall time spent on that line.

In this particular program, we see the majority of the time was spent, not surprisingly, in the for loop.

We also make some other observations, such as the fact it apparently takes longer to do the calculation using type float than it does with type double. Further investigation would probably be warranted if we wanted to confirm if that was actually the case.

As an example of the overhead of running with the profiler, we can compare the execution times and see that it was not much slower than when running the application normally:

% time operf ./example1
operf: Profiler started
l = 389932800
ll = -33545635511552
f = 6.95627e+11
d = -3.77758e+12
Profiling done.
real    0m7.467s
user    0m9.088s
sys     0m5.824s

% time ./example1
l = 389932800
ll = -33545635511552
f = 6.95627e+11
d = -3.77758e+12
real    0m6.749s
user    0m6.740s
sys     0m0.000s

Note that the sys time, was much higher. This is time spent in the Linux kernel and is presumably higher due to the use of the performance counters in the kernel.

One tip: If you don't see any annotated source file after running opannotate, the most likely reason is that not enough time was spent to get any profiler sample hits. Only files that had samples get annotated.

Now let's look at a simple Qt program example. This example uses some of the Qt container classes and might give some insight into how they compare in terms of performance. Here is the source listing:

#include <QDebug>
#include <QLinkedList>
#include <QList>
#include <QVector>

int main()
{
    QList<int> list;
    QLinkedList<int> linkedList;
    QVector<int> vector;

    for (long i = 0; i < 10000000; i++) {
        list << i;
        linkedList << i;
        vector << i;
    }

    /*
    for (long i = 0; i < 500000; i++) {
        list.prepend(i);
        linkedList.prepend(i);
        vector.prepend(i);
    }
    */

    long long total = 0;
    foreach (auto item, list) {
        total += item;
    }
    qDebug() << "total = " << total;

    total = 0;
    foreach (auto item, linkedList) {
        total += item;
    }
    qDebug() << "total = " << total;

    total = 0;
    foreach (auto item, vector) {
        total += item;
    }
    qDebug() << "total = " << total;
}

These are the commands I used to build and run it and collect profiling information:

% qmake

% make

% operf ./example2
operf: Profiler started
total =  49999995000000
total =  49999995000000
total =  49999995000000
Profiling done.

% opreport --symbols

Using /home/tranter/git/jtranter/blog/oprofile/oprofile_data/samples/ for samples directory.
warning: /no-vmlinux could not be found.
CPU: Intel Ivy Bridge microarchitecture, speed 2.901e+06 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        image name               symbol name
7356     12.5832  example2                 main
6474     11.0744  libc-2.19.so             _int_malloc
3723      6.3686  example2                 QtPrivate::RefCount::isShared() const
3561      6.0914  example2                 QLinkedList<int>::append(int const&)
3142      5.3747  libc-2.19.so             malloc
3040      5.2002  example2                 __x86.get_pc_thunk.bx
2875      4.9180  libc-2.19.so             _int_free
2777      4.7503  no-vmlinux               /no-vmlinux
2383      4.0764  example2                 QVector<int>::append(int const&)
2309      3.9498  libstdc++.so.6.0.19      /usr/lib/i386-linux-gnu/libstdc++.so.6.0.19
2219      3.7958  libQt5Core.so.5.5.1      /usr/local/Qt-5.5.1/lib/libQt5Core.so.5.5.1
2217      3.7924  example2                 QBasicAtomicInteger<int>::load() const
1857      3.1766  example2                 QList<int>::append(int const&)
1794      3.0688  example2                 QArrayData::data()
1174      2.0082  example2                 QLinkedList<int>::detach()
1159      1.9826  example2                 QTypedArrayData<int>::end(int*)
1013      1.7328  libc-2.19.so             free
857       1.4660  example2                 QLinkedList<int>::freeData(QLinkedListData*)
793       1.3565  example2                 QTypedArrayData<int>::data()
757       1.2949  example2                 QVector<int>::isDetached() const
691       1.1820  example2                 QList<int>::operator<<(int const&)
689       1.1786  example2                 QLinkedListNode<int>::QLinkedListNode(int const&)
585       1.0007  example2                 QVector<int>::operator<<(int const&)
575       0.9836  example2                 QLinkedList<int>::operator<<(int const&)
554       0.9477  example2                 QList<int>::const_iterator::operator*() const
538       0.9203  libc-2.19.so             __x86.get_pc_thunk.bx
448       0.7663  example2                 QLinkedList<int>::const_iterator::operator!=(QLinkedList<int>::const_iterator const&) const
438       0.7492  example2                 QList<int>::const_iterator::operator!=(QList<int>::const_iterator const&) const
427       0.7304  example2                 QList<int>::node_construct(QList<int>::Node*, int const&)
419       0.7167  example2                 QList<int>::const_iterator::operator++()
403       0.6894  example2                 QLinkedList<int>::const_iterator::operator++()
321       0.5491  example2                 qt_noop()
294       0.5029  example2                 QLinkedList<int>::const_iterator::operator*() const
287       0.4909  libc-2.19.so             __memcpy_ssse3_rep
269       0.4602  example2                 QList<int>::Node::t()
13        0.0222  ld-2.19.so               do_lookup_x
10        0.0171  ld-2.19.so               _dl_lookup_symbol_x
5         0.0086  ld-2.19.so               strcmp
2         0.0034  ld-2.19.so               _dl_relocate_object
2         0.0034  ld-2.19.so               strlen
2         0.0034  libicuuc.so.52.1         /usr/lib/i386-linux-gnu/libicuuc.so.52.1
1         0.0017  ld-2.19.so               _dl_catch_error
1         0.0017  ld-2.19.so               _dl_fixup
1         0.0017  ld-2.19.so               _dl_map_object_deps
1         0.0017  ld-2.19.so               dl_main
1         0.0017  libc-2.19.so             _dl_addr
1         0.0017  libc-2.19.so             pthread_mutex_init
1         0.0017  libc-2.19.so             realloc

% opannotate --source --output-dir=$PWD
% less home/tranter/git/jtranter/blog/oprofile/example2.cpp

               :#include <QDebug>
               :#include <QLinkedList>
               :#include <QList>
               :#include <QVector>
               :
               :int main()
               :{ /* main total:   7356 12.5832 */
               :
               :    QList<int> list;
               :    QLinkedList<int> linkedList;
               :    QVector<int> vector;
               :
   319  0.5457 :    for (long i = 0; i < 10000000; i++) {
   330  0.5645 :        list << i;
   477  0.8160 :        linkedList << i;
   235  0.4020 :        vector << i;
               :    }
               :
               :    /*
               :    for (long i = 0; i < 500000; i++) {
               :        list.prepend(i);
               :        linkedList.prepend(i);
               :        vector.prepend(i);
               :    }
               :    */
               :
               :    long long total = 0;
  1625  2.7797 :    foreach (auto item, list) {
   504  0.8621 :        total += item;
               :    }
               :    qDebug() << "total = " << total;
               :
               :    total = 0;
  1661  2.8413 :    foreach (auto item, linkedList) {
   576  0.9853 :        total += item;
               :    }
               :    qDebug() << "total = " << total;
               :
               :    total = 0;
  1124  1.9227 :    foreach (auto item, vector) {
   505  0.8639 :        total += item;
               :    }
               :    qDebug() << "total = " << total;
               :}
/* 
 * Total samples for file : "/home/tranter/git/jtranter/blog/oprofile/example2.cpp"
 * 
 *   7356 12.5832
 */

/* 
 * Command line: opannotate --source --output-dir=/home/tranter/git/jtranter/blog/oprofile 
 * 
 * Interpretation of command line:
 * Output annotated source file with samples
 * Output all files
 * 
 * CPU: Intel Ivy Bridge microarchitecture, speed 2.901e+06 MHz (estimated)
 * Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
 */

From the above, a few observations that immediately jump out are that the application spends a considerable amount of time in the standard C library calling malloc or related functions and in Qt is spending a lot of time calling QtPrivate::RefCount::isShared() and QLinkedList<int>::append().

Looking at the annotated source code, not a lot of time is spent in the code in the main function itself, with the iterators taking most of this time. The times spent iterating over and appending to the QList, QLinkedList and QVector data structures are not too different.

For comparison, if you uncomment the code that prepends the elements to the start of the containers rather than appending to the end, you will see a significant difference in performance between the different container types. For example, the majority of time is being spent in the memmove() function as items are prepended to the QVector and the existing elements have to be moved. On my system oprofile even annotated the relevant Qt source code where this was happening.

Summary

Oprofile has a number of advanced features we didn't cover. It can produce call graph information that can be read and displayed by the graphical KCacheGrind ⁽³⁾ application. Because oprofile is a system wide profiler, it can generate performance information for all processes running on the system (subject to permissions). This is useful if your application communicates with other processes that are the performance bottleneck. This can be the case, for example, when using X11 or Wayland as your rendering back end.

Unfortunately, oprofile doesn't work on the Raspberry Pi platform due to the ARM and SOC hardware used on that platform. Other profiling tools such as gprof, gperftools and valgrind should work ⁽⁴⁾, but I have not personally tried them.

Oprofile is one of the tools you should become familiar with and have in your developer bag of tricks. If you have used other profiling tools in the past, I encourage you to take a look at it.

References

OProfile Project, SourceForge web site, last accessed 15 Jan 2016, http://oprofile.sourceforge.net
Source Code for Example Programs, last accessed 15 Jan 2016, ftp://ftp.ics.com/pub/pickup/oprofile.zip
The KCachegrind Handbook, KDE documentation web site, last accessed 15 Jan 2016, https://docs.kde.org/trunk4/en/kdesdk/kcachegrind/index.html
Software Profiling tools for Raspberry Pi, blog post, last accessed 15 Jan 2016, http://www.softwarecoven.com/software-profiling-tools-for-raspberry-pi/