Dr. Lawlor's Code, Robots, & Things

January 11, 2014

CPU Performance Counters with “likwid”

Filed under: C++11, Programming — Dr. Lawlor @ 12:01 am

I’m having fun playing with a little command line performance counter toolkit with the silly name of “likwid”, which stands for “Like I Knew What I’m Doing”.  If you haven’t used performance counters while programming, they can very quickly tell you surprising things about your code’s performance, like the fact that you’re getting clobbered by cache misses, not floating point arithmetic.

To get the toolkit running on a recent Ubuntu linux machine (tested on 13.10):

(more…)

December 30, 2012

C++11 Performance Analysis

Filed under: C++11, Programming — Dr. Lawlor @ 1:29 pm

My favorite new language is C++11, also known as C++0x.  Contrary to popular belief, compiled languages aren’t dead; in fact, they’re more important than ever: the two big problems with the popular dynamically typed, interpreted languages (JavaScript, Perl, Python, PHP, Ruby, Scheme, etc.) are (1) speed and (2) robustness.  The speed problem has been getting better as people build dynamic translation engines, but the robustness problem is harder to fix.  In interpreted languages, developers don’t get compile errors; instead your *users* get runtime errors.  That is really the wrong time and place to find out you’ve passed two arguments to a function taking three, or a string where you should have a number or an object.

Anyway, C++11 has lots of cool new features, mostly aiming for more expressive code.  As a performance-oriented programmer, I always feel the need to understand the cross-platform delivered performance of any new feature before I use it very much.  So I worked my way through Alex Sinyakov’s excellent slideshow summarizing many of the new features, and built quite a few performance tests.  The platforms I tested are:

  • g++ 4.7.2, on Linux.  Generally, g++ had faster allocations, and has a little more of the language implemented.  Note that 4.7.0, from this March, is missing support for delegated constructors and literal operators, but these are supported in the most recent 4.7.2.
  • clang 3.3, on Linux. Generally this performed almost identically to g++, possibly since it’s using the same libstdc++ headers.  You want the latest version from SVN, since the 3.0 that ships with Ubuntu 12.10 doesn’t support initializer lists, seems to segfault on various simple lambdas, and doesn’t link with libstdc++.
  • Microsoft Visual Studio 2012, Release Mode, on Windows 8.  Generally, VS was better at automatically inlining functions.  But it’s missing initializer lists, delegated constructors, raw string literals, variadic templates, and a few other pieces.  Error messages for templated code are totally useless, not even giving you the types being instantiated.

I am really impressed at how much of C++11 is already working, and working the same way on these three modern compilers.  It’s also amazing how many cool language features have *zero* runtime performance cost: they’re inlined and optimized away at compile time.  The performance of most pieces of code seems to boil down to one question: is there memory allocation?  If so, the function is really slow, taking tens of nanoseconds on any of these platforms.  If not, the function is fast, taking only a couple of nanoseconds.  This means std::string and std::vector are *much* slower to create than character buffers and arrays, although accessing them is the same speed.

  • auto, nullptr, <type_traits>, static_assert, and even <tuple> work on both platforms, and have zero runtime performance cost.  Range-based for is exactly as fast as a loop across integer indices.  for_each is exactly as fast as a loop between iterators.  Returning a tuple from an actual (non-inlined) function call is only a fraction of a nanosecond more expensive than returning a native type, and exactly the same cost as returning a custom struct.
  • std::function is weirdly slow to create (10-20ns), and costs 1-2 extra function call overheads (4-6ns) to execute.  Oddly, lambdas and std::bind are entirely inlined and cost-free if you store the result in an “auto” variable, but slow if you store it in a std::function.
  • unique_ptr is nearly the same cost as a bare pointer (30-60ns), but much easier to make correct.  shared_ptr costs over twice as much (60-140ns) if you use new, probably because there’s a separate memory allocation for the reference count.  Chris Hartman pointed out that “make_shared” not only reduces code duplication, it can merge the refcount allocation, and it is indeed almost as fast as unique_ptr.
  • <random> makes it cheap to create random number engines and distributions, and accessing them is about as fast (10-20ns) as calling the old rand.

I’ve only begun to benchmark the containers, but there are a few interesting results.  Ordered from fastest to slowest, reported as write time per integer when filling a new container with 100 integers:

  • Bare [] arrays are the fastest of all, 0.3-0.4ns
  • <array> is pretty good, 0.4-0.5ns
  • <vector> is OK, 2.7ns if you’ve pre-reserved, 8-13ns if you just push_back and let it reallocate and copy.  In either case writes are far slower than any type of array, which I blame mostly on the dynamic memory allocation required by <vector> compared to the fixed-size stack allocation of <array>.
  • <deque> is decent on gcc at 4ns, slow on Visual at 30ns(!) per element.  I’ve heard gcc does block allocation to make deque faster.
  • <list> and <forward_list> are 30-90ns per element, and clearly do one allocation per element.
  • <map> and <unordered_map> are 60-130ns per element.  Currently <unordered_map> isn’t very compelling, at least for small maps up to a few thousand elements.  In my cursory tests, writes are about the same speed as the old <map>.  Unordered reads are 3x faster than <map> reads under gcc, but nearly 2x slower under Visual.

The fact that storing an integer into a <map> is 200-fold slower than storing the same integer into an array[] seems pretty surprising to me.  Too bad the STL allocator interface isn’t easy to improve.  Maybe this is decade-old news to everybody else!  Note that reads aren’t nearly as bad, probably because there’s no allocation during a read.

See the detailed numerical results and source code here.  All of these were run on my Ivy Bridge i7-3632QM laptop (quad core, 2.2-3.1GHz), though spot tests on my Sandy Bridge i5-2400 desktop run at roughly the same speed.
