Digital signal processing (DSP) benefits from early analysis of computation. David Turnbull tells me about his exploration optimizing the FFT algorithm with C++11 using recursive templates and profile-guided optimization.
See Software Defined Radio to understand David's work and our interest in it.
The AE9RB DSP core library includes a custom benchmarking and testing framework specifically designed for DSP. Here example output from GCC 4.9.
After considering a variety of implementation technologies David has chosen C++ and defends his choice by describing two optimizations new to me.
Our conversation took place at DorkbotPDX but included material previously presented at PDXCPP. meetup
Templates
Templates exist in C++ to provide compile time substitutions of types into algorithms so that the code generator can choose instructions appropriate to those types on a case by case basis.
Beyond the intended application in collection classes, template based pre-compilation has become an art in of itself sometimes called generative programming or template metaprogramming. wikipedia
Policy-based design uses an organizational variant of metaprogramming where policy classes provide strategy-like configurations resolved at compile time. wikipedia
The discrete Fourier transform has considerable computational redundancy that can be omitted with careful analysis such as that recognized by Cooley & Tukey in 1965. wikipedia
The fast Fourier transform (FFT) uses a divide and conquer approach that saves redundant multiplies. It is reminiscent of recursive sort algorithms that save redundant compares. Both are thus of computational order n log n.
Dijkstra fought to have recursive functions included in Algol60 even if they were then considered costly. pdf
Tony Hoare artfully exploited Algol60 recursion in Quicksort. wikipedia
David captures the recursion in FFT by defining the algorithm within a C++ template operating on a known size vector of data. The recursion happens at compile time as exponentially smaller problems are cast to function definition of fixed size. github
The optimizer is free to reconsider each size variation independently and might choose different strategies based on the availability of registers. Note that since the size of each instantiation is known and constant this frees one more register for the computation at hand.
David breaks the recursion by defining the trivial transform of size one. github
template<typename T> class Radix2<T, 1> { public: void mix(T* data) { } };
Templates here provide pre-compilation logic more normally associated with macro preprocessors.
The C language has always used a weak preprocessor for file inclusion and conditional compilation.
The Lisp family of languages have traditionally used macros which work well since Lisp source looks like data.
The template techniques described here benefit from the comprehensive optimization available downstream.
Arlo Belshee first described template metaprogramming to me and showed how he used it for concrete realization of software patterns.
Paul Stoffregen has shown me where he bends the gcc code generator to his will through extensive case analysis.
Profiles
Once the recursive FFT has been unwound into separate functions for each recursion those functions become subject to more exotic optimizations.
David brought to my attention profile-guided optimization where run-time statistics are fed back into the optimizer based on trial runs with real data. wikipedia
My association with profiling dates back to the '70s with SPR, the PP program on the CDC-6000 that would sample the program counter and print a report at the end of a job step. bibtex
Direct feedback of profile information into an optimizer is more associated with dynamic optimizers such as query optimizers in database systems or just-in-time optimizers in byte-code interpreters.
Gcc, Microsoft, Sun/Oracle and Intel all offer a feedback path for profile data into their C++ optimizers. However the subject seems not widely discussed save for Mozilla's application to Firefox. webpage