The initial code
The code I'm aiming to improve is EDuke32's AudioLib. Used by open-source versions of many classic 90's DOS games, this code has a long and storied history, and can be traced back to Jonathon Fowler's C conversion of Apogee's original assembler code.
The original Apogee code looks like it was optimised for playback of unsigned 8-bit sound samples. A volume-specific lookup table is used to convert each sample to signed 16-bit and apply volume scaling (with a healthy dose of self-modifying code for patching in the correct table/buffer addresses). For mid-nineties x86 hardware, this is probably the best approach available. And for ARM, table lookup is a pretty good approach too. But changes which have been made over the years have eroded the performance of the code, making it many times slower than the original. How slow exactly? Slow enough for audio mixing in my NBlood port to eat up over half of the CPU time.
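To illustrate the general idea, here's a minimal sketch of table-based mixing. This is my own illustration of the technique, not Apogee's actual code - the table layout and the 0-256 volume scale are assumptions:

#include <cstdint>
#include <cstddef>

/* One table per volume level: maps an unsigned 8-bit sample directly to a
   volume-scaled signed 16-bit value, so the mix loop needs no multiplies */
static int16_t volume_table[256];

static void build_volume_table(int volume) /* hypothetical 0-256 scale */
{
    for (int s = 0; s < 256; s++)
        volume_table[s] = (int16_t) ((s - 128) * volume);
}

static void mix_voice(int16_t *dest, const uint8_t *source, size_t length)
{
    for (size_t i = 0; i < length; i++)
        dest[i] += volume_table[source[i]]; /* one lookup per sample; clamping omitted */
}

The point is that the signed conversion and the volume multiply are both folded into the table, leaving just a load, an add, and a store per sample.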
My profiling results showed that 20% of the CPU time was spent in MV_Mix16BitMono, and a further 32% in the library function lrintf. Examination of the source, and of what GCC has done (and failed to do) during optimisation, gives me some clues as to why performance is so bad - most obviously, the mixing code bounces back and forth between integer and floating point, forcing an expensive float-to-int conversion (all that time in lrintf) for every sample mixed.
Additionally, the table-based volume scaling system will only be accurate for 8-bit sample data. For 16-bit samples, the practice of performing two table lookups (one for the top half of the sample and one for the bottom half) introduces significant inaccuracies.
The optimisations
Based on the above analysis, it's easy to draw up a list of optimisations to try out: a float-only version, a LUT-only version, a pure integer version, and a series of ARM-specific versions (ARMv3M, ARMv5TE, ARMv6, and NEON).
That's a lot of combinations to build and test.
Also, note that although my original profiling highlighted MV_Mix16BitMono as the main function being used (8-bit mono source to 16-bit mono dest), for this initial round of optimisation I ended up focusing purely on the more complex MV_Mix16BitStereo16 function (16-bit mono source to 16-bit stereo dest). I suspect this was actually a mistake - I misread my notes and thought it was MV_Mix16BitStereo16 which was taking up all the time, not MV_Mix16BitMono. So later on I may need to do a second round of profiling, to make sure that the conclusions I'm drawing from these tests are still valid for the 8-bit routines.
Multi-compilation using make
To simplify testing, I developed the following makefile, which handles compiling each test program/source file for multiple different CPUs and with different optimisation settings.
CPUS = soft fpa armv3m-soft armv3m-fpa armv5te-soft armv5te-fpa armv6-vfp armv6-fpa armv7-neon armv7-fpa
OPTS = O2 O3 On
GCC_soft = -mfpu=fpa -msoft-float
GCC_fpa = -mfpu=fpa -mhard-float -mlibscl -DNO_ALIGNED_MALLOC
GCC_armv3m-soft = -march=armv3m -mtune=strongarm $(GCC_soft)
GCC_armv3m-fpa = -march=armv3m -mtune=strongarm $(GCC_fpa)
GCC_armv5te-soft = -march=armv5te -mtune=xscale $(GCC_soft)
GCC_armv5te-fpa = -march=armv5te -mtune=xscale $(GCC_fpa)
GCC_armv6-vfp = -mcpu=arm1176jzf-s -mfpu=vfp -mfloat-abi=softfp
GCC_armv6-fpa = -mcpu=arm1176jzf-s $(GCC_fpa)
GCC_armv7-neon = -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp
GCC_armv7-fpa = -mcpu=cortex-a8 $(GCC_fpa)
GCC_O2 = -O2
GCC_O3 = -O3
GCC_On = -O3 -funroll-loops -fmodulo-sched -ftree-vectorize -ffast-math
SRCS = armv3m armv5te armv6 float integer justlut orig neon
# CPU+Optimisation combinations
CPUOPTS = $(foreach cpu,$(CPUS),$(addprefix $(cpu)_, $(OPTS)))
# Test + CPU + optimisation
DESTS = $(foreach src,$(SRCS),$(addprefix bin/$(src)_, $(CPUOPTS)))
GCCFLAGS = -Wall -static -std=gnu++11
empty =
space = $(empty) $(empty)
all:: $(DESTS)
$(DESTS):
	g++ $(GCCFLAGS) $(foreach word,$(wordlist 2,100,$(subst _,$(space),$@)), $(GCC_$(word))) -o $@ $(notdir $(word 1,$(subst _,$(space),$@)).cpp) -MMD -MF d/$(notdir $(basename $@))
	-elf2aif $(subst /,.,$@)
# Dependencies
-include d/*
Makefile for easy compilation with multiple compiler settings
The above makefile will take each of the source files (armv3m.cpp, integer.cpp, justlut.cpp, etc.) and produce multiple outputs (in a 'bin' folder), each using different compiler settings. E.g. the file orig_armv3m-fpa_O2 corresponds to orig.cpp, built with the GCC_armv3m-fpa and GCC_O2 compiler settings.
The key parts of the makefile are the CPUS and OPTS lists, which are expanded (via foreach) into the full list of bin/<source>_<cpu>_<opt> targets, and the build rule, which splits each target name apart at the underscores in order to look up the matching GCC_<cpu> and GCC_<opt> flag variables.
The makefile also asks GCC to output dependency information, so that dependencies don't have to be tracked manually. (This will require you to manually create a 'd' folder before running it for the first time.)
Using the above makefile, my eight algorithm implementations get compiled to 240 binaries ready for execution. However, note that some of these will be invalid - e.g. NEON intrinsics compiled for an ARMv3 target. In these cases, #defines were used to detect the CPU architecture at compile time, allowing the code to compile cleanly and exit with a simple "unsupported" message at runtime.
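As an illustration of that guard pattern, a sketch along these lines (using GCC's predefined __ARM_NEON__ macro; the message text is hypothetical):

#include <cstdio>

int main()
{
#if defined(__ARM_NEON__)
    /* Target has NEON: safe to compile and run the intrinsics version */
    std::puts("NEON build - running benchmark");
    /* ... NEON mixing loop goes here ... */
    return 0;
#else
    /* Built for a CPU without NEON: the intrinsics are #if'd out entirely,
       so compilation succeeds and the binary just reports itself unusable */
    std::puts("unsupported: this test requires NEON");
    return 1;
#endif
}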
With a few more helper scripts (including TSL and !NetTask to allow for remote execution over my home network), I had a setup where I could easily run all the programs on multiple machines and collect together the profiling results.
ARM optimisations in detail
Before we get to the results, here are the inner loops of the different ARM-optimised routines:
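For reference, the plain integer version's inner loop is roughly the following shape. This is my reconstruction for comparison purposes, reusing the identifiers from the listings below, and may differ from the exact test code:

do
{
    int16_t const isample0 = B_LITTLE16(source[position >> 16]);
    position += rate;
    int32_t left = *dest + ((isample0 * leftvol) >> 16);
    int32_t right = *(dest + (MV_RightChannelOffset >> 1)) + ((isample0 * rightvol) >> 16);
    /* Clamp to the signed 16-bit range - the branchy compare-and-clamp that
       the ARM versions below replace with saturating tricks */
    if (left < -32768) left = -32768; else if (left > 32767) left = 32767;
    if (right < -32768) right = -32768; else if (right > 32767) right = 32767;
    *dest = (int16_t) left;
    *(dest + (MV_RightChannelOffset >> 1)) = (int16_t) right;
    dest += MV_SampleSize >> 1;
}
while (--length);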
ARMv3M
int32_t * __restrict dest32 = (int32_t *) dest;
do
{
/* 16.16 fixed-point resampling: the top 16 bits of 'position' index the source */
int16_t const isample0 = B_LITTLE16(source[position >> 16]);
position += rate;
/* Load the left/right pair of 16-bit dest samples as a single 32-bit word */
int32_t mix = *dest32;
int32_t left,right;
uint32_t magic = 0x80000000;
int32_t const sample0 = isample0*leftvol;
int32_t const sample1 = isample0*rightvol;
/* Saturating adds: left = sat(sample0 + (mix << 16)), right = sat(sample1 + mix) */
asm("adds %0,%1,%2,lsl #16\n\tsbcvs %0,%3,#0" : "=r" (left) : "r" (sample0), "r" (mix), "r" (magic) : "cc");
asm("adds %0,%1,%2\n\tsbcvs %0,%3,#0" : "=r" (right) : "r" (sample1), "r" (mix), "r" (magic) : "cc");
/* Repack the top halves of the two results into a single 32-bit store */
*dest32++ = (((uint32_t)left)>>16) | (right & 0xffff0000);
}
while (--length);
ARMv3M (StrongARM) optimised mixing loop
Notes: the adds + sbcvs pairs implement saturating additions. If the adds overflows, the conditional sbcvs computes 0x80000000 - 0 - NOT(carry), replacing the result with 0x7fffffff (carry clear, i.e. positive overflow) or 0x80000000 (carry set, i.e. negative overflow).
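If the flag trickery is hard to follow, this is the equivalent operation in plain C++ (my illustration, using a GCC builtin):

#include <cstdint>

/* What the adds + sbcvs pair computes: an ordinary add, except that on
   signed overflow the result saturates to INT32_MAX or INT32_MIN (the
   carry flag left behind by adds selects between the two extremes) */
static int32_t saturating_add(int32_t a, int32_t b)
{
    int32_t result;
    if (__builtin_add_overflow(a, b, &result))
        result = (a < 0) ? INT32_MIN : INT32_MAX;
    return result;
}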
ARMv5TE
int32_t * __restrict dest32 = (int32_t *) dest;
/* Volumes are negated and halved so that -32768 represents full volume,
   then packed into one register: left in the bottom half, right in the top.
   This setup is loop-invariant, so it lives outside the loop. */
uint32_t arm_volumes = (-(leftvol >> 1)) & 0xffff;
arm_volumes |= (-(rightvol >> 1)) << 16;
do
{
int16_t const isample0 = B_LITTLE16(source[position >> 16]);
position += rate;
int32_t mix = *dest32;
int32_t left,right;
left = mix << 16;
right = mix;
int32_t left2,right2;
/* smulbb/smulbt: 16x16->32 signed multiplies of the sample by each packed volume */
asm ("smulbb %0,%1,%2" : "=r" (left2) : "r" (isample0), "r" (arm_volumes));
asm ("smulbt %0,%1,%2" : "=r" (right2) : "r" (isample0), "r" (arm_volumes));
/* qdsub computes sat(a - 2*b); with the negated, halved volumes this
   amounts to a saturating mix + isample0*volume */
asm ("qdsub %0,%1,%2" : "=r" (left) : "r" (left), "r" (left2));
asm ("qdsub %0,%1,%2" : "=r" (right) : "r" (right), "r" (right2));
*dest32++ = (((uint32_t)left)>>16) | (right & 0xffff0000);
}
while (--length);
ARMv5TE (Iyonix) optimised mixing loop
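To double-check that the negate-and-halve trick really does turn qdsub into a saturating multiply-accumulate, here's a small self-contained model (my own illustration; qdsub() models the instruction's documented saturate(a - 2*b) behaviour):

#include <cassert>
#include <cstdint>

/* Model of the ARMv5TE QDSUB instruction: saturating a - 2*b */
static int32_t sat32(int64_t x)
{
    if (x > INT32_MAX) return INT32_MAX;
    if (x < INT32_MIN) return INT32_MIN;
    return (int32_t) x;
}

static int32_t qdsub(int32_t a, int32_t b)
{
    return sat32((int64_t) a - sat32(2 * (int64_t) b));
}

int main()
{
    int32_t leftvol = 65536;                          /* full volume */
    int16_t packed_vol = (int16_t) (-(leftvol >> 1)); /* negated & halved: -32768 */
    int16_t isample = 1000;
    int32_t mix = 123;                                /* existing dest sample */
    int32_t product = (int32_t) isample * packed_vol; /* the smulbb step */
    /* The double-subtraction of the negated product is an addition */
    assert(qdsub(mix << 16, product) == (mix << 16) + isample * leftvol);
    return 0;
}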
ARMv6
do
{
int16_t const isample0 = B_LITTLE16(source[position >> 16]);
position += rate;
int32_t left = *dest;
int32_t right = *(dest + (MV_RightChannelOffset >> 1));
/* left += (isample0 * leftvol) >> 16 */
/* Of course, this prevents volume boosting */
asm ("smlawb %0,%1,%2,%3" : "=r" (left) : "r" (leftvol), "r" (isample0), "r" (left));
asm ("smlawb %0,%1,%2,%3" : "=r" (right) : "r" (rightvol), "r" (isample0), "r" (right));
asm ("ssat %0,#16,%1" : "=r" (left) : "r" (left));
asm ("ssat %0,#16,%1" : "=r" (right) : "r" (right));
*dest = left;
*(dest + (MV_RightChannelOffset >> 1)) = right;
dest += MV_SampleSize >> 1;
}
while (--length);
ARMv6 (Pi 1) optimised mixing loop
NEON
/* Volumes are negated and halved so that -32768 represents full volume */
/* Workaround bug in RISC OS GCC 4.7.4: vdup_n_s16, vset_lane_s16 with non-const values fail, so pack the values into a u32 instead */
uint32_t arm_volumes = (-(leftvol >> 1)) & 0xffff;
arm_volumes |= (-(rightvol >> 1)) << 16;
int16x4_t volumes = vreinterpret_s16_u32(vdup_n_u32(arm_volumes));
while(length >= 4)
{
/* Load 4 samples */
/* Initialised to avoid reading an undefined vector in the first lane load */
int16x4_t isample = vdup_n_s16(0);
isample = vld1_lane_s16(source + (position>>16), isample, 0);
position += rate;
isample = vld1_lane_s16(source + (position>>16), isample, 1);
position += rate;
isample = vld1_lane_s16(source + (position>>16), isample, 2);
position += rate;
isample = vld1_lane_s16(source + (position>>16), isample, 3);
position += rate;
/* Load and de-interleave dest buffer */
int16x4x2_t mix = vld2_s16(dest);
/* Volume scaling */
int16x4_t sample0 = vqdmulh_lane_s16(isample, volumes, 0); /* (isample * volume * 2) >> 16 */
int16x4_t sample1 = vqdmulh_lane_s16(isample, volumes, 1);
/* Accumulate, using subtraction to counter the negated volume value */
mix.val[0] = vqsub_s16(mix.val[0], sample0);
mix.val[1] = vqsub_s16(mix.val[1], sample1);
/* Interleave & store */
vst2_s16(dest, mix);
dest += 8;
length -= 4;
}
/* (Any final 1-3 samples will still need mixing by non-NEON code, not shown here) */
NEON integer optimised mixing loop
The results
For testing the performance of the routines, I settled on using six different machines, covering five different architectures, mainly focused on the slowest machines I had available in each category.
After looking over the test results, I've been able to draw a number of conclusions, detailed in the sections below.
The path to integer math
The graph below shows the performance of the float-only, LUT-only, and integer-only routines. Performance is relative to the original algorithm: e.g. the float results, which are clustered around the number 5, mean that the float algorithm is about 5 times faster than the original algorithm.
I've included two sets of test results for the ARM targets: the 'generic' results are for a binary simply compiled with -O2 (i.e. targeting a generic CPU and using softfloat math), while the 'optimal' results are for -O2 plus targeting of the machine-specific CPU & FP hardware (or softfloat if there's no FP hardware).
Although each machine is affected in different ways, there's a clear progression here: the original routine is slowest, float-only is faster than the original, LUT-only is faster than float-only, and integer-only is the fastest of them all.
One interesting pair of datapoints on the graph is the Titanium and Pi 1 optimised float results. There's a big jump from 'original' to 'float', and then a relatively flat section leading to 'LUT'. The initial jump suggests that these machines really don't like having to do lots of conversions between float and int, while the flat section suggests that (compared to the BB-xM) their FPUs are actually pretty good - the BB-xM sees a much bigger performance increase when going from float-only to LUT-only. But testing on more VFP/NEON machines is probably needed before drawing any general conclusions from this.
Also, don't be dismayed by the fact that the optimised Titanium integer version is only 11.4x faster than the original code, while the Iyonix manages to reach a 30x improvement. In terms of raw numbers, the Titanium is actually the fastest of all the machines tested.
Raw performance figures (samples per second):
Machine/test     Original        Float           LUT             Integer
BB-xM generic    685,343.8125    2,892,623.25    9,039,449       12,336,188
BB-xM optimal    1,115,506.375   5,879,865       15,978,302      17,158,516
Iyonix generic   245,856.046875  1,184,831.625   4,338,935       6,229,164.5
Iyonix optimal   256,375.546875  1,255,779.625   6,710,886.5     7,695,970.5
Pi1 generic      305,707.28125   1,335,765.5     4,559,026       6,502,797
Pi1 optimal      892,405.125     6,291,456       7,695,970.5     8,962,188
SA generic       102,650.609375  454,913.65625   1,553,445.875   2,231,012.75
SA optimal       102,902.460938  455,902.625     1,588,751.5     2,279,513
Ti generic       1,198,372.625   5,331,742.5     23,301,688      36,649,260
Ti optimal       4,559,026       29,676,680      37,948,468      51,909,704
x86 generic      9,547,569       20,936,012      27,616,620      33,604,030
From these results we can see that x86 starts out as the fastest of the lot, with 2x the performance of the Titanium. But as soon as the mixed float + integer code is removed, the Titanium jumps ahead - even when the code was compiled for a generic CPU.
ARM optimised versions
Now that we've worked out that integers are best, we need to work out which ARM optimisations are best. AudioLib contains 9 dedicated mixing functions (and a couple of other mixing loops dotted around the place), and I don't really feel like writing optimised versions for every architecture under the sun - so being able to reduce the list to one or two good ones will make things a lot more manageable.
Again, there are two lines for each machine: 'baseline' and 'optimal'. These represent whether the code was compiled for the lowest-possible CPU or the best-possible CPU; e.g. for the ARMv3M routine, running on ARMv7, 'baseline' indicates that it was compiled for ARMv3M, while 'optimal' indicates it was compiled for ARMv7. This is relevant because, in addition to working out which source code variants are important, I also need to decide how many different versions of the NBlood binary to build and distribute.
So in this case, rather than looking at performance relative to the original routine, the graph is showing performance relative to the fastest routine.
This graph is a bit harder to interpret, but if we break it down by machine then things become clearer:
Raw performance figures (samples per second):
Machine/test     Integer        ARMv3M         ARMv5TE        ARMv6          NEON
BB-xM baseline   12,336,188     19,418,074     21,559,506     23,519,460     31,457,280
BB-xM optimal    17,158,516     23,068,672     21,762,900     23,519,460     31,457,280
Iyonix baseline  6,229,164.5    9,446,631      10,082,462     -              -
Iyonix optimal   7,695,970.5    10,280,157     10,082,462     -              -
Pi1 baseline     6,502,797      11,135,320     12,098,954     13,719,685     -
Pi1 optimal      8,962,188      12,991,207     13,981,014     13,719,685     -
SA baseline      2,231,012.75   3,226,387.75   -              -              -
SA optimal       2,279,513      3,226,387.75   -              -              -
Ti baseline      36,649,260     94,371,840     128,736,064    207,618,048    268,435,456
Ti optimal       51,909,704     143,270,784    140,509,184    207,618,048    268,435,456
Conclusions
For the audiolib source, I'm likely to go with the following three code variants: Integer, ARMv3M, and NEON. ARMv5TE and ARMv6 specific versions just don't look like they're going to be worth the effort - most of the performance gains from the ARMv5TE/ARMv6 versions come from the fact that the compiler has been told to use v5 or v6 instructions, not from the changes that were made to the source code.
For (ARM) binary releases, this means we're looking at at least two configurations: NEON and ARMv3M. Possibly also ARMv5TE and/or ARMv6 (with VFP), depending on how much of an effect they have on the game as a whole.
Other notes
The observant among you will have spotted that the makefile I provided above also builds FPA versions of the binaries. Depending on the machine, the 'float' code when built for FPA is generally about 5-10 times slower than soft-float, and from 15 (BB-xM) to 58 (Titanium) times slower than VFP.
I also ran the tests twice, to measure the effects of different buffer sizes. The results presented in this article are for a small mix buffer (1024 bytes, i.e. 256 frames of 16-bit sample pairs, the same as AudioLib is typically configured to use). 4MB of voice data is mixed into this buffer (2 million audio frames), simulating the expected runtime behaviour (a small buffer which has lots of voices mixed into it).
The alternative arrangement tested, of having a large (4MB) mix buffer (with 2MB of voice data to mix into it) showed improved performance due to less time spent in the setup code at the start of the mix function. (The setup code typically contains four floating point multiplies, for calculating the two volume scale factors). However, except for the FPA builds, this performance improvement appears to be under 5%, suggesting that there isn't much to gain from tweaking the buffer size / function prologue. Also note that although the mix buffer size was 4MB, it was necessary to perform the mixing in 64K-frame chunks (effective mix buffer size of 256KB), to avoid overflows in the position variable.
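For illustration, the chunked driver looks something like this (my sketch; mix_chunk stands in for one of the inner loops shown earlier, and the stereo output layout is an assumption):

#include <cstdint>

/* Assumed signature for the inner mixing loop shown earlier */
void mix_chunk(int16_t *dest, const int16_t *source, uint32_t position,
               uint32_t rate, uint32_t length);

void mix_all(int16_t *dest, const int16_t *source, uint32_t position,
             uint32_t rate, uint32_t length)
{
    while (length)
    {
        /* A 16.16 'position' can only address 64K source frames, so never
           mix more than 64K output frames in one go */
        uint32_t chunk = (length < 65536) ? length : 65536;
        mix_chunk(dest, source, position, rate, chunk);

        /* Advance in 64-bit, then fold the whole frames into the source
           pointer so that 'position' keeps only the fractional part */
        uint64_t end = (uint64_t) position + (uint64_t) rate * chunk;
        source += end >> 16;
        position = (uint32_t) (end & 0xffff);

        dest += chunk * 2; /* interleaved stereo output */
        length -= chunk;
    }
}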
And finally, to check that my optimisations hadn't broken the code, each build also contains a unit test that compares the mix output against a known-good reference implementation, for a few different volume settings, and reports the maximum error seen.
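As a sketch of what such a check can look like (hypothetical function names, signatures, volume list, and test data - this is not AudioLib's API):

#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <cstdlib>

/* Assumed signatures: each mixes 'length' frames of 'src' into 'dest' */
void mix_optimised(int16_t *dest, const int16_t *src, uint32_t rate,
                   uint32_t length, int32_t lvol, int32_t rvol);
void mix_reference(int16_t *dest, const int16_t *src, uint32_t rate,
                   uint32_t length, int32_t lvol, int32_t rvol);

int main()
{
    static int16_t src[65536];
    static int16_t out_test[512], out_ref[512];
    for (int i = 0; i < 65536; i++)
        src[i] = (int16_t) ((i * 2654435761u) >> 16); /* arbitrary test data */

    int max_error = 0;
    static const int32_t volumes[] = { 0, 16384, 32768, 65536 };
    for (int32_t vol : volumes)
    {
        for (int i = 0; i < 512; i++) out_test[i] = out_ref[i] = 0;
        mix_optimised(out_test, src, 0x10000, 256, vol, vol);
        mix_reference(out_ref, src, 0x10000, 256, vol, vol);
        /* Track the worst per-sample deviation from the reference - small
           differences are expected from each routine's rounding behaviour */
        for (int i = 0; i < 512; i++)
            max_error = std::max(max_error, std::abs(out_test[i] - out_ref[i]));
    }
    std::printf("maximum error: %d\n", max_error);
    return 0;
}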