Optimization for ARM Processors

14 Mar 2017

Embedded systems use a wide range of processors: from 8-bit to 64-bit families, from 32 bytes to a few gigabytes of RAM, and from kilohertz to gigahertz clock sources. As a processor's resources increase, so does its cost, so it may be possible to use a cheaper processor by optimising the code. When optimising embedded code, the trade-off is between speed, memory and power consumption. Speed and memory usually pull against each other (faster code tends to need more memory), while power consumption depends on many factors. The designer's aim is to find the point at which the system is efficient in terms of speed, memory space and power consumption. This is a multi-objective optimisation problem: it is easy if one of the objectives can be sacrificed, and challenging if a true optimum across all of them is required.

Modern embedded processors have complex architectures that let the designer control individual units, so sophisticated compilers are needed to use them efficiently. ARM DS-5 is an Eclipse-based IDE for Linux-based and bare-metal embedded systems. It optimises the code and configures the ARM processor for the best performance and the smallest code size, and it has options to change the optimisation settings or to enable and disable modules inside the processor. The compiler can optimise for either speed or code size. Each option has four optimisation levels, where -O0 is the minimum and -O3 is the maximum; the price of -O3 is a poor debug view. However, compiler optimisations do not guarantee an improvement, and they are not fully reliable because their implementation details can change with a compiler update. It is good practice to write efficient code rather than trust the compiler to do its best.

One of the most important optimisation techniques is loop optimisation, because loops usually account for a large portion of the execution time. A loop can execute thousands or millions of times, so removing even a single machine instruction from its body can have a significant effect. Decrementing loops and loop unrolling are two ways to optimise code for speed.

Decrementing Loop

When the loop termination condition is written to count down to zero instead of counting up from zero, the compiler can generate less assembly code. I tested this with dummy code, and the assembly output was significantly shorter, as seen in Figure 1 below.

(edit: There was an error with the image so it has been changed.)

Figure 1 – Decremental Loop vs Incremental Loop
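As a rough illustration of the count-down idea described above, here is a minimal C sketch of a factorial written both ways. The function names fact_count_up and fact_count_down are made up for this example and are not taken from the figure. Whether the count-down version really produces fewer instructions depends on the compiler, target and optimisation level, so it is worth checking the generated assembly.

```c
/* Count-up form: the loop test compares i against n on every iteration,
   which typically needs an explicit compare instruction. */
unsigned int fact_count_up(unsigned int n)
{
    unsigned int fact = 1;
    for (unsigned int i = 1; i <= n; i++) {
        fact *= i;
    }
    return fact;
}

/* Count-down form: the loop ends when the counter reaches zero, so the
   compiler may be able to fold the test into the decrement itself. */
unsigned int fact_count_down(unsigned int n)
{
    unsigned int fact = 1;
    for (unsigned int i = n; i != 0; i--) {
        fact *= i;
    }
    return fact;
}
```

Both functions return the same value for the same input; only the shape of the loop condition differs.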

Unlike on traditional RISC machines, the assembly code does not show the exact timing on ARM processors, because of pipelining, caches and other complex architectural features such as the NEON engine. Fewer instructions still indicate that performance has improved. Since the ARM7 family executes one instruction in 1.9 cycles on average, removing instructions improves performance and also decreases code size. The exact timing can be measured by benchmarking with statistical methods or by simulation with the ARMulator program.
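A very simple form of statistical benchmarking is to run the code many times and take the average. The sketch below assumes a hosted C environment where the standard clock() function is available; on a bare-metal target a hardware timer or cycle counter would have to take its place. The names example_workload, bench_sink and mean_seconds_per_call are hypothetical and exist only for this illustration.

```c
#include <stddef.h>
#include <time.h>

/* Written through a volatile so the compiler cannot discard the work. */
volatile unsigned long bench_sink;

/* Hypothetical workload: sums the integers 1..1000 with a count-down loop. */
void example_workload(void)
{
    unsigned long acc = 0;
    for (unsigned int i = 1000; i != 0; i--) {
        acc += i;
    }
    bench_sink = acc;
}

/* Calls fn() `runs` times and returns the mean processor time per call in
   seconds, as reported by clock(). Returns -1.0 for invalid arguments. */
double mean_seconds_per_call(void (*fn)(void), unsigned int runs)
{
    if (fn == NULL || runs == 0) {
        return -1.0;
    }
    clock_t start = clock();
    for (unsigned int r = runs; r != 0; r--) {
        fn();
    }
    clock_t end = clock();
    return ((double)(end - start) / (double)CLOCKS_PER_SEC) / (double)runs;
}
```

Calling mean_seconds_per_call(example_workload, 100000) on two builds of the same code gives a rough comparison; the resolution of clock() is limited, so a large run count helps.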

Loop Unrolling

Another way to increase speed is loop unrolling, where the loop counter is updated less often, or removed entirely by fully unrolling the loop. The disadvantage of loop unrolling is increased code size. If the ARM compiler is set to -O3 -Otime, loops are unrolled automatically.
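To make the idea concrete, here is a minimal sketch of manual unrolling by a factor of four. The function names and the unroll factor are chosen only for illustration. The unrolled version updates and tests the counter once per four elements, at the cost of a larger loop body plus a small clean-up loop for the leftover elements.

```c
#include <stddef.h>
#include <stdint.h>

/* Rolled form: one counter update and one loop test per element. */
uint32_t sum_rolled(const uint32_t *data, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += data[i];
    }
    return sum;
}

/* Unrolled by four: one counter update and one loop test per four elements. */
uint32_t sum_unrolled4(const uint32_t *data, size_t n)
{
    uint32_t sum = 0;
    size_t i = 0;

    /* Main body: runs while at least four elements remain. */
    for (; n - i >= 4; i += 4) {
        sum += data[i];
        sum += data[i + 1];
        sum += data[i + 2];
        sum += data[i + 3];
    }

    /* Clean-up: handles the last zero to three elements. */
    for (; i < n; i++) {
        sum += data[i];
    }
    return sum;
}
```

If the compiler already unrolls the loop at the chosen optimisation level, the manual version may bring no further gain, so it is worth comparing the two.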

Neon Engine

The Neon engine speeds up signal processing by operating on several data elements in parallel; it is used by enabling vectorisation in the compiler settings. It is a SIMD (Single Instruction, Multiple Data) architecture extension, and it can improve system performance by up to 150%. The Neon engine should be used carefully because it consumes more energy than normal operation. Used correctly, though, it lets tasks finish faster so the CPU can be put into sleep mode sooner, which improves power efficiency. It is suitable for applications that require intensive calculations, such as signal or multimedia processing.
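The kind of loop a vectorising compiler can map onto SIMD hardware looks roughly like the sketch below: a simple element-wise operation with no dependency between iterations. The function name add_f32 is made up for this example. The C99 restrict qualifiers are a promise from the caller that the arrays do not overlap, which makes the loop easier to vectorise. Whether SIMD instructions are actually generated depends on the compiler, its settings and the target, so this is only the shape of the code, not a guarantee.

```c
#include <stddef.h>

/* Element-wise addition: out[i] = a[i] + b[i].
   Each iteration is independent of the others, so several elements can in
   principle be processed by one SIMD instruction. The caller must pass
   non-overlapping arrays, as promised by the restrict qualifiers. */
void add_f32(float *restrict out,
             const float *restrict a,
             const float *restrict b,
             size_t n)
{
    for (size_t i = 0; i < n; i++) {
        out[i] = a[i] + b[i];
    }
}
```

The function behaves the same whether or not the loop is vectorised; only the speed changes.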

16-bit Thumb Instruction Set

ARM processors implement a reduced instruction set computer (RISC) architecture. The advantage of a RISC machine is that it can typically execute one instruction per cycle, which is usually faster than a CISC architecture. The disadvantage is increased code size: several RISC instructions may be needed to do the work of one CISC instruction. To decrease code size, ARM introduced the 16-bit Thumb instruction set alongside the 32-bit ARM instructions. When the Thumb instruction set is used, instructions are stored as 16 bits, but dedicated hardware expands each one into a 32-bit instruction before it is executed.

Power consumption is another design consideration for embedded systems. Reducing the clock frequency decreases the power consumption of the device. Power consumption can also be reduced by shutting down unused peripheral clocks and putting the CPU into sleep mode once the required tasks are complete. If the NEON engine is not needed, it should be disabled, along with any other unused modules. The results I have obtained show that increasing the compiler's optimisation level does not guarantee that the code will be better optimised. Therefore, we should optimise the code ourselves rather than rely on the compiler. I have mentioned only a few of the options for improving speed or memory usage, and there are plenty of other things that can improve system efficiency. For instance, the choice of algorithm may change the results significantly, or a hardware accelerator can boost the system, so the optimum depends on the system being designed.

Top Comments

D_Hersey over 8 years ago

i # 0
is the same as
i

0 is the only non-true value
msimon over 8 years ago in reply to jlangbridge

Thanks jlangbridge for correcting. I copied it from an old post and obviously there is an error. It is only calculating loop condition. I am going to replace it with the ARM documentation. Thanks for catching it and informing.
jlangbridge over 8 years ago

Decrementing loops are indeed faster on ARM chips, but it is interesting to know why. When a calculation is performed, the ARM automatically makes a comparison to zero, and stores that result in the CPSR. Since you are counting down to zero, there is no longer any need to perform the CMP instruction, since an automatic comparison to zero was already performed. However, there is a problem with the assembly output there, I'm not too sure where you got it from. You want the compiler to perform x = x * i, but there is no multiplication, and SUBS is an exception return. You might have the wrong assembly code for the example...