SBC CPU Throughput

I notice that people are doing some initial benchmarking of BBB and other boards on the RPF forum. Results roughly as expected I guess:

Using just a simple

time echo "scale=2000;4*a(1)" | bc -l

as a lightweight benchmark, I see these numbers reported (smaller Time is better):

[table now updated with extra datapoints reported in current thread below]

Submitter	Time (s)	Board	SoC	Clock (MHz)	O/S
shuckle	26.488	Raspberry Pi B	BCM2835	700	Raspbian 3.1.9
morgaine	25.719	Raspberry Pi B	BCM2835	700	Raspbian 3.1.9+ #272
shuckle	25.009	Raspberry Pi B	BCM2835	700	Raspbian 3.2.27
trn	24.280	Raspberry Pi B	BCM2835	700	Raspbian ?
morgaine	22.456	Raspberry Pi B	BCM2835	800	Raspbian 3.1.9+ #272
morgaine	21.256	Raspberry Pi B	BCM2835	800	Raspbian 3.6.11+ #545, new firmware only
selsinork	21.0	Minnowboard	Atom E640T	1000	Angstrom minnow-2013.07.10.img
shuckle	17.0	Raspberry Pi B	BCM2835	1000	Raspbian ?
morgaine	16.153	BB (white)	AM3359	720	Angstrom v2012.01-core 3.2.5+, user-gov
selsinork	15.850	A20-OLinuXino-MICRO	A20	912	Debian 7.0, 3.4.67+
selsinork	15.328	Cubieboard	A20	912	Ubuntu/Debian 7.1
pluggy	14.510	BBB	AM3359	1000	Debian
morgaine	14.153	BBB	AM3359	1000	Debian 7.0, 3.8.13-bone20, perf-gov
selsinork	13.927	A10-OLinuXino-LIME	A10	1000	Debian 7.0, 3.4.67+
Heydt	13.159	Cubieboard	A10	1000	?
selsinork	12.8	Sabre-lite	i.MX6	1000	Debian armhf
selsinork	12.752	Cubieboard	A20	912	Ubuntu/Debian 7.1 + Angstrom bc
selsinork	12.090	BBB	AM3359	1000	Angstrom dmnd-gov
pluggy	11.923	BBB	AM3359	1000	Angstrom
selsinork	11.86	BBB	AM3359	1000	Angstrom perf-gov
selsinork	9.7	Sabre-lite	i.MX6	1000	Debian armhf + Angstrom bc
selsinork	9.606	Sabre-lite	i.MX6	1000	LFS 3.12, gcc-4.8.2, glibc-2.18

As usual, take benchmarks with a truckload of salt, and evaluate with a suitable mixture of suspicion, snoring, and mirth. Use the numbers wisely, and don't draw inappropriate conclusions.
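
One way to add a few grains of that salt is to repeat the run and look at the spread rather than trusting a single figure. Below is a minimal sketch, assuming Python 3 and `bc` are installed on the board. The helper names `time_command` and `summarize` are made up for illustration, and the sketch measures wall-clock time only, which is probably, though not certainly, what the table above reports.

```python
import statistics
import subprocess
import time

# The one-liner from the post above; it needs `bc` installed to run.
BC_COMMAND = 'echo "scale=2000;4*a(1)" | bc -l'


def time_command(command, runs=5):
    """Run a shell command `runs` times; return each wall-clock time in seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        # check=True raises CalledProcessError if the command fails,
        # so a missing `bc` is not silently timed as a fast run.
        subprocess.run(command, shell=True, check=True,
                       stdout=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings):
    """Return (fastest, median, slowest) for a list of timings."""
    return min(timings), statistics.median(timings), max(timings)
```

Calling `summarize(time_command(BC_COMMAND))` gives the fastest, median and slowest of five runs. A wide gap between fastest and slowest hints that something else (governor changes, background tasks) is interfering.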

Morgaine.

morgaine over 12 years ago in reply to morgaine

Nice little graphic of ARM family evolution:

It doesn't seem to normalize for core counts though, but shows the aggregate throughput of all cores on a SoC together --- good for marketing but not as clear for our purposes.

Morgaine.

Former Member over 12 years ago in reply to morgaine

The A7 is basically a tweaked A8, the idea being that it's feature compatible with A15 but much lower power. To achieve that, it's still an in-order architecture, but they made some compromises in order to keep the power consumption low while making the core more easily synthesizable. Remember, the goal of the A7 is more about being the low power part of a big.LITTLE SoC with an A15, the target being a background task processor for something like a smartphone where you don't need a power hungry core eating your battery when you're not actively using it.
There's an interesting discussion here: http://www.anandtech.com/show/4991/arms-cortex-a7-bringing-cheaper-dualcore-more-power-efficient-highend-devices
I'm not normally a reader of AnandTech, and you'll need to ignore the fanboys from both sides in the comments, but there's a reasonable explanation of some of the reasons A7 is slower in some areas than A8.

Former Member over 12 years ago in reply to morgaine

Morgaine Dinova wrote:

It doesn't seem to normalize for core counts though, but shows the aggregate throughput of all cores on a SoC together --- good for marketing but not as clear for our purposes.

Yep, definitely a marketing slide. Performance of the A15 graphed against power consumption of the A7 certainly makes it look good. Once you make the jump from fairly simple in-order execution to complex out-of-order there's a penalty in the form of increased power. Intel discovered that in the Prescott days and dropped back to a simpler architecture starting with Core.
While Arm is ahead on power consumption, that's at the cost of performance. As they start aiming for increased performance and x86 territory, the power consumption will have to increase as well. That's not to say they can't get similar performance with lower power than Intel can today, but who knows what Intel will be doing by then.

That said, I'm finding the A9 powered i.MX6 to be far more performant than I'd expected. Perhaps having my first proper interaction with Arm being the ARM11 based RPi has left me with expectations that are too low.

johnbeetem over 12 years ago in reply to Former Member

selsinork wrote:

Morgaine Dinova wrote:

It doesn't seem to normalize for core counts though, but shows the aggregate throughput of all cores on a SoC together --- good for marketing but not as clear for our purposes.

Yep, definitely a marketing slide. Performance of the A15 graphed against power consumption of the A7 certainly makes it look good. Once you make the jump from fairly simple in-order execution to complex out-of-order there's a penalty in the form of increased power. Intel discovered that in the Prescott days and dropped back to a simpler architecture starting with Core.

While Arm is ahead on power consumption, that's at the cost of performance. As they start aiming for increased performance and x86 territory, the power consumption will have to increase as well. That's not to say they can't get similar performance with lower power than Intel can today, but who knows what Intel will be doing by then.

Yes, that's a very pretty slide Morgaine posted. But we know we have to be careful with the term "peak performance", which a clever wag once defined as "a guarantee from the manufacturer that you can't go faster than this".

Speaking of clever aphorisms, you may have heard Hamming's quote: "The purpose of computation is insight, not numbers". Gio Wiederhold transformed this into: "The number of computations without purpose is out of sight." Well, that's what you get with out-of-order speculative execution: you know a bunch of results will get discarded and the power expended to compute them will be wasted. Whenever I see an implementation with long pipelines I think about all those flushes and all those electrons rushing from ground to Vdd acting like a big switched-capacitor resistor. And for what purpose? So that wasteful software has acceptable performance (sigh).

morgaine over 12 years ago in reply to johnbeetem

John Beetem wrote:

Well, that's what you get with out-of-order speculative execution: you know a bunch of results will get discarded and the power expended to compute them will be wasted. Whenever I see an implementation with long pipelines I think about all those flushes and all those electrons rushing from ground to Vdd acting like a big switched-capacitor resistor. And for what purpose? So that wasteful software has acceptable performance (sigh).

So true. And software is often wasteful even in places where we don't usually expect it, simply because you can't normally satisfy a wide range of requirements at the same time equally well. Here's an example.

Desktop machines already surpassed (a few years ago) the power they need to implement the desktop metaphor perfectly in respect of performance, meaning that normal "metaphoric paper" operations such as organizing documents and folders are perceptually instantaneous. (More power is needed only by algorithmic operations such as searching, which typically aren't yet perceptually instantaneous.)

And yet, the dumb software merrily runs the CPUs turned up to 11 in order to respond in 100 microseconds instead of in the few milliseconds required by frame rate and our perceptions. It's a waste of power because the efficiency/clock-rate curve is not linear (it's less efficient at higher speeds), and the time saved by doing something faster can't be turned into energy saving by idling sooner because switching to idle is not an instantaneous operation. The combination of these two factors means that the fast CPUs of the last several years waste energy for the bulk of what we do on the desktop.
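
The first of those two factors can be put into rough numbers with the standard approximation that each clock cycle costs about C × V² of dynamic switching energy. The sketch below is a toy model only: the operating points, cycle count and capacitance are hypothetical values picked for illustration, not taken from any datasheet, and it ignores leakage and the cost of entering and leaving idle.

```python
def energy_per_task(cycles, switched_capacitance, volts):
    """Dynamic switching energy in joules, assuming each cycle costs C * V^2."""
    return cycles * switched_capacitance * volts ** 2


# HYPOTHETICAL operating points: (clock in MHz, core voltage in volts).
OPERATING_POINTS = [(500, 0.90), (1000, 1.10), (2000, 1.30)]
TASK_CYCLES = 200_000   # hypothetical work for one desktop action
CAPACITANCE = 1e-9      # hypothetical effective switched capacitance, farads

for clock_mhz, volts in OPERATING_POINTS:
    # cycles divided by cycles-per-microsecond gives latency in microseconds
    latency_us = TASK_CYCLES / clock_mhz
    energy_uj = energy_per_task(TASK_CYCLES, CAPACITANCE, volts) * 1e6
    print(f"{clock_mhz:5d} MHz at {volts:.2f} V: "
          f"{latency_us:6.1f} us, {energy_uj:6.1f} uJ")
```

In this model the same work costs more energy at the faster operating point purely because the faster clock needs a higher voltage, while every latency shown is still far below a frame time.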

Morgaine.

Former Member over 12 years ago

Ok, so some more results for Allwinner based boards. This is all with my minimal LFS based armhf userspace and an essentially identical 3.4.67 kernel from sunxi. I've rebuilt u-boot & the kernel with the latest versions. The Cubieboard2 and OLinuXino-A20-Micro use exactly the same kernel but with their own script.fex/script.bin.

A10-Lime 11.742s
A20-OLinuXino-Micro 12.145s
A20-Cubieboard2 12.147s

So there are some reasonable improvements on the previous Debian based numbers, and they're around what we expected. The Cortex-A7 vs Cortex-A8 differential we discussed is easily visible and can only really be explained by architecture and clock speed differences, as everything else is effectively identical.
Next step is to put the same code onto a BBB, but I'm not really expecting any difference compared to the A10 based LIME.
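
To separate clock speed from architecture, a crude approach is to multiply each time by the clock rate, giving an approximate cycle count for the run. This is only a sketch: the clock values are assumed from the table at the top of the thread (1000 MHz for the A10, 912 MHz for the A20) and may not match what these rebuilt kernels actually ran at.

```python
# (board, seconds reported above, clock in MHz ASSUMED from the earlier table)
RESULTS = [
    ("A10-Lime", 11.742, 1000),
    ("A20-OLinuXino-Micro", 12.145, 912),
    ("A20-Cubieboard2", 12.147, 912),
]


def megacycles(seconds, clock_mhz):
    """Approximate clock cycles spent on the run, in millions."""
    return seconds * clock_mhz


for board, seconds, clock_mhz in RESULTS:
    print(f"{board:22s} {seconds:7.3f} s  "
          f"~{megacycles(seconds, clock_mhz):8.0f} Mcycles")
```

If those assumed clocks hold, the three cycle counts land within about 6% of each other, so it would be worth confirming the real clock rates before reading much architecture into the raw times.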

Former Member over 12 years ago

Our tests here have been limited to a single, very specific and easy to run test, so they only show one aspect of performance.

For the Allwinner devices, there are some results at http://sunxi.org/Benchmarks and it's interesting to compare the openssl speed numbers between A10 & A20 and note that the A10 is significantly faster (and I've confirmed similar numbers between the A10 lime and the A20 Micro).
Then compare the A10 & A20 Linpack where the A20 seems significantly faster. Of course the comparison isn't quite fair: from the compile options, they appear to be comparing NEON on the A10 against VFPv4 on the A20, and the A10 doesn't have VFPv4.

However, having tried identical linpack binaries on both A10 & A20 I'm getting results that suggest the A20 is approx 3x faster for these floating point operations and it doesn't seem to matter if I use VFPv3 or NEON on both, the A20 still outperforms the A10.

The Cortex-A9 based i.MX6 still beats both A10 & A20, but the A20 is surprisingly close: approx 120000 KFLOPS for the A20 compared to approx 150000 KFLOPS for the i.MX6 when using VFPv3 or NEON, and the A20 manages approx 145000 KFLOPS with VFPv4. Some of this will be down to raw clock speed difference: 912MHz for the A20, 996MHz for the i.MX6.
So it seems that in order to gain feature parity with the Cortex-A15, the Cortex-A7 has been gifted the newer floating point unit in its entirety. Bonus if you're doing floating point stuff on the A20.
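
To see how much of that gap is down to clock speed, the approximate Linpack figures quoted above can be divided by the clock rates given in the same post. This is a rough sketch only; KFLOPS per MHz is a crude proxy that ignores memory speed and everything else that differs between the boards.

```python
# (configuration, approx KFLOPS from the post, clock in MHz from the post)
LINPACK = [
    ("A20, VFPv3 or NEON", 120000, 912),
    ("A20, VFPv4", 145000, 912),
    ("i.MX6, VFPv3 or NEON", 150000, 996),
]


def kflops_per_mhz(kflops, clock_mhz):
    """Linpack throughput normalized by clock rate."""
    return kflops / clock_mhz


for label, kflops, clock_mhz in LINPACK:
    print(f"{label:22s} {kflops_per_mhz(kflops, clock_mhz):6.1f} KFLOPS/MHz")
```

On these approximate figures, the A20 with VFPv4 comes out slightly ahead of the i.MX6 per MHz, while with VFPv3 or NEON it stays behind.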

Former Member over 12 years ago

A20 Cubietruck
12.149s

No real surprise that it's very similar to other boards with the A20 chip.