SBC CPU Throughput

I notice that people are doing some initial benchmarking of BBB and other boards on the RPF forum. Results roughly as expected I guess:

Using just a simple

time echo "scale=2000;4*a(1)" | bc -l

as a lightweight benchmark, I see these numbers reported (smaller Time is better):

[table now updated with extra datapoints reported in current thread below]

Submitter	Time (s)	Board	SoC	Clock (MHz)	O/S
shuckle	26.488	Raspberry Pi B	BCM2835	700	Raspbian 3.1.9
morgaine	25.719	Raspberry Pi B	BCM2835	700	Raspbian 3.1.9+ #272
shuckle	25.009	Raspberry Pi B	BCM2835	700	Raspbian 3.2.27
trn	24.280	Raspberry Pi B	BCM2835	700	Raspbian ?
morgaine	22.456	Raspberry Pi B	BCM2835	800	Raspbian 3.1.9+ #272
morgaine	21.256	Raspberry Pi B	BCM2835	800	Raspbian 3.6.11+ #545, new firmware only
selsinork	21.0	Minnowboard	Atom E640T	1000	Angstrom minnow-2013.07.10.img
shuckle	17.0	Raspberry Pi B	BCM2835	1000	Raspbian ?
morgaine	16.153	BB (white)	AM3359	720	Angstrom v2012.01-core 3.2.5+, user-gov
selsinork	15.850	A20-OLinuXino-MICRO	A20	912	Debian 7.0, 3.4.67+
selsinork	15.328	Cubieboard	A20	912	Ubuntu/Debian 7.1
pluggy	14.510	BBB	AM3359	1000	Debian
morgaine	14.153	BBB	AM3359	1000	Debian 7.0, 3.8.13-bone20, perf-gov
selsinork	13.927	A10-OLinuXino-LIME	A10	1000	Debian 7.0, 3.4.67+
Heydt	13.159	Cubieboard	A10	1000	?
selsinork	12.8	Sabre-lite	i.MX6	1000	Debian armhf
selsinork	12.752	Cubieboard	A20	912	Ubuntu/Debian 7.1 + Angstrom bc
selsinork	12.090	BBB	AM3359	1000	Angstrom dmnd-gov
pluggy	11.923	BBB	AM3359	1000	Angstrom
selsinork	11.86	BBB	AM3359	1000	Angstrom perf-gov
selsinork	9.7	Sabre-lite	i.MX6	1000	Debian armhf + Angstrom bc
selsinork	9.606	Sabre-lite	i.MX6	1000	LFS 3.12, gcc-4.8.2, glibc-2.18

As usual, take benchmarks with a truckload of salt, and evaluate with a suitable mixture of suspicion, snoring, and mirth. Use the numbers wisely, and don't draw inappropriate conclusions.
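
For anyone who wants to repeat the measurement a few times rather than trust a single run, here is a minimal Python sketch. It is not how any of the numbers in the table were collected (those came from the shell `time` command shown above). The names `time_command` and `BC_INPUT` are made up for illustration, and it measures wall-clock time only, roughly what `time` reports as "real".

```python
import shutil
import statistics
import subprocess
import time

# The benchmark from the original post: pi to 2000 places via 4*atan(1).
BC_INPUT = "scale=2000;4*a(1)\n"


def time_command(argv, stdin_text="", runs=3):
    """Return wall-clock seconds for each of `runs` executions of argv."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        # Output is discarded; only the elapsed time matters here.
        subprocess.run(argv, input=stdin_text, text=True,
                       stdout=subprocess.DEVNULL, check=True)
        timings.append(time.perf_counter() - start)
    return timings


if __name__ == "__main__":
    if shutil.which("bc") is None:
        print("bc not found on PATH; nothing to time")
    else:
        timings = time_command(["bc", "-l"], BC_INPUT)
        print("runs (s):", ", ".join(f"{t:.3f}" for t in timings))
        print(f"median (s): {statistics.median(timings):.3f}")
```

Reporting the median of a few runs helps a little with one-off noise, but it does nothing about the bigger variables in the table (governor, kernel, and which bc binary is installed).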

Morgaine.

Former Member over 12 years ago

Olimex A20-OLinuXino-MICRO Debian 7.0 from the downloadable sdcard image on their website.
912MHz A20, 3.4.67+, 15.850s

A10-OLinuXino-LIME, Debian 7.0 from the LIME downloadable image
1000MHz A10, 3.4.67+, 13.927s

I'm guessing from previous results that either the Angstrom version of bc or my self-compiled one will improve these somewhat. (I have some recompiling to do before I can test that theory.)

SATA throughput on both is >100MB/s, or as fast as my test drive can provide.

morgaine over 12 years ago in reply to Former Member

Great, thanks for those measurements, both added to table!

I notice that the Cortex-A7 seems to be slower than Cortex-A8 on this benchmark generally, after normalizing the figures to account for varying clock speeds and Angstrom bc binaries. Is this something for which we have a solid explanation yet, or do we need to start theorizing?
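
One simple way to do the clock normalization is to multiply each time by the clock rate, which gives the number of clock cycles the run took. The sketch below does only that arithmetic on a few rows copied from the table at the top of the thread. The core-type column is an annotation added here (A20 is Cortex-A7; A10 and AM3359 are Cortex-A8), and `gigacycles` is just an illustrative name. It does not separate out OS image or bc build, so rows should only be compared where the note matches.

```python
# (board, core, clock_mhz, seconds, note) -- times and clocks are copied from
# the results table; the core column is an added annotation, not reported data.
RESULTS = [
    ("A20-OLinuXino-MICRO", "Cortex-A7", 912, 15.850, "Debian 7.0"),
    ("Cubieboard (A20)", "Cortex-A7", 912, 15.328, "Ubuntu/Debian 7.1"),
    ("Cubieboard (A20)", "Cortex-A7", 912, 12.752, "Ubuntu/Debian 7.1 + Angstrom bc"),
    ("A10-OLinuXino-LIME", "Cortex-A8", 1000, 13.927, "Debian 7.0"),
    ("BBB (AM3359)", "Cortex-A8", 1000, 14.153, "Debian 7.0, perf-gov"),
    ("BBB (AM3359)", "Cortex-A8", 1000, 11.86, "Angstrom perf-gov"),
]


def gigacycles(clock_mhz, seconds):
    """Clock cycles spent on the run, in units of 1e9 cycles."""
    return clock_mhz * 1e6 * seconds / 1e9


for board, core, mhz, secs, note in RESULTS:
    print(f"{board:22s} {core:10s} {gigacycles(mhz, secs):6.2f} Gcycles  ({note})")
```

Fewer cycles means more work done per clock. This assumes the boards really ran at their nominal clock for the whole test, which depends on the governor.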

Morgaine.

morgaine over 12 years ago in reply to morgaine

Nice little graphic of ARM family evolution:

It doesn't seem to normalize for core counts, though; it shows the aggregate throughput of all cores on a SoC together, which is good for marketing but not as clear for our purposes.

Morgaine.

Former Member over 12 years ago in reply to morgaine

The A7 is basically a tweaked A8, the idea being that it's feature-compatible with the A15 but much lower power. To achieve that, it's still an in-order architecture, but they made some compromises to keep the power consumption low while making the core more easily synthesizable. Remember, the goal of the A7 is more about being the low-power part of a big.LITTLE SoC with an A15: the target is a background task processor for something like a smartphone, where you don't need a power-hungry core eating your battery when you're not actively using it.
There's an interesting discussion here: http://www.anandtech.com/show/4991/arms-cortex-a7-bringing-cheaper-dualcore-more-power-efficient-highend-devices
I'm not normally a reader of AnandTech, and you'll need to ignore the fanboys from both sides in the comments, but there's a reasonable explanation of some of the reasons the A7 is slower than the A8 in some areas.

Former Member over 12 years ago in reply to morgaine

Morgaine Dinova wrote:

It doesn't seem to normalize for core counts though, but shows the aggregate throughput of all cores on a SoC together --- good for marketing but not as clear for our purposes.

Yep, definitely a marketing slide. Performance of the A15 graphed against power consumption of the A7 certainly makes it look good. Once you make the jump from fairly simple in-order execution to complex out-of-order there's a penalty in the form of increased power. Intel discovered that in the Prescott days and dropped back to a simpler architecture starting with Core.
While Arm is ahead on power consumption, that's at the cost of performance. As they start aiming for increased performance and x86 territory, the power consumption will have to increase as well. That's not to say they can't get similar performance with lower power than Intel can today, but who knows what Intel will be doing by then.

That said, I'm finding the A9-powered i.MX6 to be far more performant than I'd expected. Perhaps having the ARM11-based RPi as my first proper interaction with Arm has left me with expectations that are too low.

johnbeetem over 12 years ago in reply to Former Member

selsinork wrote:

Morgaine Dinova wrote:

It doesn't seem to normalize for core counts though, but shows the aggregate throughput of all cores on a SoC together --- good for marketing but not as clear for our purposes.

Yep, definitely a marketing slide. Performance of the A15 graphed against power consumption of the A7 certainly makes it look good. Once you make the jump from fairly simple in-order execution to complex out-of-order there's a penalty in the form of increased power. Intel discovered that in the Prescott days and dropped back to a simpler architecture starting with Core.

While Arm is ahead on power consumption, that's at the cost of performance. As they start aiming for increased performance and x86 territory, the power consumption will have to increase as well. That's not to say they can't get similar performance with lower power than Intel can today, but who knows what Intel will be doing by then.

Yes, that's a very pretty slide Morgaine posted. But we know we have to be careful with the term "peak performance", which a clever wag once defined as "a guarantee from the manufacturer that you can't go faster than this".

Speaking of clever aphorisms, you may have heard Hamming's quote: "The purpose of computation is insight, not numbers". Gio Wiederhold transformed this into: "The number of computations without purpose is out of sight." Well, that's what you get with out-of-order speculative execution: you know a bunch of results will get discarded and the power expended to compute them will be wasted. Whenever I see an implementation with long pipelines I think about all those flushes and all those electrons rushing from ground to Vdd acting like a big switched-capacitor resistor. And for what purpose? So that wasteful software has acceptable performance (sigh).

morgaine over 12 years ago in reply to johnbeetem

John Beetem wrote:

Well, that's what you get with out-of-order speculative execution: you know a bunch of results will get discarded and the power expended to compute them will be wasted. Whenever I see an implementation with long pipelines I think about all those flushes and all those electrons rushing from ground to Vdd acting like a big switched-capacitor resistor. And for what purpose? So that wasteful software has acceptable performance (sigh).

So true. And software is often wasteful even in places where we don't usually expect it, simply because you can't normally satisfy a wide range of requirements at the same time equally well. Here's an example.

Desktop machines already surpassed (a few years ago) the power they need to implement the desktop metaphor perfectly in respect of performance, meaning that normal "metaphoric paper" operations such as organizing documents and folders are perceptually instantaneous. (More power is needed only by algorithmic operations such as searching, which typically aren't yet perceptually instantaneous.)

And yet, the dumb software merrily runs the CPUs turned up to 11 in order to respond in 100 microseconds instead of in the few milliseconds required by frame rate and our perceptions. It's a waste of power because the efficiency/clock-rate curve is not linear (it's less efficient at higher speeds), and the time saved by doing something faster can't be turned into energy saving by idling sooner because switching to idle is not an instantaneous operation. The combination of these two factors means that the fast CPUs of the last several years waste energy for the bulk of what we do on the desktop.
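
To put rough shape on that argument, here is a toy model in Python. Everything in it is an assumption made for illustration, not a measurement: active power is taken to grow with the cube of clock frequency (a common rule of thumb when voltage is scaled along with frequency), the CPU is assumed to stay at active power for a fixed "linger" time before it actually reaches idle, and the constants `k`, `idle_w`, and `idle_entry_s` are invented.

```python
def frame_energy(freq_ghz, cycles, frame_s, k=1.0, idle_w=0.1, idle_entry_s=0.001):
    """Energy in joules over one frame: do `cycles` of work, then go idle.

    Assumed model (not measured): active power = k * f^3 watts with f in GHz,
    and the CPU keeps burning active power for idle_entry_s after the work
    finishes, because dropping to idle is not instantaneous.
    """
    busy_s = cycles / (freq_ghz * 1e9)
    if busy_s > frame_s:
        raise ValueError("work does not fit in the frame at this clock")
    linger_s = min(idle_entry_s, frame_s - busy_s)
    active_w = k * freq_ghz ** 3
    idle_s = frame_s - busy_s - linger_s
    return active_w * (busy_s + linger_s) + idle_w * idle_s


FRAME_S = 1 / 60          # one display frame at 60 Hz
WORK_CYCLES = 2_000_000   # hypothetical amount of UI work per frame

for ghz in (1.0, 3.0):
    joules = frame_energy(ghz, WORK_CYCLES, FRAME_S)
    print(f"{ghz:.1f} GHz: {joules * 1000:.2f} mJ per frame")
```

With these made-up constants the slower clock finishes well inside the frame and uses far less energy. A model with a large fixed (static) power term could tip the balance the other way, so treat this as an illustration of the reasoning rather than evidence for it.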

Morgaine.