SBC CPU Throughput

I notice that people are doing some initial benchmarking of BBB and other boards on the RPF forum. Results roughly as expected I guess:

Using just a simple

time echo "scale=2000;4*a(1)" | bc -l

as a lightweight benchmark, I see these numbers reported (smaller Time is better):

[table now updated with extra datapoints reported in current thread below]

Submitter	Time (s)	Board	SoC	Clock (MHz)	O/S
shuckle	26.488	Raspberry Pi B	BCM2835	700	Raspbian 3.1.9
morgaine	25.719	Raspberry Pi B	BCM2835	700	Raspbian 3.1.9+ #272
shuckle	25.009	Raspberry Pi B	BCM2835	700	Raspbian 3.2.27
trn	24.280	Raspberry Pi B	BCM2835	700	Raspbian ?
morgaine	22.456	Raspberry Pi B	BCM2835	800	Raspbian 3.1.9+ #272
morgaine	21.256	Raspberry Pi B	BCM2835	800	Raspbian 3.6.11+ #545, new firmware only
selsinork	21.0	Minnowboard	Atom E640T	1000	Angstrom minnow-2013.07.10.img
shuckle	17.0	Raspberry Pi B	BCM2835	1000	Raspbian ?
morgaine	16.153	BB (white)	AM3359	720	Angstrom v2012.01-core 3.2.5+, user-gov
selsinork	15.850	A20-OLinuXino-MICRO	A20	912	Debian 7.0, 3.4.67+
selsinork	15.328	Cubieboard	A20	912	Ubuntu/Debian 7.1
pluggy	14.510	BBB	AM3359	1000	Debian
morgaine	14.153	BBB	AM3359	1000	Debian 7.0, 3.8.13-bone20, perf-gov
selsinork	13.927	A10-OLinuXino-LIME	A10	1000	Debian 7.0, 3.4.67+
Heydt	13.159	Cubieboard	A10	1000	?
selsinork	12.8	Sabre-lite	i.MX6	1000	Debian armhf
selsinork	12.752	Cubieboard	A20	912	Ubuntu/Debian 7.1 + Angstrom bc
selsinork	12.090	BBB	AM3359	1000	Angstrom dmnd-gov
pluggy	11.923	BBB	AM3359	1000	Angstrom
selsinork	11.86	BBB	AM3359	1000	Angstrom perf-gov
selsinork	9.7	Sabre-lite	i.MX6	1000	Debian armhf + Angstrom bc
selsinork	9.606	Sabre-lite	i.MX6	1000	LFS 3.12, gcc-4.8.2, glibc-2.18
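
Since the clocks in the table range from 700 to 1000 MHz, one rough way to compare rows is to scale each time to a common 1000 MHz clock. This is only a sketch: `per_clock` is a made-up helper name, the board labels are shortened by hand, and it assumes run time scales linearly with clock, which is approximate at best (the two morgaine Raspbian 3.1.9+ rows at 700 and 800 MHz do come out close, at about 18.0 each).

```shell
# Hypothetical helper: read "time_s clock_MHz label" lines and print the
# time scaled to a 1000 MHz clock (time * clock / 1000), fastest first.
per_clock() {
    awk '{ printf "%.3f %s\n", $1 * $2 / 1000, $3 }' | sort -n
}

# A few rows copied from the table above (labels abbreviated).
per_clock <<'EOF'
25.719 700 RPi-B/Raspbian-3.1.9+
22.456 800 RPi-B/Raspbian-3.1.9+
14.153 1000 BBB/Debian
11.86 1000 BBB/Angstrom
9.606 1000 Sabre-lite/LFS
EOF
```

Lower is still better. The scaled figure says nothing about price, memory, or I/O, so the usual cautions apply to it as well.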

As usual, take benchmarks with a truckload of salt, and evaluate with a suitable mixture of suspicion, snoring, and mirth. Use the numbers wisely, and don't draw inappropriate conclusions.
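
One practical way to apply that salt is to repeat the run a few times and look at the spread rather than at a single figure. A minimal sketch, assuming bash and bc are installed; `bench_once` and `summarize_times` are hypothetical helper names, not anything used for the reported numbers:

```shell
# Hypothetical helper: one timed run of the benchmark, prints elapsed
# seconds on stdout. Uses bash's `time` keyword with TIMEFORMAT=%R so
# that only the real (wall-clock) time is printed.
bench_once() {
    bash -c 'TIMEFORMAT=%R; time (echo "scale=2000;4*a(1)" | bc -l >/dev/null)' 2>&1
}

# Hypothetical helper: read one time per line, print min, median and max.
summarize_times() {
    sort -n | awk '
        { t[NR] = $1 }
        END {
            if (NR == 0) exit 1
            mid = (NR % 2) ? t[(NR + 1) / 2] : (t[NR / 2] + t[NR / 2 + 1]) / 2
            printf "runs=%d min=%.3f median=%.3f max=%.3f\n", NR, t[1], mid, t[NR]
        }'
}

# Usage: five runs, then one summary line.
# for i in 1 2 3 4 5; do bench_once; done | summarize_times
```

If min and max differ by more than a few percent, something else on the board (governor, background daemons) is probably interfering with the measurement.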

Morgaine.

mconners over 12 years ago

Well, this has been interesting. (not really)
Here is what I liked about the data.

1) It cost me nothing
2) It confirmed what I expected

If the data had said that given the problem in question the BBB took twice as long to compute the answer as the RasPi, it may have given me pause and perhaps prompted me to investigate further using other more appropriate benchmarking techniques.

So while at first blush it may come across as "Move along, nothing to see here" data, I find that type of data can be useful as well. Especially when it is free.

Mike

morgaine over 12 years ago in reply to mconners

Michael Conners wrote:

So while at first blush it may come across as "Move along, nothing to see here" data, I find that type of data can be useful as well. Especially when it is free.

Exactly. Perfectly good data, perfectly usable in the correct context, and the usual cautions given to discourage people from deriving inappropriate conclusions.

Why anyone would want to vent their ire over this simple gathering of useful data is hard to see, but it clearly hasn't been aimed at being helpful to the forum.

mcb1 over 12 years ago in reply to morgaine

Guys/Gals
I learnt a lot from this .....um ....discussion.

Not from the bickering about benchmarks, but from coder's detailed explanation of how some of the inner workings applied.

I note that in the tests for various computers in magazines, they have a number of 'real use' benchmarks, and the results vary, so I've always regarded benchmarks with some concern.

While I can understand the passion about the subject, I think the short single responses were far less threatening (for us on the observation deck) than a long 'attack' on what seemed like each line.

I would also like to point out that not everyone understands the inner workings, nor do they need to in order to have it actually do something.
You guys all drive cars (I presume), but how many of you know the inner workings of the engine, down to the nuts and bolts and their interaction to make it go?
For most there are some pedals that have to be operated in some form of order, a thing in the middle you shift around, and fuel that is always running out.

Morgaine
I wonder if a slightly differently worded caution about the figures (for the less knowledgeable ones who might read it) might satisfy the masses.
Something from Dab's comment maybe:
True performance is always dependent upon the compiler, application structure, operating system, I/O drivers and many other details.

Raw power is seldom the only answer needed for a good system design.

What is the next ... discussion ...

Mark

Former Member over 12 years ago in reply to mcb1

Mark,
   I'm happy to see the discussion was useful!

   With regard to a differently worded caution, I think your suggestion misses the point.
This benchmark is a decent benchmark, not just a "poor-man's benchmark" as Morgaine
originally described it, and it produced results which demonstrated a clear winner
in price/performance between BBB and RPi, within its domain of applicability,
which is integer compute bound. The warning should warn against extrapolating
the results beyond the domain of applicability, but not much more than that.

Prior to the benchmark results, there were legitimate concerns raised about
things like memory bandwidth on a 16-bit bus, but these concerns have been
laid to rest.   Similarly, there were concerns raised about the supposed
triviality of the test, and the domain of the test, but I think those concerns
have been shown to be false.

In any benchmarking activity, there will always be claims that the results might
have turned out differently if a different compiler design or application structure
or operating system or I/O drivers, etc., had been used. But in a result this
decisive, those concerns ring hollow.

Former Member over 12 years ago in reply to Former Member

coder27 wrote:

In any benchmarking activity, there will always be claims that the results might
have turned out differently if a different compiler design or application structure

I'm still trying to get to the bottom of the difference between debian & angstrom on the same hardware

The obvious difference is that Angstrom has built everything with vfpv3 and neonv1 support where debian has vfpv3-d16 and no neon. Since I don't know the insides of bc at all I have no way of knowing if this is even relevant.

readelf -A /usr/bin/bc will show arch specific details
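
To compare two builds of bc side by side, the attribute lines can be filtered and diffed. A rough sketch, assuming binutils' `readelf` is present on the board; `attr_lines` is a hypothetical helper name and the two paths in the usage comment are placeholders:

```shell
# Hypothetical helper: keep only the "Tag_..." build-attribute lines from
# `readelf -A` output on stdin, strip indentation, and sort them so that
# two binaries can be compared with diff.
attr_lines() {
    grep 'Tag_' | sed 's/^[[:space:]]*//' | LC_ALL=C sort
}

# Usage (paths are placeholders for a Debian and an Angstrom copy of bc):
# readelf -A /path/to/debian/bc   | attr_lines > debian.attrs
# readelf -A /path/to/angstrom/bc | attr_lines > angstrom.attrs
# diff debian.attrs angstrom.attrs
```

Given the vfpv3-d16 versus vfpv3 plus NEON difference described above, the diff would presumably show up in lines such as Tag_FP_arch and Tag_Advanced_SIMD_arch, though that alone doesn't say whether bc's hot loop uses any of it.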

Former Member over 12 years ago in reply to Former Member

Just a guess, but Debian shared libraries are normally compiled with -fPIC
(for Position Independent Code), and maybe Angstrom doesn't do that.

Former Member over 12 years ago in reply to Former Member

Possibly, but short of disassembling the code I'm not sure how to tell.

One other possibility is that the Debian version is compiled against readline, which also pulls in ncurses, while the Angstrom one isn't. So possibly there's a penalty for scanning through directories to find the right termcap files.

Unfortunately Debian have screwed with the ABI: they have ld-linux buried under /lib/arm-linux-gnueabihf instead of simply /lib, making it hard to take a binary from Angstrom and run it on Debian.

Anyway, having persuaded the angstrom binary to run on debian on the sabre-lite, it runs in 9.7s vs 12.8 for the debian native version.
It's still using the debian glibc, which probably points more to readline/ncurses rather than -fPIC
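
Both points (what a binary links against, and which loader path it asks for) can be read out of the ELF headers without disassembling anything. A small sketch, assuming `readelf` is available; `needed_libs` and `elf_interp` are hypothetical helper names:

```shell
# Hypothetical helper: list the shared libraries a binary depends on,
# given `readelf -d` output on stdin (the "(NEEDED)" entries).
needed_libs() {
    sed -n 's/.*(NEEDED).*\[\(.*\)\].*/\1/p'
}

# Hypothetical helper: print the dynamic loader path a binary requests,
# given `readelf -l` output on stdin.
elf_interp() {
    sed -n 's/.*\[Requesting program interpreter: \(.*\)\].*/\1/p'
}

# Usage:
# readelf -d /usr/bin/bc | needed_libs    # is libreadline / libncurses listed?
# readelf -l /usr/bin/bc | elf_interp     # which ld-linux path is expected?
```

When the requested interpreter path doesn't exist on the host, glibc's loader can also be run directly with the program as an argument and a `--library-path` option. That is one possible way to persuade a foreign binary to run, not necessarily the one used for the 9.7s figure.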

Former Member over 12 years ago in reply to Former Member

I think Debian is compiled for ARMv4. Angstrom might be compiled for ARMv7,
which may account for the difference. Angstrom might also be compiled with a
later version of GCC than Debian. I'm pretty sure the vector floating point (vfp)
and neon are not relevant factors.

Edited to add: Looking again at your table, I see Debian listed as ARMHF,
which rules out ARMv4, so maybe that isn't the difference.

morgaine over 12 years ago in reply to Former Member

selsinork wrote:

Anyway, having persuaded the angstrom binary to run on debian on the sabre-lite, it runs in 9.7s vs 12.8 for the debian native version.

I added your new data point to the table.

This is really cool, I think it might lead to some interesting analysis and end up improving our understanding of the figures.

Former Member over 12 years ago in reply to Former Member

coder27 wrote:

I think Debian is compiled for ARMv4. Angstrom might be compiled for ARMv7,
which may account for the difference. Angstrom might also be compiled with a
later version of GCC than Debian. I'm pretty sure the vector floating point (vfp)
and neon are not relevant factors.

Edited to add: Looking again at your table, I see Debian listed as ARMHF,
which rules out ARMv4, so maybe that isn't the difference.

The debian gcc was built with these options
--with-arch=armv7-a --with-fpu=vfpv3-d16 --with-float=hard
gcc version 4.7.2 (Debian 4.7.2-4)

the angstrom one is slightly newer and is a Linaro build gcc version 4.7.3 20130205 (prerelease) (Linaro GCC 4.7-2013.02-01)
however it was cross compiled and the particular arch settings don't appear in the output of gcc -v
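
For anyone repeating this on another distro, the target defaults can be pulled out of the `gcc -v` configure line when they are present. A quick sketch; `gcc_target_opts` is a hypothetical helper name and the list of option names it looks for is my own choice:

```shell
# Hypothetical helper: read `gcc -v` output on stdin and print only the
# configure-time target defaults (--with-arch, --with-fpu, and so on).
gcc_target_opts() {
    tr ' ' '\n' | grep -E '^--with-(arch|cpu|fpu|float|mode)='
}

# Usage (gcc -v writes to stderr, hence the redirect):
# gcc -v 2>&1 | gcc_target_opts
```

Where the configure line is uninformative, as with the cross-compiled toolchain here, `gcc -Q --help=target` run with that compiler should list the effective defaults such as -march and -mfpu instead.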

Since both the BBB and Sabre-Lite are armv7 it seems that it'll be possible to get the same rootfs to run on either, so I should be able to put the debian armhf that I have on the SL onto the BBB and the BBB's angstrom onto the SL.
I'll give it a try and let you know how I get on

johnbeetem over 12 years ago in reply to Former Member

selsinork wrote:

I'm still trying to get to the bottom of the difference between debian & angstrom on the same hardware

Any chance they're using different amounts of L2 cache? From a quick read, it looks like the Sabre-Lite has 1MB L2 cache which the application should easily fit into, but maybe Ångström only allocates 256KB?

General principle: benchmarks that fit into on-chip cache don't test external memory performance

Former Member over 12 years ago in reply to johnbeetem

Well yes, cache can be a factor. IME usually about the only thing you can do with cache is turn it on or off, beyond that the cache architecture will govern what gets used and how.
According to the datasheet the AM3359 only has 256KB, so as you say if it fits in that then it should fit in the iMX6's 1MB. However, we're talking A8 vs A9, and from different implementers. Chances of the cache architecture being somehow different are good.
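
Where the running kernel exposes cache details through sysfs, the sizes it sees can be listed rather than taken from the datasheet. A sketch only: `list_caches` is a hypothetical helper name, and older ARM kernels may not populate this directory at all, in which case it prints nothing.

```shell
# Hypothetical helper: print level, type and size of each cache that the
# kernel reports for cpu0. The optional argument overrides the sysfs
# directory, which is mainly useful for testing.
list_caches() {
    base=${1:-/sys/devices/system/cpu/cpu0/cache}
    for d in "$base"/index*; do
        [ -r "$d/size" ] || continue
        printf 'L%s %s %s\n' "$(cat "$d/level")" "$(cat "$d/type")" "$(cat "$d/size")"
    done
}

# Usage:
# list_caches
```

This only reports what the hardware has, not how much of it a given program ends up using.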

So yes, maybe it's simply a case of debian being built with every possible knob turned on and therefore suffering from bloat and excessive cache misses.
Angstrom being the roadrunner to debian's coyote

Now all we need is to get Acme Corporation to start building BBB and all our problems will be solved

morgaine over 12 years ago in reply to Former Member

selsinork wrote:

Well yes, cache can be a factor. IME usually about the only thing you can do with cache is turn it on or off, beyond that the cache architecture will govern what gets used and how.

Turning off the on-chip cache would eliminate that variable from any tests that want to measure external memory throughput. If it's not feasible during normal running, it might still be possible as a kernel boot option, which would be good enough for testing even if not all that convenient. I haven't come across such a kernel option yet, but it wouldn't surprise me if it were available among the CPU options.

Inevitably turning off the cache will be architecture-dependent, although that's not really an obstacle when the entire point is to make a device-dependent measurement. A quick web search shows that 32-bit Intel CPUs allow you to turn cache off by setting bit 30 (the CD, or cache disable, flag) of control register cr0. People have done this by loading a kernel module to do the operation, apparently with success judging by the large slowdown.
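
For reference, bit 30 corresponds to the mask 0x40000000. The snippet below only decodes a cr0 value that has already been obtained by some other means; it does not touch the register, and `cr0_cache_disabled` is a hypothetical helper name.

```shell
# Mask for bit 30 of cr0.
CR0_CD_MASK=$((1 << 30))

# Hypothetical helper: succeed if bit 30 is set in the given cr0 value
# (decimal or 0x-prefixed hex).
cr0_cache_disabled() {
    [ $(( $1 & $CR0_CD_MASK )) -ne 0 ]
}

printf 'CD mask: 0x%08x\n' "$CR0_CD_MASK"
```

Actually setting the bit needs ring 0 code, which is why a kernel module was the route people took.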

It doesn't directly address our interest though, which is to measure cached throughput identically on two different devices which may have different cache sizes and may not be caching a given measurement program in the same way. That's substantially harder.
