element14 Community
Single-Board Computers > Forum
Forum Thread Details
  • 88 replies
  • 63 subscribers
  • 8509 views
  • Tags: cubieboard, olinuxino, sabrelite, bbb, BeagleBone, rpi

SBC CPU Throughput

morgaine over 12 years ago

I notice that people are doing some initial benchmarking of the BBB and other boards on the RPF forum. Results are roughly as expected, I guess:

 

Using just a simple

 

time echo "scale=2000;4*a(1)" | bc -l

 

as a lightweight benchmark, I see these numbers reported (smaller Time is better):

 

[table now updated with extra datapoints reported in current thread below]

 

Submitter | Time (s) | Board | SoC | Clock (MHz) | O/S
--------- | -------- | ----- | --- | ----------- | ---
shuckle | 26.488 | Raspberry Pi B | BCM2835 | 700 | Raspbian 3.1.9
morgaine | 25.719 | Raspberry Pi B | BCM2835 | 700 | Raspbian 3.1.9+ #272
shuckle | 25.009 | Raspberry Pi B | BCM2835 | 700 | Raspbian 3.2.27
trn | 24.280 | Raspberry Pi B | BCM2835 | 700 | Raspbian ?
morgaine | 22.456 | Raspberry Pi B | BCM2835 | 800 | Raspbian 3.1.9+ #272
morgaine | 21.256 | Raspberry Pi B | BCM2835 | 800 | Raspbian 3.6.11+ #545, new firmware only
selsinork | 21.0 | Minnowboard | Atom E640T | 1000 | Angstrom minnow-2013.07.10.img
shuckle | 17.0 | Raspberry Pi B | BCM2835 | 1000 | Raspbian ?
morgaine | 16.153 | BB (white) | AM3359 | 720 | Angstrom v2012.01-core 3.2.5+, user-gov
selsinork | 15.850 | A20-OLinuXino-MICRO | A20 | 912 | Debian 7.0, 3.4.67+
selsinork | 15.328 | Cubieboard | A20 | 912 | Ubuntu/Debian 7.1
pluggy | 14.510 | BBB | AM3359 | 1000 | Debian
morgaine | 14.153 | BBB | AM3359 | 1000 | Debian 7.0, 3.8.13-bone20, perf-gov
selsinork | 13.927 | A10-OLinuXino-LIME | A10 | 1000 | Debian 7.0, 3.4.67+
Heydt | 13.159 | Cubieboard | A10 | 1000 | ?
selsinork | 12.8 | Sabre-lite | i.MX6 | 1000 | Debian armhf
selsinork | 12.752 | Cubieboard | A20 | 912 | Ubuntu/Debian 7.1 + Angstrom bc
selsinork | 12.090 | BBB | AM3359 | 1000 | Angstrom dmnd-gov
pluggy | 11.923 | BBB | AM3359 | 1000 | Angstrom
selsinork | 11.86 | BBB | AM3359 | 1000 | Angstrom perf-gov
selsinork | 9.7 | Sabre-lite | i.MX6 | 1000 | Debian armhf + Angstrom bc
selsinork | 9.606 | Sabre-lite | i.MX6 | 1000 | LFS 3.12, gcc-4.8.2, glibc-2.18

 

 

As usual, take benchmarks with a truckload of salt, and evaluate with a suitable mixture of suspicion, snoring, and mirth. Use the numbers wisely, and don't draw inappropriate conclusions.

 

Morgaine.


Top Replies

  • Former Member over 12 years ago in reply to gdstew (+2): "floating point doesn't get you 2000 digits."
  • morgaine over 12 years ago in reply to gdstew (+1): "Data is always good, and sharing it is also good. The warnings are to help people avoid unwarranted conclusions. And when used properly, synthetic and other artificial benchmarks can be very valuable,…"
  • Former Member over 12 years ago in reply to gdstew (+1): "> and don't understand why you think it is a good idea to keep it in the loop so you can benchmark it. Come on. It's not that complicated. Johnny wanted to know how fast his new computer was. He decided…"
Parents
  • mconners over 12 years ago

    Well, this has been interesting. (not really)

    Here is what I liked about the data.

     

    1) It cost me nothing

    2) It confirmed what I expected

     

    If the data had said that given the problem in question the BBB took twice as long to compute the answer as the RasPi, it may have given me pause and perhaps prompted me to investigate further using other more appropriate benchmarking techniques.

     

    So while at first blush it may come across as "Move along, nothing to see here" data, I find that type of data can be useful as well. Especially when it is free.

     

    Mike

  • morgaine over 12 years ago in reply to mconners

    Michael Conners wrote:

     

    So while at first blush it may come across as "Move along, nothing to see here" data, I find that type of data can be useful as well. Especially when it is free.

     

    Exactly.  Perfectly good data, perfectly usable in the correct context, and the usual cautions given to discourage people from deriving inappropriate conclusions.

     

    Why anyone would want to vent their ire over this simple gathering of useful data is hard to see, but it clearly hasn't been aimed at being helpful to the forum.

  • morgaine over 12 years ago in reply to mcb1

Replying to Mark and coder27 together, as you both commented on phraseology:

     

Mark, you're quite right that depending on the audience, a more extensive and explanatory discussion about appropriate use of benchmarks could be very useful. My comment was the bare minimum intended for an audience of engineers and techies, who typically know very well already that measuring A doesn't generally allow you to conclude anything about B. It wasn't intended as an explanation, but merely a reminder not to do anything silly with the numbers, written in a humorous style so that no expert would be affronted by teaching grandma to suck eggs.

     

That said, your suggested comment or wording addresses a very different topic to mine, namely "performance" and "raw power". I purposely said nothing about such things, because you can easily slip into the realm of inappropriate conclusions by assuming that a particular benchmark is a good metric of specific types of performance unless you have checked that it directly measures or correlates properly with them. As a result, your suggestion is much bolder than mine, but also more likely to be wrong without further study. I just advised caution, which is always safe to do.

     

    coder27, you're entirely right that calling this a "poor-man's benchmark" might be underselling the utility of this measurement.  That was not intended though, since the measurement was useful to me as I indicated, and I assumed that it would be useful to others as well.

     

    I used the phrase only to mean easy, lightweight, simple, fast, built-in, no package to buy or compile, etc etc, and not implying any criticism whatsoever.  After all, if I had been critical of the benchmark or of the results then I wouldn't have gathered and displayed these measurements in the first place!  Nevertheless, it's possible that the phrase conveyed the wrong message to some readers, so a more neutral description might have been better, I do agree.  The word lightweight seems especially suitable.

     

    coder27 wrote:

     

    Prior to the benchmark results, there were legitimate concerns raised about things like memory bandwidth on a 16-bit bus, but these concerns have been laid to rest.

     

    I've seen that matter raised a number of times myself, so it is indeed very useful to possess numeric data that addresses the issue objectively.

     

    Morgaine

     

    Addendum.  After considering the impact of "poor man's" still further, I'm now convinced that your observations require it to be changed to avoid derailing new visitors to the thread.  I've changed "poor man's" to "lightweight" in the opening article.  Many thanks!

  • Former Member over 12 years ago in reply to Former Member

coder27 wrote:

In any benchmarking activity, there will always be claims that the results might have turned out differently if a different compiler design or application structure

I'm still trying to get to the bottom of the difference between debian & angstrom on the same hardware.

The obvious difference is that Angstrom has built everything with vfpv3 and neonv1 support where debian has vfpv3-d16 and no neon. Since I don't know the insides of bc at all, I have no way of knowing if this is even relevant.

     

readelf -A /usr/bin/bc will show arch-specific details.
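As a concrete sketch of that check (the path is the one from the thread; the tag names are what recent binutils print for ARM binaries, and the command prints nothing for non-ARM ELFs):

```shell
#!/bin/sh
# List the ARM build-attribute tags of a binary. On an armhf build,
# Tag_FP_arch shows the VFP variant (e.g. VFPv3 vs VFPv3-D16) and a
# Tag_Advanced_SIMD_arch line appears when NEON code generation was
# enabled, which is exactly the Angstrom-vs-Debian difference above.
readelf -A /usr/bin/bc | grep -E 'Tag_(CPU_arch|FP_arch|Advanced_SIMD_arch)'
```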

  • Former Member over 12 years ago in reply to Former Member

just a guess, but Debian shared libraries are normally compiled with -fPIC (for Position Independent Code), and maybe Angstrom doesn't do that.

  • Former Member over 12 years ago in reply to Former Member

    possibly, but short of disassembling the code I'm not sure how to tell.

     

One other possibility is that the debian version is compiled against readline, which also pulls in ncurses; the angstrom one isn't. So possibly there's a penalty for scanning through directories to find the right termcap files.
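The readline/ncurses difference is easy to confirm from the link dependencies; a sketch (path as in the thread; on a minimal build without readline the grep simply finds nothing):

```shell
#!/bin/sh
# Show which of bc's shared-library dependencies come from readline
# and the ncurses/terminfo stack; an Angstrom-style minimal build
# would list neither.
ldd /usr/bin/bc | grep -E 'readline|ncurses|tinfo' \
  || echo "no readline/ncurses dependency"
```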

     

Unfortunately debian have screwed with the ABI: they have ld-linux buried under /lib/arm-linux-gnueabihf instead of simply /lib, making it hard to take a binary from angstrom and run it on debian.

     

Anyway, having persuaded the angstrom binary to run on debian on the sabre-lite, it runs in 9.7s vs 12.8s for the debian native version.

It's still using the debian glibc, which probably points more to readline/ncurses than to -fPIC.

  • Former Member over 12 years ago in reply to Former Member

I think Debian is compiled for ARMv4. Angstrom might be compiled for ARMv7, which may account for the difference. Angstrom might also be compiled with a later version of GCC than Debian. I'm pretty sure the vector floating point (vfp) and neon are not relevant factors.

Edited to add: Looking again at your table, I see Debian listed as ARMHF, which rules out ARMv4, so maybe that isn't the difference.

  • morgaine over 12 years ago in reply to Former Member

    selsinork wrote:

     

Anyway, having persuaded the angstrom binary to run on debian on the sabre-lite, it runs in 9.7s vs 12.8s for the debian native version.

     

    I added your new data point to the table.

     

This is really cool; I think it might lead to some interesting analysis and end up improving our understanding of the figures.

  • Former Member over 12 years ago in reply to Former Member

coder27 wrote:

I think Debian is compiled for ARMv4. Angstrom might be compiled for ARMv7, which may account for the difference. Angstrom might also be compiled with a later version of GCC than Debian. I'm pretty sure the vector floating point (vfp) and neon are not relevant factors.

Edited to add: Looking again at your table, I see Debian listed as ARMHF, which rules out ARMv4, so maybe that isn't the difference.

     

The debian gcc was built with these options:

--with-arch=armv7-a --with-fpu=vfpv3-d16 --with-float=hard
gcc version 4.7.2 (Debian 4.7.2-4)

The angstrom one is slightly newer and is a Linaro build: gcc version 4.7.3 20130205 (prerelease) (Linaro GCC 4.7-2013.02-01). However, it was cross-compiled, and the particular arch settings don't appear in the output of gcc -v.

     

Since both the BBB and Sabre-Lite are armv7, it seems that it'll be possible to get the same rootfs to run on either, so I should be able to put the debian armhf that I have on the SL onto the BBB and the BBB's angstrom onto the SL.

I'll give it a try and let you know how I get on.

  • johnbeetem over 12 years ago in reply to Former Member

    selsinork wrote:

     

    I'm still trying to get to the bottom of the difference between debian & angstrom on the same hardware

    Any chance they're using different amounts of L2 cache?  From a quick read, it looks like the Sabre-Lite has 1MB L2 cache which the application should easily fit into, but maybe Ångström only allocates 256KB?

     

General principle: benchmarks that fit into on-chip cache don't test external memory performance.
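On Linux the cache sizes themselves are easy to inspect, which helps when judging whether a benchmark fits in cache; a sketch using the standard sysfs cacheinfo nodes (some embedded kernels of that era didn't export them, in which case the loop prints nothing):

```shell
#!/bin/sh
# Print level, type and size of each cache visible to CPU 0 from sysfs.
for d in /sys/devices/system/cpu/cpu0/cache/index*; do
  [ -f "$d/size" ] || continue
  printf 'L%s %s: %s\n' "$(cat "$d/level")" "$(cat "$d/type")" "$(cat "$d/size")"
done
```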

  • Former Member over 12 years ago in reply to johnbeetem

Well yes, cache can be a factor. IME, usually about the only thing you can do with cache is turn it on or off; beyond that, the cache architecture will govern what gets used and how.

According to the datasheet the AM3359 only has 256KB, so as you say, if it fits in that then it should fit in the iMX6's 1MB. However, we're talking A8 vs A9, and from different implementers. Chances of the cache architecture being somehow different are good.

     

    So yes, maybe it's simply a case of debian being built with every possible knob turned on and therefore suffering from bloat and excessive cache misses.

Angstrom being the roadrunner to debian's coyote.

     

Now all we need is to get Acme Corporation to start building BBB and all our problems will be solved.

    • Cancel
    • Vote Up 0 Vote Down
    • Sign in to reply
    • Cancel
  • morgaine over 12 years ago in reply to Former Member

    selsinork wrote:

     

    Well yes, cache can be a factor.  IME usually about the only thing you can do with cache is turn it on or off, beyond that the cache architecture will govern what gets used and how.

     

    Turning off the on-chip cache would eliminate that variable from any tests that want to measure external memory throughput.  If it's not feasible during normal running, it might still be possible as a kernel boot option, which would be good enough for testing even if not all that convenient.  I haven't come across such a kernel option yet, but it wouldn't surprise me if it were available among the CPU options.

     

Inevitably, turning off the cache will be architecture-dependent, although that's not really an obstacle when the entire point is to make a device-dependent measurement. A quick web search shows that 32-bit Intel CPUs allow you to turn the cache off by setting bit 30 of control register cr0. People have done this by loading a kernel module to do the operation, apparently with success judging by the large slowdown.

     

    It doesn't directly address our interest though, which is to measure cached throughput identically on two different devices which may have different cache sizes and may not be caching a given measurement program in the same way.  That's substantially harder.
