SBC CPU Throughput

Forum thread · 88 replies · 63 subscribers · 8453 views
Tags: cubieboard, olinuxino, sabrelite, bbb, BeagleBone, rpi
morgaine · over 12 years ago

I notice that people are doing some initial benchmarking of BBB and other boards on the RPF forum. Results roughly as expected, I guess.

Using just a simple

    time echo "scale=2000;4*a(1)" | bc -l

as a lightweight benchmark, I see these numbers reported (smaller Time is better):

[table now updated with extra datapoints reported in current thread below]

Submitter  | Time (s) | Board               | SoC        | Clock (MHz) | O/S
-----------|----------|---------------------|------------|-------------|------------------------------------------
shuckle    | 26.488   | Raspberry Pi B      | BCM2835    | 700         | Raspbian 3.1.9
morgaine   | 25.719   | Raspberry Pi B      | BCM2835    | 700         | Raspbian 3.1.9+ #272
shuckle    | 25.009   | Raspberry Pi B      | BCM2835    | 700         | Raspbian 3.2.27
trn        | 24.280   | Raspberry Pi B      | BCM2835    | 700         | Raspbian ?
morgaine   | 22.456   | Raspberry Pi B      | BCM2835    | 800         | Raspbian 3.1.9+ #272
morgaine   | 21.256   | Raspberry Pi B      | BCM2835    | 800         | Raspbian 3.6.11+ #545, new firmware only
selsinork  | 21.0     | Minnowboard         | Atom E640T | 1000        | Angstrom minnow-2013.07.10.img
shuckle    | 17.0     | Raspberry Pi B      | BCM2835    | 1000        | Raspbian ?
morgaine   | 16.153   | BB (white)          | AM3359     | 720         | Angstrom v2012.01-core 3.2.5+, user-gov
selsinork  | 15.850   | A20-OLinuXino-MICRO | A20        | 912         | Debian 7.0, 3.4.67+
selsinork  | 15.328   | Cubieboard          | A20        | 912         | Ubuntu/Debian 7.1
pluggy     | 14.510   | BBB                 | AM3359     | 1000        | Debian
morgaine   | 14.153   | BBB                 | AM3359     | 1000        | Debian 7.0, 3.8.13-bone20, perf-gov
selsinork  | 13.927   | A10-OLinuXino-LIME  | A10        | 1000        | Debian 7.0, 3.4.67+
Heydt      | 13.159   | Cubieboard          | A10        | 1000        | ?
selsinork  | 12.8     | Sabre-lite          | i.MX6      | 1000        | Debian armhf
selsinork  | 12.752   | Cubieboard          | A20        | 912         | Ubuntu/Debian 7.1 + Angstrom bc
selsinork  | 12.090   | BBB                 | AM3359     | 1000        | Angstrom dmnd-gov
pluggy     | 11.923   | BBB                 | AM3359     | 1000        | Angstrom
selsinork  | 11.86    | BBB                 | AM3359     | 1000        | Angstrom perf-gov
selsinork  | 9.7      | Sabre-lite          | i.MX6      | 1000        | Debian armhf + Angstrom bc
selsinork  | 9.606    | Sabre-lite          | i.MX6      | 1000        | LFS 3.12, gcc-4.8.2, glibc-2.18

 

 

As usual, take benchmarks with a truckload of salt, and evaluate with a suitable mixture of suspicion, snoring, and mirth. Use the numbers wisely, and don't draw inappropriate conclusions.
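For completeness, what the one-liner actually does: bc -l loads bc's math library, in which a(x) is arctangent, so 4*a(1) is pi evaluated to scale=2000 decimal digits, and time reports how long bc takes to produce it. A minimal C sketch of the same measurement, for anyone who prefers a compiled harness (an illustration only; it assumes a POSIX system with bc on the PATH, and is not code posted in this thread):

    /* Hypothetical harness: reproduces what
       `time echo "scale=2000;4*a(1)" | bc -l` measures (wall time). */
    #define _POSIX_C_SOURCE 200809L
    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);

        /* bc -l loads the math library, where a(1) = arctan(1) = pi/4 */
        FILE *p = popen("echo 'scale=2000;4*a(1)' | bc -l", "r");
        if (!p) { perror("popen"); return 1; }

        char buf[4096];
        while (fgets(buf, sizeof buf, p))
            ;                              /* drain the 2000-digit result */
        pclose(p);

        clock_gettime(CLOCK_MONOTONIC, &t1);
        double secs = (t1.tv_sec - t0.tv_sec)
                    + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("elapsed: %.3f s\n", secs);
        return 0;
    }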

 

Morgaine.


Replies

  • gdstew · over 12 years ago, in reply to Former Member

> 1) it is trivially easy to run, nothing to download, nothing to compile, so you can easily get results from lots of people, allowing you to see how consistent the results are, and they seem to be pretty consistent.

I was totally unaware that triviality in any form was considered a good trait for a benchmark. Doesn't make sense to me, though.

> 2) it is not subject to personal differences in what compiler was used to compile it, or what optimization levels or other compiler switches were used, although it will exhibit such differences between distros.

There is absolutely no way to know this at all. The programs run in the "benchmark" were almost certainly compiled using different versions of GCC, with differing levels of optimization built in and with unknown compile switches used (most are probably the same, some depend on the CPU) when the OS was built.

> 3) it doesn't rely on computing the same value over and over in a loop. Benchmarks that do that can be overly sensitive to compiler loop optimizations, and to just-in-time code-generation techniques.

That is why good synthetic benchmarks consist of many programs. The level of loop optimization available is a good thing to know, as is knowing whether JIT is something you can use if you prefer to (see first statement).

> 4) It has a pretty-well understood area of application, integer compute bound. Obviously you wouldn't use it to measure floating-point performance, or gpu performance, or I/O performance, etc.

I think all those other things are actually good things to benchmark too, since they can all affect applications.

> 5) It uses data that is large enough to show the benefit of large data caches, similar to typical user applications.

Although in the real world this will probably be the exception, not the rule. Yes, big caches are good, but using a benchmark that executes (or mostly executes) in cache, or keeps (most of) its data in cache, skews the results too. Something I believe you mentioned earlier as not being desirable.

> 6) It takes about the right amount of time to run--not so short that the time to load the benchmark matters, or that the accuracy of the clock matters, and not so long that you can't easily run it several times to see that the results are consistent.

OK. Not really what I consider to be in the top 10 on my list of desirable benchmark traits, and pretty much in direct opposition to getting useful results. The phrase that comes to mind is "quick and dirty". Personally, I don't want to wait a really long time for results either, so I prefer to be able to choose what I need to check and how many iterations to run for reasonably repeatable results.

  • Former Member · over 12 years ago, in reply to gdstew

> The level of loop optimization available is a good thing to know ...

No, it's not.

In order to have a benchmark that runs long enough to get decent timings, many benchmarks put a loop around the code they want to test, on the theory that it will take N times longer to execute a loop N times than it would to do it once. That theory is just plain wrong, because any decent compiler will hoist as much code as possible outside the loop, where it's only done once. In some cases, the entire contents of the loop can be done only once. So you think you're measuring the time it takes to do some computation, but you're really not.

It's tempting to think that it doesn't matter, because a compiler that does good loop optimizations is better than one that doesn't. Which may be true to some extent. But the problem is that your application most likely doesn't repeatedly do the same calculation over and over in a loop, so your application won't see the same speedup as your synthetic benchmark.
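To make the failure mode concrete, a hedged C sketch (hypothetical code, not something posted in this thread): in the naive loop the body is invariant, so an optimizing compiler may compute it once, or delete the loop entirely, and the measured time stops scaling with N; the second loop carries a dependency from one iteration to the next, so the work cannot be hoisted.

    /* Hypothetical example of a benchmark loop defeated by
       loop-invariant optimization, plus one way to prevent it. */
    #include <stdio.h>

    #define N 100000000L

    int main(void)
    {
        const double x = 123.456, y = 789.012;

        /* naive: r never depends on i, so at -O2 the compiler may do
           the division once; the timing then says nothing about
           divide throughput */
        double r = 0.0;
        for (long i = 0; i < N; i++)
            r = x / y;

        /* safer: each iteration depends on the previous result, so
           the division cannot be moved out of the loop */
        double acc = 1.0;
        for (long i = 0; i < N; i++)
            acc = acc / y + x;

        printf("%f %f\n", r, acc);   /* use the results so they are not dead */
        return 0;
    }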

  • gdstew · over 12 years ago, in reply to morgaine

> The interpretation that people give to good data is an entirely different matter, and my cautions were the usual advice about taking perfectly good numbers and making wholly incorrect conclusions about them.

How can you provide a useful interpretation of data that, in your own words, you should "take (benchmarks) with a truckload of salt, and evaluate with a suitable mixture of suspicion, snoring, and mirth"? I mean really, if this is your idea of normal cautions for data, I'd like to see your idea of a real bona-fide warning.

> AFAWK the numbers are totally accurate.

But not really useful for the real world (see previous responses), which is the point you keep dancing around.

> Even you have agreed with that,

I agree with what you said, but not that the data in this "benchmark" is good in the sense that a useful interpretation of much of anything is possible using it. Which is again the point you keep dancing around. It is far too simplistic a benchmark to provide that.

> so you're really just looking for a fight as usual.

Something I've seen you do on numerous occasions yourself.

  • Former Member · over 12 years ago, in reply to gdstew

> As to the benchmark itself, no single line program line can be considered as a valuable synthetic benchmark of anything.
...
> I was totally unaware that triviality in any form was considered a good trait for a benchmark. Doesn't make sense to me, though.
...
> It is far too simplistic a benchmark to provide that.

You obviously have no clue what this benchmark does. It isn't a "single line program" at all. Invoking the program takes a single line, and that is a good thing. There's a very important difference between the command to invoke a program and the program itself.

  • gdstew · over 12 years ago, in reply to Former Member

> In order to have a benchmark that runs long enough to get decent timings, many benchmarks put a loop around the code they want to test, on the theory that it will take N times longer to execute a loop N times than it would to do it once. That theory is just plain wrong, because any decent compiler will hoist as much code as possible outside the loop, where it's only done once. In some cases, the entire contents of the loop can be done only once. So you think you're measuring the time it takes to do some computation, but you're really not.

Yes, it is good to know where and how much the loops have been inlined, otherwise bad interpretations of the results are probable. So I guess you should know what you are doing to get good results.

> That theory is just plain wrong, because any decent compiler will hoist as much code as possible outside the loop, where it's only done once.

In most decent compilers the level of inlining available is also usually selectable, using one or more compile-time directives, or by the compiler itself, mainly (but not exclusively) due to code size limitations.

  • morgaine · over 12 years ago, in reply to gdstew

Gary Stewart wrote:

> How can you provide a useful interpretation of data that, in your own words, you should "take (benchmarks) with a truckload of salt, and evaluate with a suitable mixture of suspicion, snoring, and mirth"?

A rational person can provide a useful interpretation by using ordinary engineering knowledge and common sense.

I refer you to coder27's post about the data confirming that execution times are proportional to clock rates for a given architecture, and likewise confirming that there is a significant improvement from ARM11 to Cortex-A8 for a given clock rate. These are helpful observations in that they confirm what is expected, and if the numbers had indicated something entirely different then we would have some very serious issues to investigate.
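As a quick sanity check of that proportionality, using only figures already in the table above: scaling my 700 MHz Raspberry Pi time by the clock ratio predicts

    25.719 s × (700 / 800) ≈ 22.50 s

for the 800 MHz run, and the measured figure was 22.456 s, a difference of well under one percent.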

     

> I wrote:
> AFAWK the numbers are totally accurate.

> But not really useful for the real world (see previous responses), which is the point you keep dancing around.

See the previous section. It is you who chose to dance around and ignore the useful and helpful interpretations of this data, explained well in coder27's post, and instead dived in here directly at me without provocation or valid reason.

I have consistently stated that good data is useful when used appropriately, but unhelpful when used inappropriately. I clearly said "Results roughly as expected" in my opening post, which points to the data being useful, and then I gave the usual cautions about using benchmark data wrongly to reach inappropriate conclusions.

Nobody here has made any inappropriate conclusions from this data, so whatever are you arguing about?

Morgaine.

  • shabaz · over 12 years ago, in reply to Former Member

As far as I know, some open source* and possibly some commercial benchmarks do a similar computation as part of a CPU-intensiveness test (at least for single-core processors). For an embedded app where we may not be interested in, say, file I/O, multimedia extensions, or multi-core results, it probably does have some value.

*Source: OSmark

  • Former Member · over 12 years ago, in reply to gdstew

> Yes, it is good to know where and how much the loops have been inlined, otherwise bad interpretations of the results are probable.

Let me spell it out for you. I said nothing about inlining. Inlining is something that applies to subprograms, and reduces the call/return overhead. It doesn't apply to loops at all, although there is an optimization called loop unrolling that is similar to subprogram inlining.

The loop optimization I referred to is called loop-invariant code hoisting. It involves moving code from inside the loop to outside (before) the loop, where it is only done once instead of N times. This optimization prevents you from knowing how long the code you were intending to measure actually takes to run. And it is very significant that this Pi benchmark isn't susceptible to this optimization.
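To make the distinction concrete, a hedged before/after sketch in C (hypothetical code, not from this thread; GCC, for instance, typically performs this transformation at -O1 and above):

    /* before: x * y is the same on every iteration, yet is
       recomputed n times */
    void scale_before(double *out, const double *in, int n,
                      double x, double y)
    {
        for (int i = 0; i < n; i++)
            out[i] = in[i] * (x * y);
    }

    /* after: the invariant product is hoisted above the loop and
       computed once; note this is neither loop unrolling (replicating
       the body) nor subprogram inlining (removing call overhead) */
    void scale_after(double *out, const double *in, int n,
                     double x, double y)
    {
        double xy = x * y;           /* hoisted loop invariant */
        for (int i = 0; i < n; i++)
            out[i] = in[i] * xy;
    }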

  • gdstew · over 12 years ago, in reply to Former Member

Yes, actually, I do know that it runs as an interpreted program in the bash shell: it executes echo to send a math expression to the external compiled program (bc), which computes pi to 2000 digits, and times how long that takes. So you are mainly testing floating-point performance. While this is not really simplistic (not really complicated either), it is not really useful for much of anything other than as an FP benchmark.

  • Former Member · over 12 years ago, in reply to gdstew

    floating point doesn't get you 2000 digits.
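To make that concrete with a hypothetical snippet (illustration only, not code from this thread): an IEEE-754 double carries roughly 15 to 17 significant decimal digits, so hardware floating point cannot get anywhere near a 2000-digit result; bc does its arithmetic in software at arbitrary precision instead.

    /* The hardware floating-point ceiling: digits printed beyond
       about the 17th are rounding noise, nothing like bc's 2000.
       Build with: cc pi.c -lm */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double pi = 4.0 * atan(1.0);   /* same formula as the bc one-liner */
        printf("%.30f\n", pi);
        return 0;
    }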
