SBC CPU Throughput

Forum thread in the Single-Board Computers group: 88 replies, 63 subscribers, 8480 views
Tags: cubieboard, olinuxino, sabrelite, bbb, BeagleBone, rpi

morgaine over 12 years ago

I notice that people are doing some initial benchmarking of BBB and other boards on the RPF forum. Results roughly as expected, I guess.

Using just a simple

time echo "scale=2000;4*a(1)" | bc -l

as a lightweight benchmark (with bc -l, a() is arctangent, so this computes pi to 2000 decimal places), I see these numbers reported (smaller Time is better):

[table now updated with extra datapoints reported in the current thread below]

Submitter  | Time (s) | Board               | SoC        | Clock (MHz) | O/S
---------- | -------- | ------------------- | ---------- | ----------- | ------------------------------------------
shuckle    | 26.488   | Raspberry Pi B      | BCM2835    | 700         | Raspbian 3.1.9
morgaine   | 25.719   | Raspberry Pi B      | BCM2835    | 700         | Raspbian 3.1.9+ #272
shuckle    | 25.009   | Raspberry Pi B      | BCM2835    | 700         | Raspbian 3.2.27
trn        | 24.280   | Raspberry Pi B      | BCM2835    | 700         | Raspbian ?
morgaine   | 22.456   | Raspberry Pi B      | BCM2835    | 800         | Raspbian 3.1.9+ #272
morgaine   | 21.256   | Raspberry Pi B      | BCM2835    | 800         | Raspbian 3.6.11+ #545, new firmware only
selsinork  | 21.0     | Minnowboard         | Atom E640T | 1000        | Angstrom minnow-2013.07.10.img
shuckle    | 17.0     | Raspberry Pi B      | BCM2835    | 1000        | Raspbian ?
morgaine   | 16.153   | BB (white)          | AM3359     | 720         | Angstrom v2012.01-core 3.2.5+, user-gov
selsinork  | 15.850   | A20-OLinuXino-MICRO | A20        | 912         | Debian 7.0, 3.4.67+
selsinork  | 15.328   | Cubieboard          | A20        | 912         | Ubuntu/Debian 7.1
pluggy     | 14.510   | BBB                 | AM3359     | 1000        | Debian
morgaine   | 14.153   | BBB                 | AM3359     | 1000        | Debian 7.0, 3.8.13-bone20, perf-gov
selsinork  | 13.927   | A10-OLinuXino-LIME  | A10        | 1000        | Debian 7.0, 3.4.67+
Heydt      | 13.159   | Cubieboard          | A10        | 1000        | ?
selsinork  | 12.8     | Sabre-lite          | i.MX6      | 1000        | Debian armhf
selsinork  | 12.752   | Cubieboard          | A20        | 912         | Ubuntu/Debian 7.1 + Angstrom bc
selsinork  | 12.090   | BBB                 | AM3359     | 1000        | Angstrom dmnd-gov
pluggy     | 11.923   | BBB                 | AM3359     | 1000        | Angstrom
selsinork  | 11.86    | BBB                 | AM3359     | 1000        | Angstrom perf-gov
selsinork  | 9.7      | Sabre-lite          | i.MX6      | 1000        | Debian armhf + Angstrom bc
selsinork  | 9.606    | Sabre-lite          | i.MX6      | 1000        | LFS 3.12, gcc-4.8.2, glibc-2.18

As usual, take benchmarks with a truckload of salt, and evaluate with a suitable mixture of suspicion, snoring, and mirth. Use the numbers wisely, and don't draw inappropriate conclusions.
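If you'd like to add a datapoint, here is a minimal sketch for collecting one (assuming bash and a bc built with its math library; CPU governors and background load move the numbers, so run it a few times and report a typical real time):

# Run the pi-to-2000-decimal-places benchmark three times;
# the digits go to /dev/null, the timings go to the terminal (stderr).
for run in 1 2 3; do
    time echo "scale=2000;4*a(1)" | bc -l > /dev/null
done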

 

Morgaine.


gdstew over 12 years ago

> As usual, take benchmarks with a truckload of salt, and evaluate with a suitable mixture of suspicion, snoring, and mirth.

You're absolutely right! As an indicator of real-world application performance this "benchmark" is worthless. Thanks for sharing it.

morgaine over 12 years ago in reply to gdstew

Data is always good, and sharing it is also good. The warnings are to help people avoid unwarranted conclusions.

And when used properly, synthetic and other artificial benchmarks can be very valuable, for example as a way of checking that an upgrade hasn't altered your compiler optimization defaults. As part of regression testing, they're a very useful engineering tool. You just have to be conscious of their limits, appropriate use versus inappropriate use.

Former Member over 12 years ago in reply to morgaine

I think this benchmark is actually pretty decent in a lot of ways.

1) It is trivially easy to run: nothing to download, nothing to compile. So you can easily get results from lots of people, allowing you to see how consistent the results are, and they seem to be pretty consistent.

2) It is not subject to personal differences in what compiler was used to compile it, or what optimization levels or other compiler switches were used, although it will exhibit such differences between distros (see the sketch after this list).

3) It doesn't rely on computing the same value over and over in a loop. Benchmarks that do that can be overly sensitive to compiler loop optimizations, and to just-in-time code-generation techniques.

4) It has a pretty well-understood area of application: integer compute bound. Obviously you wouldn't use it to measure floating-point performance, or GPU performance, or I/O performance, etc.

5) It uses data that is large enough to show the benefit of large data caches, similar to typical user applications.

6) It takes about the right amount of time to run: not so short that the time to load the benchmark matters, or that the accuracy of the clock matters, and not so long that you can't easily run it several times to see that the results are consistent.
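Point 2 cuts both ways: as the "+ Angstrom bc" rows in the table show, the time depends heavily on how the distro built bc itself. A minimal sketch for checking that on one board, assuming you have a second bc build available (the /opt path below is hypothetical; substitute wherever yours lives):

# Time the identical calculation under two different bc builds.
time echo "scale=2000;4*a(1)" | /usr/bin/bc -l > /dev/null
# Hypothetical location of an alternative build, e.g. one copied from another distro:
time echo "scale=2000;4*a(1)" | /opt/other/bin/bc -l > /dev/null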

gdstew over 12 years ago in reply to Former Member

> 1) It is trivially easy to run: nothing to download, nothing to compile. So you can easily get results from lots of people, allowing you to see how consistent the results are, and they seem to be pretty consistent.

I was totally unaware that triviality in any form was considered a good trait for a benchmark. Doesn't make sense to me though.

> 2) It is not subject to personal differences in what compiler was used to compile it, or what optimization levels or other compiler switches were used, although it will exhibit such differences between distros.

Absolutely no way to know this at all. The programs run in the "benchmark" were almost certainly compiled using different versions of GCC, with differing levels of optimization built in and with unknown compile switches used (most are probably the same, some depend on the CPU) when the OS was built.

> 3) It doesn't rely on computing the same value over and over in a loop. Benchmarks that do that can be overly sensitive to compiler loop optimizations, and to just-in-time code-generation techniques.

That is why good synthetic benchmarks consist of many programs. The level of loop optimization available is a good thing to know, as is knowing whether JIT is something you can use if you prefer it (see first statement).

> 4) It has a pretty well-understood area of application: integer compute bound. Obviously you wouldn't use it to measure floating-point performance, or GPU performance, or I/O performance, etc.

I think all those other things are actually good things to benchmark too, since they can all affect applications.

> 5) It uses data that is large enough to show the benefit of large data caches, similar to typical user applications.

Although in the real world this will probably be the exception, not the rule. Yes, big caches are good, but using a benchmark that executes (or mostly executes) in cache, or keeps (most of) its data in cache, skews the results too. Something I believe you mentioned earlier as not being desirable.

> 6) It takes about the right amount of time to run: not so short that the time to load the benchmark matters, or that the accuracy of the clock matters, and not so long that you can't easily run it several times to see that the results are consistent.

OK. Not really what I consider to be in the top 10 on my list of desirable benchmark traits, and pretty much in direct opposition to getting useful results. The phrase that comes to mind is "quick and dirty". Personally, I don't want to wait a really long time for results either, so I prefer to be able to choose what I need to check and how many iterations to run for reasonably repeatable results.

Former Member over 12 years ago in reply to gdstew

> The level of loop optimization available is a good thing to know ...

No, it's not.

In order to have a benchmark that runs long enough to get decent timings, many benchmarks put a loop around the code they want to test, on the theory that it will take N times longer to execute a loop N times than it would to do it once. That theory is just plain wrong, because any decent compiler will hoist as much code as possible outside the loop, where it's only done once. In some cases, the entire contents of the loop can be done only once. So you think you're measuring the time it takes to do some computation, but you're really not.

It's tempting to think that it doesn't matter, because a compiler that does good loop optimizations is better than one that doesn't. Which may be true to some extent. But the problem is that your application most likely doesn't repeatedly do the same calculation over and over in a loop, so your application won't see the same speedup as your synthetic benchmark.

gdstew over 12 years ago in reply to Former Member

> In order to have a benchmark that runs long enough to get decent timings, many benchmarks put a loop around the code they want to test, on the theory that it will take N times longer to execute a loop N times than it would to do it once. That theory is just plain wrong, because any decent compiler will hoist as much code as possible outside the loop, where it's only done once. In some cases, the entire contents of the loop can be done only once. So you think you're measuring the time it takes to do some computation, but you're really not.

Yes, it is good to know where and how much the loops have been inlined, otherwise bad interpretations of the results are probable. So I guess that you should know what you are doing to get good results.

> That theory is just plain wrong, because any decent compiler will hoist as much code as possible outside the loop, where it's only done once.

In most decent compilers the level of inlining available is also usually selectable, using one or more compile-time directives, or by the compiler itself, mainly (but not exclusively) due to code size limitations.

Former Member over 12 years ago in reply to gdstew

> Yes, it is good to know where and how much the loops have been inlined, otherwise bad interpretations of the results are probable.

Let me spell it out for you. I said nothing about inlining. Inlining is something that applies to subprograms, and reduces the call/return overhead. It doesn't apply to loops at all, although there is an optimization called loop unrolling that is similar to subprogram inlining.

The loop optimization I referred to is called loop-invariant code hoisting. It involves moving code from inside the loop to outside (before) the loop, where it is only done once instead of N times. This optimization prevents you from knowing how long the code you were intending to measure takes to run. And it is very significant that this pi benchmark isn't susceptible to this optimization.

gdstew over 12 years ago in reply to Former Member

Loop unrolling is what I meant to say. Loop-invariant code hoisting removes code that should not be in the loop to begin with, because the result obtained from executing the code does not change (invariant) with each iteration of the loop, so you are just wasting CPU cycles each time it executes inside the loop. I personally try to keep such code out of loops in the first place, because as far as I know it is generally considered to be bad programming (wasting CPU cycles and all that), and I don't understand why you think it is a good idea to keep it in the loop so you can benchmark it.

Former Member over 12 years ago in reply to gdstew

> and don't understand why you think it is a good idea to keep it in the loop so you can benchmark it.

Come on. It's not that complicated.

Johnny wanted to know how fast his new computer was. He decided to measure how long it takes to multiply two numbers together. He wrote a program that did that, but it ran so fast that his stopwatch was useless. He got a brilliant idea: put the multiplication inside a loop that iterates 1,000,000 times, time that, and divide the time by 1,000,000. But his compiler recognized the multiplication as loop invariant, and hoisted it out of the loop, so his program only ended up doing one multiplication, and his stopwatch was still useless. Finally he recognized that an important feature of a benchmark program is that it avoids doing a calculation repeatedly in a loop.

gdstew over 12 years ago in reply to Former Member

> Come on. It's not that complicated.

Read it again, carefully. Yes, you can waste as many CPU cycles as you want executing code over and over again inside a loop that produces a result that never changes (that's why it's called invariant), no matter how many times the code is run in the loop.

Why would you want to? It serves no purpose at all, other than wasting CPU cycles.

Former Member over 12 years ago in reply to gdstew

Johnny put his multiplication statement inside a loop so that his program would run long enough for his stopwatch to be useful. Turned out his stopwatch still wasn't useful, because the compiler hoisted the multiplication outside of the loop. So Johnny learned his lesson, and from now on insists on benchmarks that don't involve a loop containing an invariant calculation.

For example, Johnny would much rather benchmark a calculation of pi to 2000 digits than he would a loop that calculates 200 digits 10 times over.
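In shell-and-bc terms, the two shapes Johnny is choosing between look like this. (Nothing gets hoisted here, since each bc invocation is a separate process; in compiled code, the looped form is the one a compiler could gut.)

# One large calculation: the shape Johnny prefers.
time echo "scale=2000;4*a(1)" | bc -l > /dev/null

# A looped mini-benchmark: the same invariant calculation at a tenth the scale, ten times.
time for i in 1 2 3 4 5 6 7 8 9 10; do
    echo "scale=200;4*a(1)" | bc -l > /dev/null
done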

gdstew over 12 years ago in reply to Former Member

> So Johnny learned his lesson, and from now on insists on benchmarks that don't involve a loop containing an invariant calculation.

Not sure why any good benchmark would do that anyway, since it wouldn't produce useful results.

Former Member over 12 years ago in reply to gdstew

You're exactly right. No good benchmark would do that, which is why I said:

> 3) It doesn't rely on computing the same value over and over in a loop. Benchmarks that do that can be overly sensitive to compiler loop optimizations, and to just-in-time code-generation techniques.
