Single-Board Computers Forum

Forum Thread Details
  • Replies: 69
  • Subscribers: 58
  • Views: 8492
  • Tags: nuttcp, network, raspberry-pi, bbb, BeagleBone, throughput

SBC Network Throughput

morgaine over 12 years ago

Our earlier lightweight CPU benchmarking provided some confidence that the various boards tested had no major performance faults and were working roughly in line with expectations given their clock speeds and processor families.  Networking is an area of performance that either doesn't get measured much or is measured by ad hoc means that are hard to compare, and implementation anomalies are known to occur occasionally.

 

To try to put this on a more quantitative and even footing, I've picked a network measurement system with an extremely long pedigree, the TTCP family of utilities.  This has evolved from the original "ttcp" of the 1980s through "nttcp" and finally into "nuttcp".  It has become a very useful networking tool: simple to use, repeatable, open source, cross-platform, and it works over both IPv4 and IPv6.  It's in the Debian repository, and if the O/S to be tested doesn't have it, it can be compiled from source just by typing 'make' on the great majority of systems.  (I cross-compiled it for Angstrom.)
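
For reference, a minimal install sketch (package name as in the Debian repository; the from-source step assumes an already downloaded and unpacked nuttcp source tree):

     # From the Debian/Raspbian repository:
     sudo apt-get install nuttcp

     # Or, inside an unpacked source tree on a distribution that doesn't package it:
     make
     # ...then copy the resulting nuttcp binary somewhere on your PATH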

 

Usage is extremely simple.  A pair of machines is required to test the link between them.  One is nominated the 'server' and has "nuttcp -S" executed on it, which turns it into a daemon running in the background.  The other is nominated the 'client', and all the tests are run from it regardless of desired direction.  The two most common tests to run on the client are a Transmission Test (Tx) using "nuttcp -t server", and a Reception Test (Rx) using "nuttcp -r server", both executed on the client with the hostname or IP address of the 'server' provided as argument.

 

These simple tests transfer data at maximum rate in the specified direction over TCP (by default), for an interval of approximately 10 seconds, and on completion the measured throughput is returned in Mbps for easiest comparison with the rated Mbps speed of the link.  Here is a table showing my initial tests executed on various ARM client boards through a gigabit switch, with the server (nuttcp -S) running on a 2.33GHz Core2 Duo machine possessing a gigabit NIC.  The final set of results was obtained between the Core2 Duo and an old Xeon server over a fully gigabit network path, just to confirm that the Core2 Duo wasn't bottlenecked in the ARM board tests.
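
To make the workflow concrete, a typical session looks something like this (hostnames are placeholders; the summary line format matches the nuttcp output quoted later in this thread):

     # On the machine nominated as server (runs as a background daemon):
     nuttcp -S

     # On the client: transmit test (client -> server), then receive test (server -> client):
     nuttcp -t server.example
     nuttcp -r server.example

     # Each run ends with a single summary line of the form:
     #   113.3673 MB /  10.07 sec =   94.4586 Mbps 12 %TX 10 %RX 0 retrans 0.56 msRTT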

 

 

Max theoretical TCP throughput over 100Mbps Ethernet is 94.1482 Mbps with TCP timestamps, or 94.9285 Mbps without.

For fairness, rows are ordered by four attributes: 1) Fast or Gigabit Ethernet, 2) TCP timestamps (TS) or not, 3) ARM frequency, 4) Rx speed.

 

Submitter  | Rx Mbps | Tx Mbps | Client Board   | SoC           | MHz  | Limits     | O/S, kernel, driver
-----------|---------|---------|----------------|---------------|------|------------|----------------------------
selsinork  |   30.60 |   17.28 | 233-OLinuXino  | i.MX23 ARM926 |  233 | No TS      | ArchLinux 3.7.2-2
morgaine   |   93.84 |   72.82 | RPi Model B    | BCM2835       |  700 |            | Raspbian 3.1.9+ #272
morgaine   |   93.84 |   93.75 | BB (white)     | AM3359        |  720 |            | Angstrom v2012.01, 3.2.5+
Tim.Annan  |   94.14 |   91.74 | Gumstix Pepper | AM3359        |  600 | 100M mode  | Yocto 9.0.0 Dylan, 3.2
morgaine   |   93.82 |   76.94 | RPi Model B    | BCM2835       |  800 |            | Raspbian 3.1.9+ #272
morgaine   |   93.82 |   78.71 | RPi Model B    | BCM2835       |  800 | 7/2012 u/s | Raspbian 3.6.11+ #545
morgaine   |   94.14 |   78.87 | RPi Model B    | BCM2835       |  800 | 9/2013 u/s | Raspbian 3.6.11+ #545
morgaine   |   93.80 |   93.75 | BBB            | AM3359        | 1000 |            | Angstrom v2012.12, 3.8.6
selsinork  |   93.92 |   94.46 | Cubieboard2    | A20           |  912 | VLAN TS    | Debian 7.1, 3.3.0+
morgaine   |   94.16 |   94.14 | BBB            | AM3359        | 1000 |            | Debian 7.0, 3.8.13-bone20
selsinork  |   94.33 |   94.55 | Cubieboard2    | A20           |  912 | No TS      | Debian 7.1, 3.3.0+
selsinork  |   94.91 |   94.90 | BBB            | AM3359        | 1000 | No TS      | Angstrom 3.8.6
selsinork  |   94.94 |   94.91 | i.MX53-QSB     | i.MX53        |  996 | No TS      | 3.4.0+
selsinork  |  243.30 |  454.88 | Sabre-Lite     | i.MX6         |  996 | No TS      | 3.0.15-ts-armv7l
Tim.Annan  |  257.79 |  192.22 | Gumstix Pepper | AM3359        |  600 | Gbit mode  | Yocto 9.0.0 Dylan, 3.2
notzed     |  371.92 |  324.49 | Parallella-16  | Zynq-70x0     |  800 |            | Ubuntu Linaro
selsinork  |  525.18 |  519.41 | Cubietruck     | A20           | 1000 | No TS      | LFS-ARM 3.4.67 + gmac
selsinork  |  715.63 |  372.17 | Minnowboard    | Atom E640     | 1000 | No TS      | Angstrom 3.8.13-yocto
morgaine   |  725.08 |  595.28 | homebuilt      | E6550         | 2330 | PCI 33MHz  | Gentoo 32-bit, 3.8.2, r8169
selsinork  |  945.86 |  946.38 | homebuilt      | E8200         | 2666 | PCIe X1    | 32-bit, 3.7.0, e1000

 

 

In addition to the results displayed in the table, I also ran servers (nuttcp -S) on all my boards and kicked off transfers in both directions from the x86 machine, and then followed that with board-to-board transfers just to check that the choice of clients and servers was not affecting results.  It wasn't: the results are very repeatable regardless of the choice, with throughput always limited by the slowest machine for the selected direction of transfer.  Running tests multiple times showed that variations typically held to less than 0.5%, probably a result of occasional unrelated network and/or machine activity.

 

The above measurements were performed over IPv4.  (See below for IPv6.)

 

Hint:  You can run nuttcp client commands even if a server is running on the same machine, so the most flexible approach is to execute "nuttcp -S" on all machines first, and then run client commands on any machine from anywhere to anywhere in any direction.
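
In other words (board names here are just placeholders):

     # Run once on every machine under test:
     nuttcp -S

     # Then, from any machine, test any link in either direction:
     nuttcp -t bbb       # this machine -> bbb
     nuttcp -r cubie     # cubie -> this machine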

 

Initial observations:  The great uniformity in BeagleBone network throughput (both white and Black) stands out, and is clearly not affected by CPU clock speed.  Raspberry Pi Model B clearly has a problem on transmit (now confirmed to be limited by CPU clock) --- I'll have to investigate this further after upgrading my very old Raspbian version.  And finally, my x86 machinery and/or network gear is clearly operating at far below the rated gigabit equipment speed --- this will require urgent investigation and upgrades, especially of NIC bus interfaces.

 

Confirmation or disproof of my figures would be very welcome, as would extending the tests to other boards and O/S versions.

 

Morgaine.

 

 

Addendum:  Note about maximum theoretical throughput added just above the table after analysis in thread below.


  • Former Member over 12 years ago

    cubieboard A20

     

    transmit:

    113.3673 MB /  10.07 sec =   94.4586 Mbps 12 %TX 10 %RX 0 retrans 0.56 msRTT

     

    receive:

    112.5225 MB /  10.05 sec =   93.9224 Mbps 0 %TX 30 %RX 0 retrans 0.59 msRTT

  • morgaine over 12 years ago in reply to Former Member

    Added your figures for Cubieboard A20 to the table, thanks!

     

    It's beginning to look asymptotic to around 95Mbps, which has me a bit puzzled.  I'm not at the stage of wanting to look at inter-frame gaps on the wire yet, but there may be more than meets the eye at first glance here.  After all, we know that our server sides aren't the limiting factor.

     

    selsinork wrote:

     

    [Tests in both directions] simultaneously on the cubie shows a slight drop in transmit to ~92Mbps which could be within normal measurement error range. Receive however appears to decline to ~30Mbps.

    Oh, very interesting indeed!  Unfortunately it won't be possible to simply conjure up some sort of "maximum combined Rx/Tx fabric throughput" scalar metric, because it may be the case that the fabric is contention-limited only at the highest rates of simultaneous Rx/Tx traffic --- only a family of curves is going to tell the whole story, and that's beyond the kind of measurement work I'm willing to carry out.

     

    For free, anyway.

     

    Fortunately, most loads tend to max out only one direction at a time, so our nuttcp figures for single directions are still useful.

  • Former Member over 12 years ago in reply to morgaine

    Morgaine Dinova wrote:

     

    It's beginning to look asymptotic to around 95Mbps, which has me a bit puzzled.

    I was going to suggest that we try a UDP test instead of the default TCP since it would remove some of the overheads; however, the result of that is exactly 1.0000 Mbps on both the cubie and the x86 system, so either I'm doing something wrong or there are other problems with nuttcp. Suggestions welcome.  I may try netperf later to see if I can get better results.

     

    The 95Mbps figure is fairly accurate if we assume that nuttcp is measuring payload throughput: 125Mbps raw wire rate, 4B/5B encoding, then subtract not just inter-frame gaps but also Ethernet and TCP headers. TCP will also incur some overhead due to being a reliable protocol and having to deal with sending ACKs and such like. I'll not pretend to understand how all of that works other than to know that it's potentially complex. UDP avoids most of that since it's send-and-forget.

     

    Someone else already did the numbers for us http://sd.wareonearth.com/~phil/net/overhead/

  • morgaine over 12 years ago in reply to Former Member

    Excellent link, thanks selsinork!  I'll extract a few lines of most relevance here, leaving out the 802.1q lines for brevity as VLANs can be avoided when benchmarking.  I've highlighted some fields for reference below.

     

    From http://sd.wareonearth.com/~phil/net/overhead/, assuming no VLANs:

     

    TCP over Ethernet:
         Assuming no header compression (e.g. not PPP)
         Add 20 IPv4 header or 40 IPv6 header (no options)
         Add 20 TCP header
         Add 12 bytes optional TCP timestamps
         Max TCP Payload data rates over ethernet are thus:
              (1500-40)/(38+1500) = 94.9285 %  IPv4, minimal headers
              (1500-52)/(38+1500) = 94.1482 %  IPv4, TCP timestamps
              (1500-60)/(38+1500) = 93.6281 %  IPv6, minimal headers
              (1500-72)/(38+1500) = 92.8479 %  IPv6, TCP timestamps

              (9000-40)/(38+9000) = 99.1370 %  Jumbo IPv4, minimal headers
              (9000-52)/(38+9000) = 99.0042 %  Jumbo IPv4, TCP timestamps
              (9000-60)/(38+9000) = 98.9157 %  Jumbo IPv6, minimal headers
              (9000-72)/(38+9000) = 98.7829 %  Jumbo IPv6, TCP timestamps

    UDP over Ethernet:
         Add 20 IPv4 header or 40 IPv6 header (no options)
         Add 8 UDP header
         Max UDP Payload data rates over ethernet are thus:
              (1500-28)/(38+1500) = 95.7087 %  IPv4
              (1500-48)/(38+1500) = 94.4083 %  IPv6

              (9000-28)/(38+9000) = 99.2697 %  Jumbo IPv4
              (9000-48)/(38+9000) = 99.0485 %  Jumbo IPv6

    Theoretical maximum UDP throughput on GigE using jumbo frames:
              (9000-20-8)/(9000+14+4+7+1+12)*1000000000/1000000 = 992.697 Mbps

    Theoretical maximum TCP throughput on GigE without using jumbo frames:
              (1500-20-20-12)/(1500+14+4+7+1+12)*1000000000/1000000 = 941.482 Mbps

    Theoretical maximum UDP throughput on GigE without using jumbo frames:
              (1500-20-8)/(1500+14+4+7+1+12)*1000000000/1000000 = 957.087 Mbps

     

    Because the per-frame overhead (preamble, Ethernet header, FCS and the interframe gap, 38 bytes in total) is the same at both 100Mbps and 1Gbps, the percentage calculations apply identically at both speeds when determining the maximum payload rate, i.e. one just has to apply the percentage to the data bitrate of the link.  This is clear from the fact that, taking TCP as an example, the cited 941.482 Mbps is just 94.1482 % of 1Gbps.  Likewise, at 100Mbps the corresponding maximum payload rate of TCP would be 94.1482 Mbps.
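
    As a quick sanity check of that scaling argument (plain arithmetic, nothing nuttcp-specific):

     # Max TCP payload fraction with timestamps = (1500-52)/(1500+38)
     awk 'BEGIN { eff = (1500-52)/(1500+38);
                  printf "%.4f Mbps at 100 Mbps\n", eff*100;
                  printf "%.3f Mbps at 1 Gbps\n",   eff*1000 }'
     # prints 94.1482 Mbps at 100 Mbps and 941.482 Mbps at 1 Gbps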

     

    (If examining the traffic on the wire, we would have to consider the higher symbol rate on the physical link which for 4B/5B encoding is 125% of the data bitrate.  However, this doesn't matter at the layer above on each host, since the data bitrate of a "100Mbps Ethernet" really is 100Mbps after decoding.)

     

    So, our measurements of around 94.1 Mbps seem to indicate that the BBB and Cubieboard2 reach their theoretical limit of performance for "IPv4, TCP timestamps" over 100Mbps Ethernet, although the absolute maximum of 94.9285 Mbps for "IPv4, minimal headers" still remains to be reached.  (Of course, we'll have to check whether TCP timestamps are actually being sent to understand the results fully.)  Your measurements of 94.46Mbps for Cubieboard2 Tx and 946.38Mbps for Asus Tx show that this very highest limit of throughput is being approached.

     

    So, a very good result!

     

     

    PS. I've placed a one-line note about the 94.9285 Mbps limit just above the table to save wear and tear on eyeballs.

  • Former Member over 12 years ago in reply to morgaine

    Morgaine Dinova wrote:

     

    Cubieboard2 reach their theoretical limit of performance for "IPv4, TCP timestamps" over 100Mbps Ethernet, although the absolute maximum of 94.9285 Mbps for "IPv4, minimal headers" still remains to be reached.

    On the cubieboard, cat /proc/sys/net/ipv4/tcp_timestamps shows 1, while on my x86 systems it's 0.  So, as the server end is running on the x86, some disparity between TX & RX is certainly possible simply due to there being flaws in our methods - the two ends have different settings.

     

    Theoretical max and reality are always going to disagree in one way or another, and we could spend weeks trying to work out a 3Kbps difference. Probably not worth it.

     

    As an aside, my cubie2 results have a router, a NAT layer, and tagged VLANs on the link from router to switch all sitting in between server and client. So, given the additional overheads, it's still an impressive enough result.

  • morgaine over 12 years ago in reply to morgaine

    I wrote:

    (Of course, we'll have to check whether TCP timestamps  are actually being sent to understand the results fully.)

    The first step towards that is to check whether the systems under test have TCP timestamps enabled.  That's easy to discover on any normal Linux system by executing:

     

    cat  /proc/sys/net/ipv4/tcp_timestamps

    All of my machines have TCP timestamps enabled (value 1), including the four ARM boards I tested.  It seems to be the default in Linux.  So, our limiting TCP throughput is 94.1482 Mbps.  (The final confirmation will be when we see TCP timestamps in Wireshark.)
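
    (For anyone who wants to equalise the two ends, the setting can also be flipped at runtime on a stock Linux kernel; the change is not persistent across reboots unless added to sysctl.conf.)

     # As root, on either end:
     sysctl -w net.ipv4.tcp_timestamps=0    # disable TCP timestamps
     sysctl -w net.ipv4.tcp_timestamps=1    # enable TCP timestamps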

     

     

    Addendum.  Oops, nice overlap between our two posts there.

     

    It's interesting to hear that one of your x86 machines does not  have it enabled.  That's easily remedied, but unfortunately it means that we can't be sure whether your previous results included timestamps or not.  Clearly they could be enabled everywhere or disabled everywhere for the simplest setup.  The other detail is more troubling:  because you're using VLAN tagging, your throughput limits will be different yet again, 93.9040 Mbps if TCP timestamps are being sent.  I've added "VLAN" under "Limits" for Cubieboard2.

     

    Addendum 2.  Although the article you linked does not specify throughput for the case of 802.1q without TCP timestamps, the missing lines are easy to calculate:

     

    TCP
        (1500-40)/(42+1500) = 94.6822 %  802.1q, IPv4, without TCP timestamps
        (1500-60)/(42+1500) = 93.3852 %  802.1q, IPv6, without TCP timestamps
        (9000-40)/(42+9000) = 99.0931 %  Jumbo 802.1q, IPv4, without TCP timestamps
        (9000-60)/(42+9000) = 98.8719 %  Jumbo 802.1q, IPv6, without TCP timestamps

    So, it seems that 94.6822 Mbps and 946.822 Mbps will be the limiting throughputs for the non-timestamping x86 server on your VLAN.
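
    The same arithmetic is easy to parameterise if anyone wants to check further header combinations (a small sketch; the per-frame overheads, 38 bytes for plain Ethernet and 42 bytes with an 802.1q tag, and the header sizes are those from the linked page):

     # payload % = (MTU - IP/TCP headers) * 100 / (MTU + per-frame overhead)
     awk 'BEGIN {
         printf "%.4f %%  802.1q, IPv4, no TCP timestamps\n",   (1500-40)*100/(1500+42);
         printf "%.4f %%  802.1q, IPv4, with TCP timestamps\n", (1500-52)*100/(1500+42);
     }'
     # prints 94.6822 % and 93.9040 %, matching the figures quoted earlier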

  • Former Member over 12 years ago in reply to morgaine

    Morgaine Dinova wrote:

     

    All of my machines have TCP timestamps enabled (value 1), including the four ARM boards I tested. It seems to be the default in Linux.

     

    [...]

     

    It's interesting to hear that one of your x86 machines does not  have it enabled. 

    Actually, none of my x86 systems have it enabled... remember when I said that I have mostly self-compiled systems?  It's likely I make different choices from the distro default of "turn on all the crap & bloat".  Enabling something that steals my bandwidth doesn't sound like me.

     

    That's easily remedied, but unfortunately it means that we can't be sure whether your previous results included timestamps or not. 

    I'd suggest they do not include timestamps, after reading RFC 1323, but I don't know enough about the implementation detail to be sure. I can put Wireshark in the middle and find out, though.

    I'll redo the tests with both client and server on the same network to remove the VLAN, NAT and timestamp stuff when I get time.

     

    The current setup is mainly down to the main machines having public IPs, but not having enough free for the growing collection of ARM devices, which therefore have to sit behind NAT. It'll take a bit of messing around to get a capable machine onto the same LAN without other complications like Open vSwitch making interpreting the results even more interesting.

     

    The x86 to x86 results are good as those have uncomplicated networking, but should be taken as having timestamps disabled.

  • morgaine over 12 years ago in reply to Former Member

    selsinork wrote:

     

    Actually, none of my x86 systems have it enabled... remember when I said that I have mostly self-compiled systems?  It's likely I make different choices from the distro default of "turn on all the crap & bloat".

     

    Networking aside, it is indeed a very bad trend.  Although I don't use "desktop" Linux distributions myself, it's very sad to see major players like Gnome, KDE and Ubuntu taking Linux in the opposite direction from being slim and functional by composition to being fat and dysfunctional through adding layers upon layers and forever feeding the GUI monster.  Open software designers seem to have lost all esteem for inherent power-by-design, and now worship the power-by-adding-features meme common in other operating systems instead.

     

    The end result of this mess is not only bloat, but also something far worse --- an explosion of dependencies.  Adding features to applications very commonly brings in yet another suite of libraries, and so the dependency tree grows and grows and dependency management becomes ever more problematic.  At the end of this road lies a future of "everything depends on everything else".  Although Gentoo automates dependency management, it doesn't try to hide away the very strong smell of this malaise, so I see the signs of impending calamity every few months on upgrades.  Software is heading along a road full of potholes and ending in a precipice.

  • Former Member over 12 years ago in reply to morgaine

    Morgaine Dinova wrote:

     

    and now worship the power-by-adding-features meme common in other operating systems instead.

    Useful features are one thing; however, the trend seems to be for things that nobody needs, wants, or will ever have any use for.

    But let's not get too off-topic; we could discuss just this aspect for weeks.
