element14 Community · Single-Board Computers · Forum

Parallella $99 board now open hardware on Github

Tags: zynq, xilinx, parallella, epiphany, cortex-a9, adapteva, arm

morgaine over 12 years ago

It's probably spreading everywhere like wildfire, but I just read on Olimex's blog that Adapteva's Parallella Kickstarter board now has almost all of its development materials on Github in the Parallella and Adapteva repos, and is officially being launched as open hardware.

 

The 16-core board is priced at US$99 and its host ARM is a dual-core Cortex-A9 (Xilinx Zynq 7010 or 7020).  It comes with 1GB DDR3, host and client USB, native gigabit Ethernet and HDMI, so at that price this would be a fairly interesting board even without its 16-core Epiphany coprocessor.  (There's a 64-core version planned too.)  For more details see the Parallella Reference Manual.

 

This has all the makings of a pretty fun board.  I hope Element 14 has one eye open in that direction.

 

Morgaine.

 

 

PS. Note the 4 x Parallella Expansion Connectors (PEC) on the bottom of the board, illustrated on page 19 of the manual and documented on page 26.  They look very flexible for projects, providing access to both Zynq and Epiphany resources.


Top Replies

  • michaelkellett over 11 years ago in reply to johnbeetem (+2)
    I wonder why in these discussions so many people overlook Lattice. Easily the most fun FPGA company and they DO have FPGAs in phones. Their Ultra Low Density approach fits well with John's definition of…
  • Former Member over 12 years ago (+1)
    Morgaine Dinova wrote: PS. Note the 4 x Parallella Expansion Connectors (PEC) on the bottom of the board, illustrated on page 19 of the manual and documented on page 26. They look very flexible for projects…
  • morgaine over 12 years ago in reply to Former Member (+1)
    selsinork wrote: I've wondered about these for a while.. 16 or 64 cores of a specialised processor that probably can't run linux or other general purpose OS makes it highly niche. If they sell many of…
Parents
  • morgaine over 11 years ago

    Although Adapteva are still fulfilling their Kickstarter commitment, their shop is already open for preorders of the 16-core Epiphany board for November delivery.  Three options appear to be available:

     

     

    Board Model     GPIO        Xilinx Device   Price
    Parallella-16   No GPIO     Zynq-7010       $99
    Parallella-16   With GPIO   Zynq-7010       $119
    Parallella-16   With GPIO   Zynq-7020       $199

     

     

    If "No GPIO" means none, zero, zilch, that doesn't appear very enticing, I must say.  If this describes the situation accurately, the range of application of the basic board will be a lot narrower than expected.  And if the Zynq-7020-based Parallella-16 costs $199, then the price of the Parallella-64 is probably going to be very unfriendly.

  • Former Member over 11 years ago in reply to morgaine

    Morgaine Dinova wrote:

     

    If "No GPIO" means none, zero, zilch, that doesn't appear very enticing, I must say.  If this describes the situation accurately, the range of application of the basic board will be a lot narrower than expected.  And if the Zynq-7020-based Parallella-16 costs $199, then the price of the Parallella-64 is probably going to be very unfriendly.

    Given there's an 'optional upgrade' for the GPIO connectors, it seems likely that the difference is simply down to installing the connectors.  Any volunteers to hand-solder four of those?

     

    In some ways you can see the reasoning: not having them will not prevent you from doing software things on the Epiphany processor.  If you really want GPIO, and don't care so much about the Epiphany, there are probably better boards.

     

    Am I correct in thinking that the only difference between the 7010 and the 7020 is more FPGA space?  If so, what's this board really meant to be: a dev board for parallel processing on the Epiphany, or an FPGA dev board?

  • morgaine over 11 years ago in reply to michaelkellett

    Michael Kellett wrote:

     

    The Epiphany is a sort of co-processor - it doesn't have peripherals of its own so it's always likely to need glue logic to fit it into a system that does anything useful. The really nice thing about the Zynq is the tightness and quality of the coupling between the FPGA and the ARM cores - nothing else (that I know of) comes close. So if you want the Epiphany to do the hard work for an ARM the Zynq is about the best solution on offer (in terms of performance) so it's a reasonable choice if the main goal is to show off the E at its best.

     

    But the aim is not to accelerate one specific ARM SoC.  Adapteva shouldn't care about one ARM or another, but only about their Epiphany device, because that is where their fortunes lie.  The Zynq is really only a cost and a burden on the road to promoting Epiphany, assuming that other reasonable options are available.  That is the question I am considering.  It's an engineering mistake to say "No" in advance of knowing the interfacing and throughput requirements, just because Zynq is a leader in ARM-FPGA integration.

     

    The FPGA in the Zynq undoubtedly has a maximum throughput vastly exceeding that of the ARM cores, based on our background knowledge of typical FPGAs.  This means that the Zynq's dual Cortex-A9 cannot be anywhere near optimum for feeding data through the interface at the highest rate the FPGA can probably sustain.  Because of the Zynq's AXI Bus (shown on the whitepaper I linked in post #24), the Zynq is probably very efficient at this, but in the end the data is still being generated by a pair of lowly Cortex-A9 cores clocked at 800MHz.  The AXI Bus reduces bottlenecks but it can't speed up the ARMs.

     

    To turn this into an engineering analysis, what we need to know are the Epiphany's interfacing abilities and maximum I/O throughput.  For example, if it can be fed only by a single external data source at a time then a dual-core host will not improve matters (other than by being able to dedicate a core to that task).  At the other end of the spectrum, if all 12 of Epiphany-16's boundary cores can be fed simultaneously then clearly a dual-core host is barely going to scratch the surface of maximum data throughput.  In addition, and orthogonal to the issue of Epiphany's I/O parallelism, one also needs to know the maximum rate at which external data can be fed into Epiphany over each path --- if it's less than a Cortex-A9 core can deliver or significantly greater then there is no particular benefit in using this particular host processor from a throughput perspective.  It's all in the details, and can't be judged in advance of knowing them.

     

    On top of all the above, MIMD multiprocessors are notorious for having a throughput that is completely determined by the running application, as I know from personal experience.  First of all this is a function of the available parallelism in the problem, secondly it is strongly influenced by the details of the software implementation, and thirdly it is at the mercy of communication, synchronization, and read and write throughput within the array.  The combination of these will unavoidably mean that the chosen host is optimum for only a tiny fraction of the very broad range of problems to which Epiphany can be applied.  "Zynq is best" would be an unjustified statement.

     

    And finally, many compute-bound problems require a large amount of MIMD processing but only occasional communication outside of the processing array, and for these even a Pi, BBB, or Arduino could suffice as host.  The sheer number of these boards in circulation would make Epiphany an overnight success if the approach taken had been to create simple daughterboards rather than the approach taken with Parallella.

     

    ===

     

    Addendum:  Looking at the Epiphany E16G301 datasheet, page 1 shows that each of the four eLinks connects directly to the four cores on that side of the array, so it appears my guess was right that the 12 boundary cores around the periphery of the array have direct connections brought out on the BGA (assuming that the diagram reflects physical reality, of course).  The internal eMesh network allows any core to be reached from any eLink, but the cores that are directly connected have the fastest access, whereas the others have to be routed from core to core internally, which is slower.  The diagram on page 6 shows the maximum throughput of each eLink to be 2 GB/s, and hence the maximum aggregate external throughput of the chip is 8 GB/s.  (The Features bullet-point list actually says 6.4 GB/s, so maybe the 8 GB/s figure includes framing overheads.)

     

    I'm continuing to read the docs.  My initial gut feeling is that two 800MHz ARM cores have no chance of feeding the four eLinks at max Epiphany data rate, and therefore the choice of Parallella host SoC doesn't have the aim of doing that.  To keep the Epiphany boundary cores from going hungry in a communications-bound application probably requires going off-board and hooking up to other Epiphany devices in a cluster.
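
    As a rough sanity check on that gut feeling, here's a back-of-envelope comparison in C.  The eLink numbers are the datasheet figures quoted above; the host-side figure is an assumed ballpark for a dual Cortex-A9 with a 32-bit DDR3 interface, not a measured Parallella value.

        /* Back-of-envelope check: can a dual Cortex-A9 host feed all four
         * Epiphany eLinks at full rate?  The host bandwidth is ASSUMED. */
        #include <stdio.h>

        int main(void)
        {
            const double elink_rate    = 2.0;  /* per-eLink max, GB/s (datasheet) */
            const int    num_elinks    = 4;
            const double elink_raw     = elink_rate * num_elinks;  /* 8 GB/s */
            const double elink_payload = 6.4;  /* "Features" list figure, GB/s */

            /* ASSUMED: effective memory bandwidth of a dual Cortex-A9 host
             * with a 32-bit DDR3 interface, after overheads. */
            const double host_rate = 3.0;

            printf("eLink aggregate (raw):     %.1f GB/s\n", elink_raw);
            printf("eLink aggregate (payload): %.1f GB/s\n", elink_payload);
            printf("assumed host bandwidth:    %.1f GB/s\n", host_rate);
            printf("host covers roughly %.0f%% of the payload rate\n",
                   100.0 * host_rate / elink_payload);
            return 0;
        }

    Even with a generous host figure, the ARM side covers well under half of the eLink payload rate, which is the point being argued above.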

  • morgaine over 11 years ago in reply to johnbeetem

    John Beetem wrote:

     

    I'm actually quite amused (not ROFL -- more like rocking back and forth holding my knees chuckling) at the idea that US$99 is expensive for a Zynq board in 2013.

     

    That's where selsinork's excellent question from post #18 comes in:

     

    selsinork wrote:

     

    what's this board really meant to be: a dev board for parallel processing on the Epiphany, or an FPGA dev board?

     

    Well?  Sure, we like the idea of a Zynq board for $99, but that is most definitely not the point for Adapteva.

     

     

    (PS.  An FPGA board without GPIOs is usually about as useful as a bicycle to a fish except when used as a pure host accelerator, so more generally one should really say "a Zynq board for $119", the mid-price option that provides GPIOs since the $99 board does not.  It's sensible to say "an Epiphany board for $99" though, because the Epiphany array is fully and effectively usable even without the GPIO option.)

  • johnbeetem over 11 years ago in reply to morgaine

    Morgaine Dinova wrote:

     

    (PS.  An FPGA board without GPIOs is usually about as useful as a bicycle to a fish except when used as a pure host accelerator, so more generally one should really say "a Zynq board for $119", the mid-price option that provides GPIOs since the $99 board does not.  It's sensible to say "an Epiphany board for $99" though, because the Epiphany array is fully and effectively usable even without the GPIO option.)

    The US$99 board gives me the option of using lower-cost GPIO connectors if I don't need the speed of the Parallella's usual Samtec BSH-030-01-FDA sockets, or I want other options.  Or I can just populate the one that's connected to Zynq and leave the others unpopulated, saving 75% of the part cost.  Always nice to have options.

  • morgaine over 11 years ago in reply to johnbeetem

    John Beetem wrote:

     

    The US$99 board gives me the option of using lower-cost GPIO connectors if I don't need the speed of the Parallella's usual Samtec BSH-030-01-FDA sockets, or I want other options.  Or I can just populate the one that's connected to Zynq and leave the others unpopulated, saving 75% of the part cost.  Always nice to have options.

     

    Ah that's good to hear.  So it seems the "No GPIO" in the dropdown list for the $99 version doesn't really mean what it says, fortunately.  That's a relief, and I'm sure not only to me.

  • johnbeetem over 11 years ago in reply to morgaine

    Morgaine Dinova wrote:

     

    The FPGA in the Zynq undoubtedly has a maximum throughput vastly exceeding that of the ARM cores, based on our background knowledge of typical FPGAs.  This means that the Zynq's dual Cortex-A9 cannot be anywhere near optimum for feeding data through the interface at the highest rate the FPGA can probably sustain.  Because of the Zynq's AXI Bus (shown on the whitepaper I linked in post #24), the Zynq is probably very efficient at this, but in the end the data is still being generated by a pair of lowly Cortex-A9 cores clocked at 800MHz.  The AXI Bus reduces bottlenecks but it can't speed up the ARMs.

    I believe you can also use the FPGA fabric to talk to the system buses directly without going through the ARM cores, i.e., the FPGA can DMA to shared DRAM and also to peripheral devices like Gigabit Ethernet.  This way you can build very high performance data processing beyond what an ARM core can handle, and basically use the ARM to run control software with modest processing requirements.
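
    For flavour, a minimal Linux sketch of that pattern from the ARM side: the fabric fills a DRAM buffer by DMA and the ARM merely maps it and reads the results.  The physical address and size below are hypothetical placeholders, not real Parallella or Zynq assignments.

        /* Minimal sketch: the ARM inspects a DRAM buffer that FPGA logic
         * fills by DMA, so the CPU stays out of the data path.
         * PL_BUF_PHYS and PL_BUF_SIZE are HYPOTHETICAL placeholders. */
        #include <stdio.h>
        #include <stdint.h>
        #include <fcntl.h>
        #include <unistd.h>
        #include <sys/mman.h>

        #define PL_BUF_PHYS 0x18000000UL  /* placeholder physical address */
        #define PL_BUF_SIZE 0x00100000UL  /* 1 MiB */

        int main(void)
        {
            int fd = open("/dev/mem", O_RDONLY | O_SYNC);
            if (fd < 0) { perror("open /dev/mem"); return 1; }

            volatile uint32_t *buf = (volatile uint32_t *)
                mmap(NULL, PL_BUF_SIZE, PROT_READ, MAP_SHARED, fd, PL_BUF_PHYS);
            if (buf == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

            /* The fabric streams results here; the ARM just reads them out. */
            printf("first word of PL-filled buffer: 0x%08x\n", buf[0]);

            munmap((void *)buf, PL_BUF_SIZE);
            close(fd);
            return 0;
        }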

  • morgaine over 11 years ago in reply to johnbeetem

    John Beetem wrote:

     

    I believe you can also use the FPGA fabric to talk to the system buses directly without going through the ARM cores, i.e., the FPGA can DMA to shared DRAM and also to peripheral devices like Gigabit Ethernet.

     

    While true, the Zynq doesn't have a monopoly on DMA.  Any reasonable ARM system can be expected to feed its DMA controllers at close to memory rate, and even Cortex-M* microcontrollers commonly feature crossbar-type internal buses so that different types of data transfer can occur in parallel and DMA controllers aren't starved by bus arbitration.  In other words, far cheaper ARMs could keep the Epiphany eLinks equally busy through DMA.

     

    Regarding Ethernet, that really comes down to DMA again.  There is no room in Epiphany core local memory (just 32KB per core) for full TCP/IP stacks, so the host will have to handle the networking, extract the data out of the protocol payload, and DMA can then fish it out of memory for feeding Epiphany.  But again, the Zynq doesn't have any special advantage for this since gigabit MACs are quite common in modern ARM SoCs (less so gigabit PHY, sadly).
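
    As a toy illustration of that division of labour, here's the host side terminating UDP and handing only the extracted payload onward.  The enqueue_for_dma() call is a hypothetical stand-in for whatever board-specific driver would queue a buffer for the engine feeding an eLink.

        /* Toy sketch: the ARM runs the IP stack and hands only extracted
         * payloads to the accelerator's DMA path.
         * enqueue_for_dma() is a HYPOTHETICAL stand-in for a driver call. */
        #include <stdio.h>
        #include <sys/types.h>
        #include <sys/socket.h>
        #include <netinet/in.h>
        #include <arpa/inet.h>
        #include <unistd.h>

        static void enqueue_for_dma(const char *buf, ssize_t len)
        {
            /* placeholder: a real driver would queue this buffer for DMA */
            printf("queued %zd bytes for the accelerator\n", len);
        }

        int main(void)
        {
            int s = socket(AF_INET, SOCK_DGRAM, 0);
            if (s < 0) return 1;

            struct sockaddr_in addr = { 0 };
            addr.sin_family      = AF_INET;
            addr.sin_port        = htons(5000);
            addr.sin_addr.s_addr = htonl(INADDR_ANY);
            if (bind(s, (struct sockaddr *)&addr, sizeof addr) < 0) return 1;

            char buf[2048];
            for (;;) {
                ssize_t n = recv(s, buf, sizeof buf, 0);  /* kernel strips headers */
                if (n > 0)
                    enqueue_for_dma(buf, n);              /* payload only */
            }
        }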

  • johnbeetem over 11 years ago in reply to morgaine

    Morgaine Dinova wrote:

     

    Regarding Ethernet, that really comes down to DMA again.  There is no room in Epiphany core local memory (just 32KB per core) for full TCP/IP stacks, so the host will have to handle networking, extract the data  out of the protocol payload, and DMA can then fish it out of memory for feeding Epiphany.  But again, the Zynq doesn't have any special advantage for this since gigabit MACs are quite common in modern ARM SoCs (less so gigabit PHY, sadly).

    You could probably do wire-speed TCP/IP in the FPGA fabric, using block RAM for table look-up.

     

    How's the power consumption for GBE PHYs these days?  Maybe it's better to leave them off the SoC so the chips don't get too hot.

  • morgaine over 11 years ago in reply to johnbeetem

    John Beetem wrote:

     

    You could probably do wire-speed TCP/IP in the FPGA fabric, using block RAM for table look-up.

    TCP/IP protocol implemented entirely in FPGA block RAM?  You jest ... I hope.

     

    No doubt small and well-defined auxiliary functions could be implemented in the FPGA fabric as part of a TCP offload engine (which are quite common nowadays), but to implement the whole thing in hardware simply doesn't make engineering sense because most parts of TCP/IP are not in the high-speed pathway or are rarely executed.

     

    How's the power consumption for GBE PHYs these days?  Maybe it's better to leave them off the SoC so the chips don't get too hot.

     

    That was just poor phrasing on my part.  I meant that gigabit PHYs are less common on ARM boards even when the host SoC provides a gigabit MAC.  Your point about heat is a good one.  Gen0 Parallella recipients were complaining quite a lot about heat.

  • johnbeetem over 11 years ago in reply to morgaine

    Morgaine Dinova wrote:

     

    John Beetem wrote:

     

    You could probably do wire-speed TCP/IP in the FPGA fabric, using block RAM for table look-up.

    TCP/IP protocol implemented entirely in FPGA block RAM?  You jest ... I hope.

     

    No doubt small and well-defined auxiliary functions could be implemented in the FPGA fabric as part of a TCP offload engine (which are quite common nowadays), but to implement the whole thing in hardware simply doesn't make engineering sense because most parts of TCP/IP are not in the high-speed pathway or are rarely executed.

    I'm talking about the core packet processing functions like CRC checksums, port numbers, and window management.  I'm also talking about IPv4, since I don't have experience with IPv6.  However, since modern wire-speed routers are implemented in hardware, there's no reason you can't do this with a decent FPGA, since managing an end-point is a lot easier than routing.  In fact, you can buy TCP/IP IP cores for various Xilinx FPGAs.
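
    As a software analogue of that block-RAM table look-up, here's the classic table-driven CRC-32 (the polynomial Ethernet uses for its FCS).  In fabric, the 256-entry table would live in block RAM and the per-byte update would become a pipelined datapath; this C version is just to make the technique concrete.

        /* Table-driven CRC-32 (Ethernet FCS polynomial, reflected form):
         * the software analogue of a block-RAM look-up table. */
        #include <stdio.h>
        #include <stdint.h>
        #include <string.h>

        static uint32_t crc_table[256];

        static void crc32_init(void)
        {
            for (uint32_t i = 0; i < 256; i++) {
                uint32_t c = i;
                for (int k = 0; k < 8; k++)
                    c = (c & 1) ? 0xEDB88320u ^ (c >> 1) : c >> 1;
                crc_table[i] = c;
            }
        }

        static uint32_t crc32(const uint8_t *buf, size_t len)
        {
            uint32_t c = 0xFFFFFFFFu;
            while (len--)
                c = crc_table[(c ^ *buf++) & 0xFF] ^ (c >> 8);  /* one look-up per byte */
            return c ^ 0xFFFFFFFFu;
        }

        int main(void)
        {
            crc32_init();
            const char *msg = "123456789";
            /* standard check value for "123456789" is 0xCBF43926 */
            printf("CRC-32 = 0x%08X\n", crc32((const uint8_t *)msg, strlen(msg)));
            return 0;
        }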

  • morgaine over 11 years ago in reply to johnbeetem

    John Beetem wrote:

     

    However, since modern wire-speed routers are implemented in hardware, there's no reason you can't do this with a decent FPGA since managing an end-point is a lot easier than routing.

     

    I think you meant the opposite, that routing is a lot easier than managing an endpoint.  Routing in hardware needs to handle only the IP layer and can ignore all higher-level detail, which is just payload data at the IP level --- that's why good routers can route frames back-to-back even on 10gig.  The routing management protocols and ICMP only come into play at exception or change points, so that's typically left to CPUs to handle at their leisure in all but the highest end backbone routers.

     

    At the endpoints, the entire protocol stack comes into play, which is a heavy burden indeed.  TCP offload engines commonly dedicate an embedded CPU to the task rather than hardware, although as we both mentioned, simple functions like CRC are very commonly implemented in hardware, often as a dedicated instruction in the SoC.

     

    That iTOE Verilog for Virtex and Spartan doesn't look like open source to me.

Children
  • michaelkellett over 11 years ago in reply to morgaine

    @Morgaine and John,

     

    I haven't quite got to full TCP/IP in the FPGA yet, but I'm currently using a Lattice ECP3 to generate multi-fragment UDP packets sent out at wire speed over GBE (external Marvell PHY, and it runs pretty hot - which answers another question).  I can't see that it would ever make sense to do all of the work in the FPGA - things like ARP don't need that kind of speed.

    I would love to get my teeth into some TCP acceleration in the FPGA, but it is very expensive in terms of development time, and we have already hit issues with common GBE network components (like switches) which can't actually handle wire-speed data unconditionally - and the conditions are not well specified.

    One of the problems you hit with sharing the network stack between processor and FPGA is that you end up writing the entire stack, parts in C and parts in VHDL or Verilog - that's why so far we've kept our end very simple, with support for UDP, IP, ARP and not much else.

    The PHY uses more power than the FPGA and the processor (STM32F407).

     

    MK
