Forum Thread Details
  • Replies: 96
  • Subscribers: 682
  • Views: 12,127

Raspberry Pi server clusters

morgaine over 13 years ago

One of my current intentions is to play with server clustering once the Raspberry Pi is in volume production and the 1-per-person restrictions are lifted.  I have a long-term background in parallelism and concurrency --- my doctoral research was in the topic, and I lectured on it later as well, so it's quite dear to my heart.  The very low price of the board makes this feasible with a monetary outlay far below anything else, so I'm really looking forward to an Rpi clustering project.

 

I'm sure that I'm not the only one thinking about Rpi+clustering.  If anyone here has this kind of application in mind, or just general interest in the subject, please keep in touch and post any interesting links you may find on the topic.  Once there are millions of the boards around, this could be a very popular area.

 

Morgaine.
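A minimal sketch of the work-farming pattern such a cluster invites, with the "nodes" simulated by local worker threads (on a real RPi cluster each unit of work would be shipped to a board over MPI or a plain socket instead):

```python
# Toy master/worker demo of the work-farming pattern a Pi cluster invites.
# The "nodes" here are local threads; a real cluster would replace the
# pool with connections to each board.
from concurrent.futures import ThreadPoolExecutor

def crunch(n):
    """Stand-in for a unit of cluster work: sum of squares below n."""
    return sum(i * i for i in range(n))

def farm_out(jobs, workers=4):
    """Distribute jobs across workers and gather results in job order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(crunch, jobs))

print(farm_out([10, 100, 1000]))
```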

  • johnbeetem over 13 years ago

    AMD is planning to make 64-bit ARMs for servers.

     

    From ZDNet:

     

    AMD has announced that it is teaming up with ARM to develop 64-bit ARM processors for servers to meet growing challenges for data centers. "AMD will transform the computing data center environment today," said AMD CEO and president Rory Read during a press conference on Monday afternoon, asserting that AMD will be the first company to offer both 64-bit ARM and x86 server processors.

  • morgaine over 13 years ago in reply to johnbeetem

    More interesting news in this area:

     

    • Samsung may start making ARM server chips  [slashdot]
    • Samsung laying groundwork for server chips, analysts say  [computerworld]

     

    One thing that surprises me is that Intel aren't building up a server market presence based on multiple clustered Atom chips.  Indeed, Atom seems to be almost a stealth product for them, very low key, and that's pretty odd when the future clearly forecasts competition in performance per watt from ARM.

     

    Morgaine.

  • morgaine over 13 years ago in reply to johnbeetem

    On our earlier topic of ARM versus Atom, this comparison of a new Cortex-A15 versus an Atom from 2011 is rather eye-opening --- http://www.anandtech.com/show/6422/samsung-chromebook-xe303-review-testing-arms-cortex-a15/ .

     

    Executive summary:  ARM wins on idle, but consumption is in the same ballpark for both when running flat out.  The performance figures favour ARM in this comparison, although one should bear in mind that the Atom in question was an old one.

  • Former Member over 13 years ago in reply to morgaine

    Morgaine Dinova wrote:

     

    The CEO of ARM, Warren East, says in an interview at http://www.technologyreview.com/news/507116/moores-law-is-becoming-irrelevant/ :

     

    "To me a PC is really just a smartphone in another form factor. [cut] TVs are the same. TVs are big smartphones. Computers are kind of medium smartphones."

     

    I quote it mainly because it made me chuckle, and although it's to be expected that the ARM CEO would say such things, there's quite a lot of truth in it as well.  Computers are intrinsically the same, whatever the niche.  And as he says later in the interview, ARM certainly wasn't designed expressly for smartphones.

     

    I just wish ARM would do something a little more explicit in the direction that their heads regularly speak about.  Without cluster interconnect becoming available as an optional but integral part of the ARM architecture so that we don't have a Tower of Babel of incompatible interconnects, ARM-based servers will have a hard time becoming ubiquitous.

     

    Morgaine.

     

    I hope that our Warren has his tongue embedded firmly in his cheek, or perhaps he's only concerned with his particular corner of the hardware world.  Computer = smartphone = telly?  Hmmm... perhaps in consumerland, where its only real tasks are to give access to media, "rich web content" (whatever that is), adverts, spam, online shopping, more spam, and then to become obsolete just in time for next-gen tech, then maybe so.  But for folks like me, who only really tolerate computers because they are good at doing hard sums very quickly, I fear he's talking cobblers.

     

    If ARM is to become ubiquitous then it will have to offer a bit more than low power (in terms of Watts and flops) at bargain bucket prices. It's a bit of a chicken and egg scenario, where potential adopters don't bite unless they are confident about format longevity and future legacy support (a non-consideration with consumer devices, but essential in industry). Industrial software types may similarly balk at turning out high value, low volume product for a platform that's "not quite done yet" - especially as not all ARM hardware is created equal... The chip makers themselves probably aren't going to toss in features that are currently seen as niche in the hopes of attracting a few customers when consumer grade whatnot and low cost high volume embedded applications are ticking along quite nicely. Oh, I forgot the need for a fit-and-forget operating system that software and hardware manufacturers will have enough confidence in to universally support.

     

    The trick will be to nudge things over that R0>1 tipping point, but there are a bazillion little details (and one big roadmap) to finalise first.

  • morgaine over 13 years ago in reply to Former Member

    Jonathan Garrish wrote:

     

    The trick will be to nudge things over that R0>1 tipping point, but there are a bazillion little details (and one big roadmap) to finalise first.

     

    I'm glad you pointed out the little issue of roadmap, because it's a very important issue in industry despite having no importance whatsoever in the consumer gadgets sector.  Industrial and commercial players need to know that the ARM-based server that they'll buy tomorrow from Dell and others is going to have an evolutionary path for many years ahead before they'll start investing in non-x86 software.  Currently there is no indication of a concrete roadmap in that area from ARM whatsoever, AFAIK.

     

    ARM likes dropping vague hints about servers and about ARM licensees delivering the goods through competition in the market, but very oddly they totally fail to realize that they have a crucial role to play in establishing the foundations upon which a server sector will be based.  There's a lot more to it than merely defining an ISA and telling licensees to get on with it.  That doesn't inspire confidence among prospective buyers at all.

     

    To bring the aggregate performance of a server based on low-power ARM chips up to that of a modern Intel/AMD server requires a lot of cores, and ARM can't use an SMP architecture for this like Intel and AMD are currently doing.  Shared memory has extremely limited scalability, and a lot of cores would rapidly hit the ceiling even with fancy multi-level caching architectures (which introduce their own problems anyway, lots of them).
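    Amdahl's law gives one rough way to see that ceiling (a sketch with illustrative fractions; cache-coherence traffic makes the real picture worse):

```python
# Amdahl's law: the speedup ceiling when some fraction of the work stays
# serial.  Even with only 5% unparallelizable work, no number of
# shared-memory cores can push the speedup past 20x.
def amdahl_speedup(parallel_fraction, cores):
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / cores)

for cores in (4, 16, 64, 256):
    print(cores, round(amdahl_speedup(0.95, cores), 2))
```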

     

    The scalable way for ARM to go is with a clustering approach instead, using on-chip interconnect hardware for parallel communication between cores on the same chip or on the same board without distinction.  This would allow server boards to scale to an arbitrary number of cores both on-chip and on the server motherboard.  It's not rocket science either, as the transputer pioneered that architecture back in the 80's.

     

    But for that to happen, ARM needs to make the interconnect a standard feature that ARM licensees can add to their ARM SoCs, a standard feature supported by standard instructions so that we don't end up with the Tower of Babel I mentioned above.  And I don't see ARM doing anything like that yet.

     

    Morgaine.

  • morgaine over 13 years ago in reply to morgaine

    It's worth adding that such on-chip interconnect hardware would have tremendous impact far beyond the limited area of ARM server communications.  Just imagine the possibilities if your Cortex-A application processors could talk to your Cortex-M microcontrollers at gigabit rates on separate links instead of crawling along at SPI or I2C speeds on shared buses.  Suddenly a whole new class of applications becomes possible.
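    A back-of-envelope comparison, with assumed link rates (a typical 10 Mbit/s SPI clock versus a hypothetical 1 Gbit/s serial link):

```python
# Moving 1 MiB between an application processor and a microcontroller,
# comparing an assumed 10 Mbit/s SPI clock with a 1 Gbit/s serial link.
# The rates are illustrative, not measured figures.
def transfer_ms(num_bytes, bits_per_second):
    return num_bytes * 8 / bits_per_second * 1000

payload = 1 << 20                             # 1 MiB
print(round(transfer_ms(payload, 10e6), 1))   # ~839 ms over SPI
print(round(transfer_ms(payload, 1e9), 2))    # ~8.4 ms over the fast link
```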

  • michaelkellett over 13 years ago in reply to morgaine

    If the Transputer architecture was so great, how is it that there are no transputers now?

    The ideas live on in XMOS, and while they aren't bust, they are only achieving niche success on a very small scale.

     

    The reality is that parallelism at the core level is far from sorted - it isn't rocket science (we know how to make rockets).

     

    There are lots of core-level experimental parallel schemes afoot (GPUs, Greenchip, XMOS, Propeller, etc.), but none of them seem to be that compelling (except perhaps GPUs).

     

    So, since you pose the question: what are the possibilities of your Cortex-A linked to a Cortex-M that you can't do right now with the Xilinx Zynq?

     

    Michael Kellett

  • rew over 13 years ago in reply to morgaine

    Morgaine Dinova wrote:

    I'm glad you pointed out the little issue of roadmap, because it's a very important issue in industry despite having no importance whatsoever in the consumer gadgets sector.  Industrial and commercial players need to know that the ARM-based server that they'll buy tomorrow from Dell and others is going to have an evolutionary path for many years ahead before they'll start investing in non-x86 software.  Currently there is no indication of a concrete roadmap in that area from ARM whatsoever, AFAIK.

    One of the things is that for Intel the server market is an "evolution" of their existing market share. So they can plan ahead and have new processors for the server market in the pipeline.

     

    ARM, however, doesn't have a foothold in the server market. If their server-chip experiment fails, they will end up with a lot of money down the drain, and they'll have to struggle to survive.

     

    In that case, they won't continue throwing money at the dead project. So I understand that they cannot plan beyond their first server-chips.

     

    The problem is that software is SO VERY important that most likely the architecture switch won't happen.

    It has been shown time and time again that the installed-base-software-compatible processor wins. Add (slow) hardware x86 emulation support and suddenly you've got a much bigger chance of succeeding, because you provide an upgrade path for those with older software. That's what made AMD64 succeed.

     

    (When emulating another architecture, having hardware support for the basics helps a lot. We tried emulating x86 on an architecture our group designed back in the late 1980s; it turned out that 90% of the instructions were dealing with the difference in flag-setting between the emulated instructions and the native machine. Having that in hardware speeds things up enormously.)
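    As a toy illustration of that flag-setting overhead, here is what emulating a single x86-style 8-bit ADD looks like in software (a sketch, not any particular emulator's code):

```python
# Emulating one x86-style 8-bit ADD means recomputing carry, zero, sign
# and overflow in software on every instruction - work a real CPU's ALU
# does for free in hardware.
def emulate_add8(a, b):
    result = (a + b) & 0xFF
    flags = {
        "CF": (a + b) > 0xFF,          # unsigned carry out of bit 7
        "ZF": result == 0,             # result is zero
        "SF": bool(result & 0x80),     # sign bit of the result
        # signed overflow: operands share a sign the result doesn't
        "OF": bool((~(a ^ b) & (a ^ result)) & 0x80),
    }
    return result, flags

print(emulate_add8(0x7F, 0x01))        # wraps to 0x80 with OF and SF set
```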

  • johnbeetem over 13 years ago in reply to rew

    Roger Wolff wrote:

     

    The problem is that software is SO VERY important that most likely the architecture switch won't happen.

    It has been shown time and time again that the installed-base-software-compatible processor wins. Add (slow) hardware x86 emulation support and suddenly you've got a much bigger chance of succeeding, because you provide an upgrade path for those with older software. That's what made AMD64 succeed.

    Pardon me while I fire up my IBM PC XT/370

     

    Actually, these days ARM-based computing devices way outsell x86 computing devices when you include smart phones and tablets.  When Google's dual Cortex-A15 Chromebook starts shipping in earnest for US$249, the inverted pendulum will swing even further in ARM's direction.  The x86 will still have its place for people who need higher performance or where software is not available on ARM (such as FPGA design), but most people will do just fine with ARM, and ARM will take over those applications just as x86 won out over System/370 due to better price/performance and performance/watt.

     

    JMO/YMMV

  • johnbeetem over 13 years ago in reply to michaelkellett

    Michael Kellett wrote:

     

    There are lots of core-level experimental parallel schemes afoot (GPUs, Greenchip, XMOS, Propeller, etc.), but none of them seem to be that compelling (except perhaps GPUs).

     

    So, since you pose the question: what are the possibilities of your Cortex-A linked to a Cortex-M that you can't do right now with the Xilinx Zynq?

    There was a very good article in IEEE Spectrum last year by Peter Kogge on Next-Generation Supercomputers.  I found this to be the most interesting take-away:

    The good news is that over the next decade, engineers should be able to get the energy requirements of a flop down to about 5 to 10 pJ.  The bad news is that even if we do that, it won't really help.  The reason is that the energy to perform an arithmetic operation is trivial in comparison with the energy needed to shuffle the data around, from one chip to another, from one board to another, and even from rack to rack.

    This has been my experience as well: building high-throughput processing engines is easy.  The difficult part is getting operands to them so they can do useful work.  This is why DSPs have specialized high-speed multi-port memories, but they're small and only work for small data blocks that get reprocessed many times.  A GPU that acts as a SIMD pipeline is also very effective for some applications.  But you can't expect a general application to get much sustained performance without a lot of work to make it fit the parallelism of the hardware.  It's easy to get peak performance, defined by a wag as "a guarantee from the manufacturer that you won't go faster than this".
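    Kogge's point can be put in rough numbers (the figures below are illustrative assumptions, not taken from the article):

```python
# A flop at ~10 pJ versus hauling its 8-byte operand in from off-chip
# memory at an assumed ~20 pJ per bit: the data movement dwarfs the
# arithmetic by two orders of magnitude.
FLOP_PJ = 10                  # assumed energy per arithmetic operation
DRAM_PJ_PER_BIT = 20          # assumed off-chip access energy per bit

operand_pj = 8 * 8 * DRAM_PJ_PER_BIT    # one 64-bit operand fetched
print(operand_pj / FLOP_PJ)             # movement costs ~128 flops' worth
```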

     

    IMO the obvious solution is to design the parallel processor's architecture to match the parallelism of the application.  If the application is a good match to GPUs, use GPUs.  If it's a good match to FPGAs and their huge amount of processing (provided that you can get operands to the processing elements), use FPGAs.  However, there's a big non-technical problem: GPU and FPGA vendors won't give you direct access to their architectures, so work in using these incredibly powerful engines for parallel processing is advancing very slowly.  It's much easier just to network up a bunch of high-end x86 CPUs and pay the electric bill.

     

    We could do a hell of a lot with a Xilinx Zynq -- if we could program the logic array directly.

  • morgaine over 13 years ago in reply to michaelkellett

    Michael Kellett wrote:

     

    If the Transputer architecture was so great how is it that there are no transputers now?

     

    The transputer didn't take off back in the 80's simply because it was way ahead of its time, and there was not yet any need for a solution to the problem that it solved.  The evolution of the single core microprocessor still had decades of opportunity ahead of it.  Increasing the clock speed of a CPU was a comparatively trivial method of increasing its performance, so single CPUs vanquished all other contenders in the industry.  No mystery there.

     

    The reality is that parallelism at the core level is far from sorted - it isn't rocket science (we know how to make rockets).

     

    That's certainly true.  I did my PhD in that very topic, parallelism and concurrency, so I had a first-hand opportunity to experience the many problems in that domain, as well as to examine the very wide spectrum of candidate solutions that people have conjured up to deal with it.  There is no shortage of solutions, but you're right that it's "far from sorted" in one particular sense --- although many candidates work just fine, no particular solution has been embraced by the world at large.  In part this is a consequence of non-SMP multicore hardware simply not being widely available yet, outside of GPUs.

     

    The bigger problem though is not with technology, but with people.  More specifically, it is a problem with people who are so attached to their beloved language that runs well only on a single core that they are not willing to face the fact that languages are just engineering tools, and you need to pick the right tool for the job or you're banging in screws with a hammer.  That message fails to get through, probably because the majority of the world's programmers are not engineers at heart but language craftsmen.  The message is unwelcome and is rejected.

     

    There are lots of core-level experimental parallel schemes afoot (GPUs, Greenchip, XMOS, Propeller, etc.), but none of them seem to be that compelling (except perhaps GPUs).

     

    GPUs offer one solution to parallelizing computation through schemes such as OpenCL, and it's certainly nice to see that concept gaining traction.  The GPU manufacturers realize and implement what the CPU manufacturers are mostly failing to embrace: that the future of computing is to employ thousands or millions or billions of cores, and in consequence your programming methodology has to change.  Actually, I bet that the CPU manufacturers do realize it, but perhaps don't know how to get past the problem of programmer inertia that I mentioned above.

     

    XMOS may be the start of something interesting, but I think their chances of survival as a minnow in a shark-infested sea are minimal.  If they get bought out by a large player eager to take on Intel and AMD then things could get very entertaining.  ARM is certainly aware of them through one of their founders, so who knows what the future holds.  Perhaps if ARM had a roadmap ... :-)

     

    Propeller is just an eclectic approach to parallelising embedded microcontrollers, and doesn't pretend to be anything else.  Although it's cute and quite effective in its domain, it doesn't offer anything for general purpose computing.

     

    So, since you pose the question: what are the possibilities of your Cortex-A linked to a Cortex-M that you can't do right now with the Xilinx Zynq?

     

    I'm guessing now, but if you meant "Can't the Zynq's FPGA be configured to provide the hardware interconnect?" then the answer is "Yes, but only poorly".  Asynchronous serial communication at very high data rates is a task best done by dedicated silicon, and it needs to be supported by the ISA to be most effective.  The transputer got that right too, among so many other things.

     

    But the worst part of doing the interconnect in a Zynq's FPGA is that no other ARM licensee will have the technology.  Clearly that is no way to create a multi-provider ecosystem.  ARM has to define the needed foundations, ie. standard links and a standard ISA to use them.

     

    Morgaine.

  • morgaine over 13 years ago in reply to morgaine

    It's possible that those who weren't in the field at the time of the transputer might not picture the interconnect architecture being proposed above, so here's a very brief summary.

     

    Each CPU core has access to either 4 or 6 point-to-point bidirectional self-synchronizing full-duplex serial links, 4 to construct simple 2D sheets and 6 to construct simple 3D volumes (many topologies are of course possible with this number of links, but simple is best as a starting point).  These links are optimized for speed of message transfer from the core at one end of the link to the core at the other end, and work identically from a software perspective regardless of whether the messaging is between two cores on a single piece of silicon or two cores on different chips.

     

    Message transfers are handled by scatter-gather DMA controllers, and a core is not involved at all during such transfers in or out of its private memory space.  If it requires notification of completion of a transfer in either direction then the interrupt system takes care of it in the normal manner, but this isn't necessary for pure transit messages (those that are just passing through the node).  A transit message destined for a different node is automatically passed from the DMA controller on the incoming link to the DMA controller on the outgoing link, and the message just squirts through the node without bothering the CPU.  (Implicit in this is that messages carry either destination node numbers or link routing descriptors.)

     

    That's the essence of it.  The focus is very much on simplicity and speed in this interconnect, as well as standard functionality.  The single-minded goal is to get data from the private address space of one core to that of another core, quickly, in a system with an arbitrary number of cores.  Everything else is secondary.
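    A toy software model of the scheme just described, using simple dimension-order routing on a 2D grid (an illustration of the idea, not a transputer-accurate simulation):

```python
# Nodes on a 2D grid with four point-to-point links each.  A message is
# forwarded hop by hop along the x links, then the y links (the "DMA"
# transit path); only the destination node ever stores the payload.
class Node:
    def __init__(self, x, y):
        self.pos = (x, y)
        self.memory = []      # private address space: delivered payloads
        self.relayed = 0      # messages forwarded onward by this node

def route(grid, src, dst, payload):
    """Dimension-order (x-then-y) routing, one hop per step."""
    x, y = src
    while (x, y) != dst:
        grid[(x, y)].relayed += 1
        if x != dst[0]:
            x += 1 if dst[0] > x else -1    # move along an x link
        else:
            y += 1 if dst[1] > y else -1    # then along a y link
    grid[(x, y)].memory.append(payload)     # deliver into private memory

grid = {(x, y): Node(x, y) for x in range(4) for y in range(4)}
route(grid, (0, 0), (3, 2), b"hello")
print(grid[(3, 2)].memory)    # payload lands only at the destination
```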

     

    Morgaine.
