Does the compiler have memory mapped I/O - at least byte wide?
I'm not a Pi expert but I'll try to help.
The Pi hardware (in common with all similar processors) does not support memory-mapped IO in the way that, for example, an ARM Cortex-M part does. This is because it is designed to work with DDR RAM, which is hard enough to get working in the very special case of a processor or FPGA to memory interface; as far as I know no one uses the DDR RAM interface for IO. The usual solution is to provide some additional IO ports, which will require some kind of drivers to operate them. They are rarely mapped into normal memory space.
The Pi has very limited IO (Beagles are much better). It is supported by lots of low-cost IO expander boards from many sources, but they are all slow.
To be fair the PI is not alone in having this problem - the best way to get data in and out of most computers at high speed is often to use Ethernet or PCI but both require serious hardware.
It's this (hardware) limitation which allows small Cortex-M class micros to beat the ****** out of GHz application processors like the one on the Pi when it comes to low-latency fast IO. And even with Gbit Ethernet, although you can get 100 Mbytes/s in and out of the processor, the round-trip latency will be measured in µs, perhaps 100x slower than an STM32F4xx can bit-bang IO!
Roger is correct in implying that it isn't a compiler issue - you'll have the same problem with Python, C or anything else.
What are you trying to do - there are (almost) always ways round these problems.
MK
Thanks for your reply. I have been using a Crystalfontz SOM (CFA10036), which uses an ARM9 i.MX287, but the software drivers for I/O work a single bit at a time, which is rather slow for accessing an 8-bit bus. I also use the Silicon Labs ARM3 micros and the I/O is much better on these. In my naivety I assumed the BIGGER micros would be at least the same or better.
Thank you both for your help.
Kind regards
Michael Vos
I don't know too much about that Crystalfontz module, but it is most likely just like the Pi.
The CPU has a bunch of GPIO pins. They are possibly grouped by eights, sixteens or thirty-two. For each group, you can "set all outputs with one write".
GPIO->DATA_REGISTER = values;
I just checked the datasheet for you, and the Broadcom processor surprises me in that it cannot do this.
It has one register to SET and another register to RESET (clear) bits in the output register. On the one hand that might slow you down by a factor of two, but it also speeds you up by more than a factor of two, because you never need a read-modify-write to leave the other pins alone. Now to write your eight-bit bus to the value you want, you simply RESET all 8 bits of your 8-bit bus: GPIO->GPCLR0 = 0x3FC00000; and then write the 8 bits you want to the SET register: GPIO->GPSET0 = data << 22; (assuming your 8-bit bus is on GPIO 22 through 29).
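To make the clear-then-set sequence concrete, here is a minimal sketch in C. The struct `gpio_regs_t` is a hypothetical stand-in with only the two registers mentioned above; its layout does not match the real register offsets, and `bus_write`, `BUS_SHIFT` and `BUS_MASK` are names made up for illustration. It assumes the bus sits on GPIO 22 through 29 as in the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the GPIO register block. On real hardware the
   pointer would refer to the mapped peripheral and the offsets would differ;
   this layout is for illustration only. */
typedef struct {
    volatile uint32_t GPSET0;  /* writing a 1 drives that pin high */
    volatile uint32_t GPCLR0;  /* writing a 1 drives that pin low  */
} gpio_regs_t;

#define BUS_SHIFT 22u                              /* assumed: bus on GPIO 22..29 */
#define BUS_MASK  ((uint32_t)0xFFu << BUS_SHIFT)   /* 0x3FC00000 */

/* Put one byte on the 8-bit bus: clear all eight pins, then set the ones
   that should be high. Other pins are untouched because zeros written to
   the set and clear registers have no effect. */
static void bus_write(gpio_regs_t *gpio, uint8_t data)
{
    assert(gpio != NULL);
    gpio->GPCLR0 = BUS_MASK;
    gpio->GPSET0 = (uint32_t)data << BUS_SHIFT;
}
```

One thing to keep in mind with this order of writes: every bus line is briefly low between the two stores. If that glitch matters, clearing only the pins that must end up low, i.e. writing `(uint32_t)(uint8_t)~data << BUS_SHIFT` to the clear register, should avoid it.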
The biggest problem is that on the first pi, routing of the PCB influenced which GPIOs came where on the GPIO connector. Thus it might prove difficult to find 8 free GPIOs that are contiguous. So you might need to assemble a few parts of the 8 bits of data into the 32-bit write to the GPSETx register.
So on the GPIO connector we have access to: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 20, 21, 22, 23, 24, 25, 26, 27. So, you are in luck! There ARE 8-bit contiguous stretches of GPIO pins available that allow easy single-instruction access to that databus.
One thing you might run into is that some of these have a "special function". So if you need the UART as well, that's on pin 14-15, so that creates a hole there. 2-3 is I2C, 7-11 is SPI. But 20-27 remains together as a byte. It's scattered around physically on the connector, but in software you can do it in one go.
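The search for a contiguous byte can also be done mechanically. The sketch below takes the pin list and the special-function pins exactly as quoted above (I have not checked them against any particular board revision) and finds the longest run of free, consecutive GPIO numbers. The helper names `pin_range`, `free_header_pins` and `longest_run` are invented for this example.

```c
#include <stdint.h>

/* Bit mask with bits lo..hi (inclusive) set; assumes 0 <= lo <= hi <= 31. */
static uint32_t pin_range(int lo, int hi)
{
    int width = hi - lo + 1;
    uint32_t ones = (width >= 32) ? 0xFFFFFFFFu : ((1u << width) - 1u);
    return ones << lo;
}

/* GPIOs on the connector as listed in the post (2..18 and 20..27),
   minus the special-function pins: I2C (2-3), SPI (7-11), UART (14-15). */
static uint32_t free_header_pins(void)
{
    uint32_t header   = pin_range(2, 18) | pin_range(20, 27);
    uint32_t reserved = pin_range(2, 3) | pin_range(7, 11) | pin_range(14, 15);
    return header & ~reserved;
}

/* Length of the longest run of consecutive 1 bits in mask.
   The lowest bit number of that run is stored in *start (-1 if none). */
static int longest_run(uint32_t mask, int *start)
{
    int best_len = 0, best_start = -1, run_len = 0;
    for (int bit = 0; bit < 32; bit++) {
        if (mask & (1u << bit)) {
            run_len++;
            if (run_len > best_len) {
                best_len = run_len;
                best_start = bit - run_len + 1;
            }
        } else {
            run_len = 0;
        }
    }
    if (start)
        *start = best_start;
    return best_len;
}
```

Calling `longest_run(free_header_pins(), &start)` with these assumptions gives a run of 8 starting at GPIO 20, which matches the 20-27 byte mentioned above.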
That said, the reason you're seeing things go slow is unlikely to be related to accessing separate bits in memory-mapped IO. Suppose I chose to connect my 8-bit bus to GPIO 4, 14, 15, 17, 18, 27, 22 and 23 because those are nicely together on the connector. Then, after clearing those pins through GPCLR0 as before, I'd write:
GPIO->GPSET0 = (data&0x01)<<(4-0)  |
               (data&0x02)<<(14-1) |
               (data&0x04)<<(15-2) |
               (data&0x08)<<(17-3) |
               (data&0x10)<<(18-4) |
               (data&0x20)<<(27-5) |
               (data&0x40)<<(22-6) |
               (data&0x80)<<(23-7);
Because we have three "stretches" of contiguous bits, we can simplify this:
GPIO->GPSET0 = (data&0x01)<<(4-0)  |
               (data&0x06)<<(14-1) |
               (data&0x18)<<(17-3) |
               (data&0x20)<<(27-5) |
               (data&0xc0)<<(22-6);
Now if you know this in advance, you can optimize this even further by choosing longer stretches of contiguous bits. I have written and compiled this, and it turns into twelve instructions. So at 1.2GHz that would execute in about 10ns (assuming one instruction per cycle, and that you're doing this often enough for the code to stay in cache).
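As a sanity check on the grouping, the two packings can be written as plain functions and compared for every byte value. This is only a sketch: `bus_pins`, `pack_per_bit`, `pack_grouped` and `packing_self_check` are illustrative names, and the pin assignment is the one assumed in the example above.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed wiring from the example: data bit i goes to GPIO bus_pins[i]. */
static const uint8_t bus_pins[8] = {4, 14, 15, 17, 18, 27, 22, 23};

/* Reference version: move one bit at a time. */
static uint32_t pack_per_bit(uint8_t data)
{
    uint32_t word = 0;
    for (int bit = 0; bit < 8; bit++) {
        if (data & (1u << bit))
            word |= 1u << bus_pins[bit];
    }
    return word;
}

/* Grouped version: neighbouring data bits that land on neighbouring
   GPIOs are moved together with a single mask and shift. */
static uint32_t pack_grouped(uint8_t data)
{
    uint32_t d = data;
    return ((d & 0x01u) << (4 - 0))  |   /* bit 0     -> GPIO 4      */
           ((d & 0x06u) << (14 - 1)) |   /* bits 1..2 -> GPIO 14..15 */
           ((d & 0x18u) << (17 - 3)) |   /* bits 3..4 -> GPIO 17..18 */
           ((d & 0x20u) << (27 - 5)) |   /* bit 5     -> GPIO 27     */
           ((d & 0xC0u) << (22 - 6));    /* bits 6..7 -> GPIO 22..23 */
}

/* Check that both versions agree for all 256 possible bytes. */
static void packing_self_check(void)
{
    for (int d = 0; d < 256; d++)
        assert(pack_grouped((uint8_t)d) == pack_per_bit((uint8_t)d));
}
```

The value `pack_grouped(0xFF)` is also the mask to write to the clear register before setting the new data, since it has a 1 on every bus pin.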
If you are complaining about bitwise access and slowness, then I suspect that you might be accessing the Linux GPIO driver through /sys or something. THAT is going to be horribly slow no matter what.
I'm glad my contribution has spurred you into making your own !
Have you tried your code out to see how fast it can actually toggle a pin? It is common for IO ports to use a slower bus than the processor core.
BTW - the ST ARM Cortex-M processors go quite well - up to 200MHz core clocks (400 if you count the H7, which you can't buy yet) and they also have multi-pin set and clear registers for GPIO as well as read and write. The fastest toggle rate I've achieved (measured) is 42MHz, i.e. 11.9ns on, 11.9ns off.
The PI core will certainly go much better (than a 200MHz M4), but the IO is more complicated and I don't have your PI expertise to know what effect this may have.
MK
Yes, the IOs could be on a slower bus. External pins are usually limited to about 50MHz, except for special-purpose high-speed ones. So, if there is a need, you can stuff the bits into the 32-bit word in 10ns, but the pin might take up to 20ns to update.
The 50MHz limitation is not a "we have a register that updates every 20ns". Maybe there is a bus speed limitation, so that consecutive writes to the peripheral are spread out a bit. So on an STM32F405 which runs at 168MHz, I think the IO bus runs at 42MHz max. But on a processor like the one on the pi, I expect that if you want, you can run such an IO bus a bit faster, and the 50MHz is just the IO pin struggling to get the signal from 0V to 3.3V within the 20ns. (or 10ns if you take the 50MHz limit as: Can produce a 50MHz signal which requires a 100MHz edge rate).
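The timing figures quoted in this thread are easy to reproduce. The helpers below are a rough sketch with made-up names (`instr_time_ns`, `period_ns`, `half_period_ns`); the instruction-time estimate assumes one instruction per clock cycle with everything already in cache, which is optimistic for a real application processor.

```c
#include <assert.h>

/* Time in ns to run n_instr instructions at clock_hz, assuming one
   instruction per cycle (an idealisation, not a measurement). */
static double instr_time_ns(unsigned n_instr, double clock_hz)
{
    assert(clock_hz > 0.0);
    return (double)n_instr * 1e9 / clock_hz;
}

/* Full period in ns of a signal at freq_hz. */
static double period_ns(double freq_hz)
{
    assert(freq_hz > 0.0);
    return 1e9 / freq_hz;
}

/* High (or low) time in ns of a 50% duty-cycle square wave at freq_hz. */
static double half_period_ns(double freq_hz)
{
    return period_ns(freq_hz) / 2.0;
}
```

With these, `instr_time_ns(12, 1.2e9)` comes out at 10ns, `half_period_ns(42e6)` at roughly 11.9ns, and `period_ns(50e6)` and `half_period_ns(50e6)` at 20ns and 10ns, the two readings of the 50MHz pin limit.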