High Throughput SPI traffic part 1a - Buffers and DMA

Jan Cumps

16 May 2016

We know SPI as a 4*-wire protocol. But it doesn't have to be.

I'm checking high speed SPI data transfer with buffers, DMA and parallel data lines.

This time, I'm preparing a test bed for SPI with Buffers and DMA. Some hardcore memory probing going on.

Buffers and DMA

Before we look for hardware performance boosters, let's check what we can do on the firmware side.

A typical speed kicker is using buffers and direct memory access (DMA).

We all typically start of our SPI communication with a loop that sends values byte per byte (or word per word).

Many controllers can do better. I'm using a Hercules controller that supports both buffered SPI and DMA.

Buffered SPI means that you can tell the SPI peripheral where your data sits, and how big it is.

The SPI module will then independently (and often without taking up processor power) physically stream the data.

That's already a good thing. It decreases the amount of context switching between controller and SPI module by the number of data elements you manage to buffer.

The second option, DMA, adds another win.

Your SPI module doesn't even need the controller to get at the data in that case.

Even though the data won't be delivered to the destination at the time you post it, you can walk away and do other things.

Use case: Draw a Bitmap on an LCD

I'm going to try this out on a 128 x 128 colour LCD.

I have a naive implementation that sends pixel per pixel via SPI.

I'll try to boost performance first by using a line buffer.

The SPI module can then move this bitmap line as a single operation to the LCD.

The second exercise is to enable DMA, and have the SPI do the exchange all by itself.

In a last phase, I'll try to work with two buffers, and fill the data for line 2 while line 1 is being drawn.

Naive Implementation of the BitMap Draw

The image will be available in our code as a big array of pixels.

The array contains 16-bit values. Each position just contains the RGB colour of that particular pixel.

My LCD is 128*128. My array has 16384 positions.

I used Vladimir Rioson's LCD Image Converter - I'll have to spend another blog on the process to get the data in firmware

I started off with the simplest draw function possible.

For each horizontal line, I iterate through all the pixels and draw them on the correct position.

    for (yy = 0; yy < bmp->height; yy++) {
        for (xx = 0; xx < bmp->width; xx++) {
            point(xx, yy, (bmp->data)[(yy)*bmp->height + (xx) ]);
        }
    }

Each time we call the point() function, we execute 8 SPI calls. That results in 16K * 8 = 131K SPI calls.

It's almost a miracle that it draws in less than a second.

(I'm using my own Hercules Educational BoosterPack MKII LCD library, ported from the Energia one for that display)

Preparation: Special Memory for MDA and Filling the Buffer Line

DMA needs some preparation. On the Hercules, the memory is by default configured to maximise caching.

It only writes info from cache to the real mem when it's needed by the core.

That's cool as long as it's only the ARM core that talks to that memory. But in our DMA set-up, that's not the case.

The SPI module will hit that memory in parallel with the core.

Any cashing scheme that doesn't immediately write through to the physical memory, fails. So we have to make special provisionings.

It would be stupid to make all memory write-through. That would impact the performance of our controller badly.

What we can do instead is to reserve a specific piece of RAM for the buffers, and make that shared.

We'll call this memory area SHAREDRAM. We'll steal this are from the normal configured RAM .

I will create a buffer for one bitmap line. That's 128 colour values.

Because SPI exchanges data both ways during each clock tick, we'll also have to provide a receive button - even though our LCD display will never send data back.

One line is 128 values.

So we'll need one buffer of 128 positions for send data (that's 0x0100 in HEX for a 16-bit array - handy to know that for later). Automatically, we will need to provision a same sized receive buffer.

In total, we'll need the double to accommodate receive data too (although it 'll stay unused).

Here's a snippet of the memory map. This file is generated during the build process, but showing it now will prove that we're good (you can also see that I've reserved 0x1000 in stead of 2 x 0x0100 positions).

MEMORY CONFIGURATION
         name            origin    length      used     unused   attr    fill
----------------------  --------  ---------  --------  --------  ----  --------
...
  SHAREDRAM             0802f000   00001000  00000100  00000f00  RW

From Standard Memory Config to Dedicated DMA Area

When we create a Hercules project for the RM46 controller (the one I'm using here), all RAM is in one chunk.

    RAM     (RW) : origin=0x08001500 length=0x0002EB00

And all that RAM has the following attributes:

For our DMA, we'll need better than shareable. So we'll steal 256 positions from the RAM, and make it write-trough.

You'll need knowledge of your linker command file for that. I've blogged about that;

So we'll first nick room for those 256 positions from our RAM (I'm taking room for more here) , and call that chunk SHAREDRAM.

    RAM       (RW) : origin=0x08001500 length=0x0002DB00
    SHAREDRAM (RW) : origin=0x0802F000 length=0x00001000

(see how we decreased the RAM part by 0x01000?)

Let's now refine the cache properties of that little chunk, and make it write-trough.

Both ARM core and DMA will have the same view of the memory from now on. Guaranteed.

Getting our Buffers in Shared Memory

Now we'll have to place our buffers inside that SHAREDRAM.
We'll first add the directives for the linker in the linker command file.

    .sharedRAM : {} > SHAREDRAM

That will instruct the linker to put all variables that are flagged as sharedRAM to go in our special shared memory location.

In our code, we use a #pragma to flag this special condition.

#pragma SET_DATA_SECTION(".sharedRAM")
uint16_t TXDATA[D_SIZE];         /* transmit buffer in sys ram */
uint16_t RXDATA[D_SIZE]= {0};    /* receive  buffer in sys ram */
#pragma SET_DATA_SECTION()

Everything after the first #pragma goes to the special area. Everything after the second #pragma to the default memory space for that type of source object.

Cool he? We've defined a special 4KB area with special cache handling. And we've managed to move our buffer into that area.

In the memory map file (created by the linker during the build, edited by me to only show the highlights), you'll find the proof.

MEMORY CONFIGURATION
         name            origin    length      used     unused   attr    fill
----------------------  --------  ---------  --------  --------  ----  --------
  RAM                   08001500   0002db00  0000a13c  000239c4  RW  
  SHAREDRAM             0802f000   00001000  00000100  00000f00  RW  

SEGMENT ALLOCATION MAP
run origin  load origin   length   init length attrs members
----------  ----------- ---------- ----------- ----- -------
0802f000    0802f000    00000100   00000000    rw-
  0802f000    0802f000    00000100   00000000    rw- .sharedRAM

SECTION ALLOCATION MAP
 output                                  attributes/
section   page    origin      length       input sections
--------  ----  ----------  ----------   ----------------
.sharedRAM 
*          0    0802f000    00000100     UNINITIALIZED
                  0802f000    00000100     bitmaputils.obj (.sharedRAM:uninit)



GLOBAL SYMBOLS
address   name                                
-------   ----                                
0802f000  TXDATA

One thing that strikes out is: why is there only room allocated for the TXDATA buffer, and none for the RXDATA one?

That's because my code isn't finished, and I've not used the RXDATA buffer in my code yet. The linker kindly (and correctly) discards it.

Final Prep Step: Efficiently Filling the Buffer

Our code has to move a line of bitmap into the buffer efficiently. For us c aficionados, that means using memcpy().

We'll zap one line at a time from the full bitmap into the line array. Here's the code to do that.

    memcpy(&TXDATA, (const void *)(bmp->data + (bmp->width) * row), 2 * D_SIZE);

Not much to be said here. We're copying a number of bytes (2 * D_SIZE) from the address of beginning of a row (bmp->data + (bmp->width) * row) to the buffer (&TXDATA).

Why 2 * D_SIZE? Because memcpy() copies a number of bytes. Our array exists of 16 bit values - and 16 bits is 2 bytes. it takes 2 times the array count to get all data moved.

What have we Achieved Now?

We have a buffer ready to optimise SPI traffic. The buffer is configured with the correct cache behaviour.

And we have a fast method to move bitmap chunks (lines) into that buffer.

We're all set up for the next stage, where we'll try to turn this into a fast SPI funnel.

Disclaimer: I haven't checked yet if my display actually accepts a series of values and paints them one after the other. We mai be in for a surprise if it doesn't.

Anyways, that doesn't impact the approach described here. I'll just not be able to show it for real then ...

The Series
0 - Buffers and Parallel Data Lines
1a - Buffers and DMA
1b - SPI without Buffers
2 - SPI with Buffers
3a - SPI with DMA
3b - SPI with DMA works
4a - SPI Master with DMA and Parallel Data Lines
Hercules microcontrollers, DMA and Memory Cache

Top Comments

balearicdynamics over 9 years ago

Hi Jan! Very well done ! I wait you finish writing all the related articles then I take them as a reference for the PSoC4 vs MSP432 data exchange connection.

Enrico
- Cancel
- Vote Up +2 Vote Down
- Sign in to reply
- More
- Cancel
Jan Cumps over 9 years ago in reply to DAB

DAB, one more clarification:

The configurator doesn't know about the board. It strictly deals with the microcontroller only.
- Cancel
- Vote Up +1 Vote Down
- Sign in to reply
- More
- Cancel
Jan Cumps over 9 years ago in reply to shabaz
shabaz, the configurator has a mechanism to retain the round-trip mechanism.
If you only make changes in the dedicated 'user exits', you can switch from code to configurator and back.

On all key locations, the configurator inserts these types of comments in the generated source:
```
/* USER CODE BEGIN (3) */
/* USER CODE END */
```
If you consistently keep your modifications (and overrides) within these sections,
you can keep on switching between source and configurator without issues.
The configurator never touches these sections.
It takes a little discipline to do that and you will be punished by a 1000 pains if you dont't, so that settles itself after some time .
I think that round-trip design where you can switch between source and GUI at will - whatever serves you best - is key for a healthy life.
I try to use the configurator wherever it fits. Only if there's a real-life reason to override its behaviour, I do that.

The one for Hercules (HALCoGen) is thin. It's a graphical layer on top of register and tool chain config.
I mostly just use the config and driver APIs it generates.

I turn to direct register code when HALCoGen doesn't have a solution for me. Because its generated code is stateless, this combination works very well.
- Cancel
- Vote Up +2 Vote Down
- Sign in to reply
- More
- Cancel
Jan Cumps over 9 years ago in reply to DAB
DAB, the configurator I'm showing here for the memory set-up is merely a front-end.
In the end, these settings are converted into instructions for both linker and memory initialisation code.

I regularly get flack for using a GUI to configure these things. But it helps me tremendously.
Nothing stays hidden from the designer. In the end this all becomes source code and tool chain config files.
There are many ways to interfere with the configurator in source code, while keeping the round-trip functionality intact (changes in source don't break the configurator if done properly).

It's also a great learning tool. You can see directly in the code what a configuration change does.

The following config:

Becomes c and assembler code:
```
r11Base  .word 0x0802F000
// ...
```
```
        ; Setup region 11
        mov   r0,  #10
        mcr   p15, #0,    r0, c6, c2, #0
        ldr   r0,  r11Base
        mcr   p15, #0,    r0, c6, c1, #0
        mov   r0,  #0x000C
        orr   r0,  r0,    #0x1300
        mcr   p15, #0,    r0, c6, c1, #4
        movw  r0,  #((1 << 15) + (1 << 14) + (1 << 13) + (0 << 12) + (0 << 11) + (0 << 10) + (0 <<  9) + (0 <<  8) + (0x0B << 1) + (1))
       
```
That doesn't take away my huge respect for people that do everything in source directly. Hats off. But I'm not one of them.
- Cancel
- Vote Up +1 Vote Down
- Sign in to reply
- More
- Cancel
shabaz over 9 years ago

Hi Jan,

Nice post, it was very readable!
By you manually modifying the linker file to set up your data segments, did it cause any issues with the GUI configuration of it in
Code Composer Studio, or was it ok with it? (I've not modified my linker file manually for CCS projects before).
- Cancel
- Vote Up +2 Vote Down
- Sign in to reply
- More
- Cancel