Tutorial: Adafruit WS2812B NeoPixels with the Freescale FRDM-K64F Board – Part 5: DMA

Posted on August 5, 2015 by Erich Styger

This is Part 5 of a Mini Series. In Part 4, I described how to set up the FTM (Kinetis Flex Timer Module) to generate the waveforms required for DMA operations (see “Tutorial: Adafruit WS2812B NeoPixels with the Freescale FRDM-K64F Board – Part 4: Timer“). In this post I describe how to use the FTM to trigger DMA (Direct Memory Access) events. The goal is to drive Adafruit’s NeoPixel (WS2812B) with the Freescale FRDM-K64F board:

FRDM-K64F with Adafruit NeoPixel

Mini Series Tutorial List

Outline

In this article I use DMA (Direct Memory Access) to do memory to memory operations to generate the required bit stream for the WS2812B LEDs. In the previous tutorial I have used the FTM of the FRDM-K64F device to generate three signals:

Waveforms and Timing

I will use the ‘falling edge’ of the signals to trigger DMA transfers, marked as ‘M’ in the following timing diagram:

Driving Bits with DMA

In this post I’m using Kinetis Design Studio v3.0.0 with the Kinetis SDK v1.2.

We will set up this whole engine later in this article. First let’s do the easy thing: configure the GPIO pin connected to the DIN of the LEDs.

GPIO Port

To generate the signal to DIN of the NeoPixel/WS2812, I can use a normal GPIO (General Purpose Input/Output) pin. If I use multiple pins on such a GPIO port, I can drive multiple ‘lanes’ of pixel arrays.

💡 Each LED/pixel needs 24 bits (8 bits each for red, green and blue). Because I write one byte to the GPIO port for every WS2812 bit, I need 24 bytes of memory (usually RAM) for each LED position in a lane. So having a lot of LEDs means a lot of RAM. With just one lane, only one bit in each byte is used. But if I have 8 lanes (say port bits 0 to 7), the same 24 bytes drive 8 LEDs at once, which works out to 3 bytes per pixel. So if you have many, many LEDs, use multiple lanes to combine them. This not only reduces the amount of memory needed, it also reduces the time needed to send the bit stream.
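To make the lane arithmetic concrete, here is a minimal host-side sketch. The function name neo_buffer_bytes() and the two macros are made up for illustration and are not part of the project sources.

```c
#include <assert.h>
#include <stddef.h>

#define NEO_BITS_PER_PIXEL 24u /* 8 bits each for green, red and blue */
#define NEO_MAX_LANES       8u /* one lane per bit of a byte-wide port write */

/* Bytes of buffer needed when every lane carries pixelsPerLane pixels.
 * One byte is written to the port per WS2812 bit time, and bit n of that
 * byte feeds lane n, so the size does not depend on the number of lanes. */
size_t neo_buffer_bytes(size_t pixelsPerLane, unsigned lanes) {
  assert(lanes >= 1u && lanes <= NEO_MAX_LANES);
  (void)lanes; /* only validated: extra lanes reuse the spare bits of each byte */
  return pixelsPerLane * NEO_BITS_PER_PIXEL;
}
```

For a single 8x8 matrix, neo_buffer_bytes(64, 1) gives 1536 bytes. Eight such matrices on eight lanes, neo_buffer_bytes(64, 8), need the same 1536 bytes, which is 3 bytes for each of the 512 pixels.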

To use the GPIO port, I need to:

Mux the Pin to the port used. Basically this means to route the port internal signal to the external pin.
Clock the port (enable the clock). Accessing the port registers without having it clocked will cause a hard fault.
Configure the port/pin as output pin/port using the GPIOx_PDDR (Port Data Direction Register).
To put the pin(s) HIGH, I can write a 1 bit/value to the GPIOx_PSOR (Port Set Output Register)
To put the pin(s) LOW, I can write a 1 bit/value to the GPIOx_PCOR (Port Clear Output Register)
To put the pin(s) either HIGH or LOW, I can write the bit/value into the GPIOx_PDOR (Port Data Output Register).
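The difference between the three output registers can be sketched with a small software model. This is not SDK code: the type gpio_model_t and the gpio_write_*() helpers are hypothetical, and the model assumes full-width register writes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Host-side model of one GPIO port's output latch (hypothetical type,
 * not an SDK structure). */
typedef struct {
  uint32_t pdor; /* current output value, one bit per pin */
} gpio_model_t;

/* PSOR: every 1 bit in mask sets the pin HIGH, 0 bits leave it unchanged */
void gpio_write_psor(gpio_model_t *port, uint32_t mask) {
  assert(port != NULL);
  port->pdor |= mask;
}

/* PCOR: every 1 bit in mask sets the pin LOW, 0 bits leave it unchanged */
void gpio_write_pcor(gpio_model_t *port, uint32_t mask) {
  assert(port != NULL);
  port->pdor &= ~mask;
}

/* PDOR: the written value replaces the state of all pins */
void gpio_write_pdor(gpio_model_t *port, uint32_t value) {
  assert(port != NULL);
  port->pdor = value;
}
```

The point of the model: PSOR and PCOR only touch the pins selected by the mask, while a PDOR write also changes every other pin covered by the write.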

The following diagram shows the necessary port output register writes to create the WS2812 bit stream:

GPIO Output Register Writes

We could do this from the timer interrupts, but again this would be too slow. Instead, these port output register writes shall be triggered by DMA.

Configure the GPIO Port

On my board, I’m only using one lane/pin to the DIN of the WS2812B. I’m going to use PTD0 (PORT D, pin 0) for it:

Using PTD0 to DIN

The other three white wires are the pins of the three FTM channels connected to the logic analyzer.

So I need to extend my hardware initialization as below:

Line 4: enable clock gate for port D
Line 11: Mux PTD0 as GPIO
Line 12: Write the PDDR (Port Data Direction Register) with a 1 bit to use PTD0 as output pin.

static void InitHardware(void) {
  /* Enable clock for PORTs */
  SIM_HAL_EnableClock(SIM, kSimClockGatePortC);
  SIM_HAL_EnableClock(SIM, kSimClockGatePortD);

  /* Setup board clock source. */
  g_xtal0ClkFreq = 50000000U;           /* Value of the external crystal or oscillator clock frequency of the system oscillator (OSC) in Hz */
  g_xtalRtcClkFreq = 32768U;            /* Value of the external 32k crystal or oscillator clock frequency of the RTC in Hz */

  /* Use PTD0 as DIN to the Neopixels: mux it as GPIO and output pin */
  PORT_HAL_SetMuxMode(PORTD, 0UL, kPortMuxAsGpio); /* PTD0: DIN to NeoPixels */
  GPIO_PDDR_REG(PTD_BASE_PTR) |= (1<<0); /* PTD0 as output */

  /* FTM and FTM Muxing */
  InitFlexTimer(FTM0_IDX);
  PORT_HAL_SetMuxMode(PORTC,1UL,kPortMuxAlt4); /* use PTC1 for channel 0 of FTM0 */
  PORT_HAL_SetMuxMode(PORTC,2UL,kPortMuxAlt4); /* use PTC2 for channel 1 of FTM0 */
  PORT_HAL_SetMuxMode(PORTC,3UL,kPortMuxAlt4); /* use PTC3 for channel 2 of FTM0 */
}

You might notice that I’m using different APIs to do this.

PORT_HAL_SetMuxMode(PORTD, 0UL, kPortMuxAsGpio); /* PTD0: DIN to NeoPixels */

is a method of the Kinetis SDK. However

GPIO_PDDR_REG(PTD_BASE_PTR) |= (1<<0); /* PTD0 as output */

is using a CMSIS-Core style direct register write. The muxing is straightforward. However, setting up a pin as an output requires additional layers in the SDK with pin descriptors. To me, using the Kinetis SDK GPIO layers is overly complex in this example, so I simply use CMSIS register macros.

💡 I also want to show here that mixing and matching the SDK with CMSIS is, in my view, a good way to balance ease of use and complexity.

With this, I have my GPIO pin configured. Now I need to write the port registers with DMA.

Direct Memory Access

As explained in the Concepts post, I need something very fast to write a GPIO port register. The timing is around 0.3 μs, which is definitely too fast to use the CPU for this, especially if I want the CPU to do something else too. With DMA, the access to memory is done without CPU involvement, which is exactly what I need.

I’m using DMA on the FRDM-KL25Z board for things like reading ports in a DIY Logic Analyzer, or driving WS2812 pixels. The ARM Cortex-M4F microcontroller on the FRDM-K64F board has an eDMA (enhanced DMA) controller. It can use up to 16 independent DMA channels for DMA operations, with advanced source and destination address calculations. That eDMA controller is described in the K64F Reference Manual.

eDMA Block Diagram (Source: Freescale K64F Reference Manual)

Data Path: the controller can read/write data from/to the crossbar switch. The crossbar provides access to memory and peripherals.
Address Path: This block is calculating the source and destination address. It does the calculation, plus any incrementing or decrementing of the address. For this it uses Transfer Control Descriptors (TCD).
Control and Channel Arbitration: This block is responsible for receiving DMA requests from the supported request sources (e.g. from the timer module) and for writing flags back to them (like telling the timer module that the DMA operation is done).
Transfer Control Descriptor: The descriptor is used to describe what shall be done in the DMA operations: how many bytes to read/write, source and destination address, what to do after the transfer, how many loops (inner and outer loops).

The basic DMA flow is the following: When a DMA peripheral request comes in, it will set the source and destination address using the TCD:

eDMA Operation, Part 1 (Source: Freescale K64F Reference Manual)

Using the source and destination address, the controller will do the read/write operation. Depending on the configuration in the TCD, this can be multiple source/destination read/writes with ‘minor’ and ‘major’ loop counters:

eDMA Operation, Part 2 (Source: Freescale K64F Reference Manual)

In the last step, the TCD is updated, e.g. address values are changed and flags get set. Additionally, the peripheral that requested the DMA transfer gets informed that the operation is done:

eDMA operation, Part 3
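The three steps above can be sketched as a small software model. This is a simplification for illustration only: the structure and field names are made up, byte-wide transfers are assumed, and it does not mirror the real TCD register layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified software model of a transfer descriptor. The field names are
 * made up for this sketch and byte-wide transfers are assumed. */
typedef struct {
  const uint8_t *src;   /* current source address */
  uint8_t *dst;         /* current destination address */
  ptrdiff_t srcOffset;  /* added to src after each byte */
  ptrdiff_t dstOffset;  /* added to dst after each byte */
  uint32_t minorBytes;  /* bytes moved per DMA request (minor loop) */
  uint32_t majorCount;  /* remaining requests (major loop counter) */
} tcd_model_t;

/* Service one DMA request: run one minor loop, then count down the major
 * loop. Returns true when the major loop is exhausted, which is the point
 * where the real controller can raise its 'end of transfer' interrupt. */
bool tcd_service_request(tcd_model_t *tcd) {
  assert(tcd != NULL && tcd->majorCount > 0u);
  for (uint32_t i = 0; i < tcd->minorBytes; i++) {
    *tcd->dst = *tcd->src;      /* read from source, write to destination */
    tcd->src += tcd->srcOffset; /* update addresses for the next byte */
    tcd->dst += tcd->dstOffset;
  }
  tcd->majorCount--;
  return tcd->majorCount == 0u;
}
```

With srcOffset set to 1, dstOffset set to 0 and minorBytes set to 1, every request copies the next byte of a buffer to the same destination, which is the pattern needed to feed a GPIO register from a buffer.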

Memory Considerations

Remember, I have three FTM channels. Each channel shall trigger a GPIO port operation:

FTM0 Channel 0: Write ‘1’ to PSOR to set DIN to HIGH.
FTM0 Channel 1: Write data bit to PDOR to either keep DIN HIGH (‘1’ WS2812 bit) or to put DIN LOW (‘0’ WS2812 bit).
FTM0 Channel 2: Write ‘1’ to PCOR to set DIN to LOW.
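The effect of these three writes on DIN within one bit slot can be checked with a small host-side sketch. The function ws2812_bit_slot() is hypothetical and only models bit 0 of the port. It is not code that runs on the board.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DIN_MASK 0x01u /* PTD0 is bit 0 of the port */

/* DIN level after each of the three timer events of one WS2812 bit slot.
 * dataByte is the buffer byte for this slot (bit 0 carries the data bit).
 * level[0]: after the channel 0 event (PSOR write),
 * level[1]: after the channel 1 event (PDOR write),
 * level[2]: after the channel 2 event (PCOR write). */
void ws2812_bit_slot(uint8_t dataByte, uint8_t level[3]) {
  uint8_t port = 0; /* modelled port output value */

  assert(level != NULL);
  port |= DIN_MASK;           /* channel 0: write 1 to PSOR */
  level[0] = port & DIN_MASK;
  port = dataByte;            /* channel 1: write data byte to PDOR */
  level[1] = port & DIN_MASK;
  port &= (uint8_t)~DIN_MASK; /* channel 2: write 1 to PCOR */
  level[2] = port & DIN_MASK;
}
```

A data byte of 1 yields the levels 1, 1, 0 (long HIGH phase, a WS2812 ‘1’), a data byte of 0 yields 1, 0, 0 (short HIGH phase, a WS2812 ‘0’).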

This needs to be done for each WS2812 bit, and the number of bits is given by the number of WS2812 LEDs (24 bits for each), and the bits are stored in a buffer:

#define NEO_NOF_PIXEL       (8*8) /* Adafruit 8x8 matrix */
#define NEO_NOF_BITS_PIXEL   (24) /* 24 bits for pixel */
static uint8_t transmitBuf[NEO_NOF_PIXEL*NEO_NOF_BITS_PIXEL];

Remember, that only the least-significant-bit is used in each byte, as I’m only using a single lane of WS2812.

💡 If I used 8 lanes (e.g. 8 NeoPixel matrix displays, each connected to a single port pin, PTD0 to PTD7), then I would use every bit of each byte. The buffer would stay the same size, so I would effectively need only 3 bytes of memory for each WS2812 pixel.

Triggering DMA Requests

To enable DMA requests from my FTM channels, I need to carefully read the reference manual:

FTM DMA Request

What is confusing to me is that two settings (DMA=0|CHnIE=0 and DMA=1|CHnIE=0) do the same thing. First I thought that this must be a copy-paste error in the manual. But without enabling the ‘Interrupt Enable’ (CHnIE) bit the DMA was not working :-(. So it seems that both bits really have to be set. And this is what I had to do in my FTM initialization/reset routine:

static void ResetFTM(uint32_t instance) {
  FTM_Type *ftmBase = g_ftmBase[instance];
  uint8_t channel;

  /* reset all values */
  FTM_HAL_SetCounter(ftmBase, 0); /* reset FTM counter */
  FTM_HAL_ClearTimerOverflow(ftmBase); /* clear timer overflow flag (if any) */
  for(channel=0; channel<NOF_FTM_CHANNELS; channel++) {
    FTM_HAL_ClearChnEventFlag(ftmBase, channel); /* clear channel flag */
    FTM_HAL_SetChnDmaCmd(ftmBase, channel, true); /* enable DMA request */
    FTM_HAL_EnableChnInt(ftmBase, channel); /* enable channel interrupt: need to have both DMA and CHnIE set for DMA transfers! See RM 40.4.23 */
  }
}
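At register level, the two HAL calls boil down to setting two bits in the channel status and control register (CnSC). The following sketch shows that bit logic in isolation. The macro and function names are made up, and the bit positions are an assumption that should be checked against the device header file.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit positions in the FTM channel status and control register
 * (CnSC). Verify against the device header before relying on them. */
#define CNSC_DMA_MASK  (1u << 0) /* DMA: channel event requests a DMA transfer */
#define CNSC_CHIE_MASK (1u << 6) /* CHIE: channel interrupt enable */

/* Register value with both DMA and CHIE set, all other bits preserved */
uint32_t cnsc_enable_dma_request(uint32_t cnsc) {
  return cnsc | CNSC_DMA_MASK | CNSC_CHIE_MASK;
}

/* True only if both bits are set, i.e. a channel event requests DMA */
bool cnsc_requests_dma(uint32_t cnsc) {
  const uint32_t both = CNSC_DMA_MASK | CNSC_CHIE_MASK;
  return (cnsc & both) == both;
}
```

With only one of the two bits set, cnsc_requests_dma() returns false, matching what I observed on the hardware: the DMA bit alone does not trigger a transfer.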

DMA Driver Initialization

Time to initialize the DMA driver of the SDK. Because of the complexity of eDMA, I’m using again a mixture of Kinetis SDK API and Kinetis SDK HAL API. The initialization of the DMA I do with the SDK API:

static void InitDMADriver(void) {
  edma_user_config_t  edmaUserConfig;
  static edma_state_t edmaState;

  /* Initialize eDMA modules. */
  edmaUserConfig.chnArbitration = kEDMAChnArbitrationRoundrobin; /* use round-robin arbitration */
  edmaUserConfig.notHaltOnError = false; /* do not halt in case of errors */
  EDMA_DRV_Init(&edmaState, &edmaUserConfig); /* initialize DMA with configuration */
}

The initialization is rather simple: I set the DMA channel arbitration (priority scheduling) to Round-Robin. This means that the DMA will service one channel after another, and not use the DMA channel priority mechanism. As I have a fixed sequence of timer channel events, I keep it simple and use round-robin. With notHaltOnError I specify that the device should not halt in case of errors; this is again to keep things simple.

I initialize the DMA Driver as part of my hardware initialization:

static void InitHardware(void) {
  /* Enable clock for PORTs */
  SIM_HAL_EnableClock(SIM, kSimClockGatePortC);
  SIM_HAL_EnableClock(SIM, kSimClockGatePortD);

  /* Setup board clock source. */
  g_xtal0ClkFreq = 50000000U;           /* Value of the external crystal or oscillator clock frequency of the system oscillator (OSC) in Hz */
  g_xtalRtcClkFreq = 32768U;            /* Value of the external 32k crystal or oscillator clock frequency of the RTC in Hz */

  /* Use PTD0 as DIN to the Neopixels: mux it as GPIO and output pin */
  PORT_HAL_SetMuxMode(PORTD, 0UL, kPortMuxAsGpio); /* PTD0: DIN to NeoPixels */
  GPIO_PDDR_REG(PTD_BASE_PTR) |= (1<<0); /* PTD0 as output */

  /* FTM and FTM Muxing */
  InitFlexTimer(FTM0_IDX);
  PORT_HAL_SetMuxMode(PORTC,1UL,kPortMuxAlt4); /* use PTC1 for channel 0 of FTM0 */
  PORT_HAL_SetMuxMode(PORTC,2UL,kPortMuxAlt4); /* use PTC2 for channel 1 of FTM0 */
  PORT_HAL_SetMuxMode(PORTC,3UL,kPortMuxAlt4); /* use PTC3 for channel 2 of FTM0 */

  InitDMADriver(); /* initialize DMA driver */
}

Transfer the Bits with DMA

So far I have everything set up:

FTM timer is generating the needed signals, with DMA triggering enabled
GPIO for the DIN to the LED is ready
eDMA driver is initialized

Now I can start a DMA transfer, and I use the following method:

void DMA_Transfer(uint8_t *transmitBuf, uint32_t nofBytes);

Remember, that I have a buffer with the bits for the WS2812 LEDs. In order to send the bits to the PTD0, I can use

DMA_Transfer(transmitBuf, sizeof(transmitBuf));

DMA Transfer

I’m going to use three DMA channels, one for each timer channel. In order to transmit the bits with DMA in DMA_Transfer(), I do the following:

Reset FTM: reset the timer registers. The FTM is not clocked at this point.
DMA Muxing: Request three DMA channels for FTM0 channels 0, 1 and 2.
Install callback: install an ‘End of Transfer’ interrupt handler for the third DMA channel (channel 2). That way I get notified when the transfer of all bits is over.
Setup the DMA TCD: Setting up the Transfer Control Descriptor with source/destination for the DMA channel.
Start/Enable all DMA channels: this turns on/enables the DMA channels.
Start the FTM: initialize a ‘dmaDone’ flag and turn on the clocks to the FTM, letting the timer run.
Wait until DMA is done: the ‘end of transfer interrupt’ will set the ‘dmaDone’ flag.
Turn off FTM: remove the clock from the FTM timer.
Disable/stop all DMA channels.
De-Mux and de-install DMA channels.

💡 You might wonder why I’m doing the Muxing and De-Muxing for every transfer (steps 2 and 10)? The answer is (I believe) that there are internal propagation delays inside the DMA controller. Muxing and De-Muxing the DMA ensures that the DMA controller is resetting its internal registers. I had to learn this the hard way: DMA worked fine at lower speed (say 1 ms DMA frequencies), as there was enough time and clocking inside the module to get it into the correct state. But using the DMA in the sub-μs time domain as I’m using it here definitely showed some strange DMA behaviour with ‘ghost’ DMA transfers. I already had these strange things happening on the FRDM-KL25Z, see “NeoShield: WS2812 RGB LED Shield with DMA and nRF24L01+“.

The following is the full routine. I will discuss some of the details below:

/* DMA related */
#define NOF_EDMA_CHANNELS  3 /* using three DMA channels */
static edma_chn_state_t chnStates[NOF_EDMA_CHANNELS]; /* array of DMA channel states */
static volatile bool dmaDone = false; /* set by DMA complete interrupt on DMA channel 2 */
static const uint8_t OneValue = 0xFF; /* value to clear or set the port bits */

void DMA_Transfer(uint8_t *transmitBuf, uint32_t nofBytes) {
  edma_transfer_config_t config;
  uint8_t channel;
  uint8_t res;

  ResetFTM(FTM0_IDX); /* reset FTM and prepare for DMA */

  /* DMA Muxing: Allocate EDMA channel request through DMAMUX */
  for (channel=0; channel<NOF_EDMA_CHANNELS; channel++) {
    res = EDMA_DRV_RequestChannel(channel, kDmaRequestMux0FTM0Channel0+channel, &chnStates[channel]);
    if (res==kEDMAInvalidChannel) { /* check error code */
      for(;;); /* ups!?! */
    }
  }
  /* Install callback for eDMA handler on last channel which is channel 2 */
  EDMA_DRV_InstallCallback(&chnStates[NOF_EDMA_CHANNELS-1], EDMA_Callback, NULL);

  /* prepare DMA configuration */
  config.srcLastAddrAdjust = 0; /* no address adjustment needed after last transfer */
  config.destLastAddrAdjust = 0; /* no address adjustment needed after last transfer */
  config.srcModulo = kEDMAModuloDisable; /* no address modulo (no ring buffer) */
  config.destModulo = kEDMAModuloDisable; /* no address modulo (no ring buffer) */
  config.srcTransferSize = kEDMATransferSize_1Bytes; /* transmitting one byte in each DMA transfer */
  config.destTransferSize = kEDMATransferSize_1Bytes; /* transmitting one byte in each DMA transfer */
  config.minorLoopCount = 1; /* one byte transmitted for each request */
  config.majorLoopCount = nofBytes; /* total number of bytes to send */
  config.destOffset = 0; /* do not increment destination address */

  config.srcAddr = (uint32_t)&OneValue; /* Bit set */
  config.destAddr = (uint32_t)&GPIO_PSOR_REG(PTD_BASE_PTR); /* Port Set Output register */
  config.srcOffset = 0; /* do not increment source address */
  PushDMADescriptor(&config, &chnStates[0], false); /* write configuration to DMA channel 0 */

  config.srcAddr = (uint32_t)transmitBuf; /* pointer to data */
  config.destAddr = (uint32_t)&GPIO_PDOR_REG(PTD_BASE_PTR); /* Port Data Output register */
  config.srcOffset = 1; /* increment source address by one byte after each transfer */
  PushDMADescriptor(&config, &chnStates[1], false); /* write configuration to DMA channel 1 */

  config.srcAddr = (uint32_t)&OneValue; /* Bit set */
  config.destAddr = (uint32_t)&GPIO_PCOR_REG(PTD_BASE_PTR); /* Port Clear Output register */
  config.srcOffset = 0; /* do not increment source address */
  PushDMADescriptor(&config, &chnStates[2], true); /* write configuration to DMA channel 2, with 'end' interrupt enabled */

  /* enable the DMA channels */
  for (channel=0; channel<NOF_EDMA_CHANNELS; channel++) {
    EDMA_DRV_StartChannel(&chnStates[channel]); /* enable DMA */
  }
  dmaDone = false; /* reset done flag */
  StartStopFTM(FTM0_IDX, true); /* start FTM timer to fire sequence of DMA transfers */
  do {
    /* wait until transfer is complete */
  } while(!dmaDone);
  StopFTMDMA(FTM0_IDX); /* stop FTM DMA transfers */
  for (channel=0; channel<NOF_EDMA_CHANNELS; channel++) {
    EDMA_DRV_StopChannel(&chnStates[channel]); /* stop DMA channel */
  }
  /* Release EDMA channel request through DMAMUX, otherwise events might still be latched! */
  for (channel=0; channel<NOF_EDMA_CHANNELS; channel++) {
    res = EDMA_DRV_ReleaseChannel(&chnStates[channel]);
    if (res!=kStatus_EDMA_Success) { /* check error code */
      for(;;); /* ups!?! */
    }
  }
}

One important part is the configuration of the TCD (Transfer Control Descriptor). I set up three descriptors, one for each DMA channel:

Channel 0: Writing a ‘1’ to the PSOR (Port Set Output) register.
Channel 1: Writing the data bit to the PDOR (Port Data Output) register.
Channel 2: Writing a ‘1’ to the PCOR (Port Clear Output) register.

The descriptors have several fields to configure the DMA transfer. Basically what I describe for the DMA transfers is “take this byte from this source address and write it to this destination address”. In addition I specify “how many bytes to read/write” and if some address calculations shall be performed for the source and destination address. In the next sections I explain the different settings:

In the eDMA it is possible to make a special adjustment at the end of the last transfer: as I do not need this for the WS2812, that setting is an offset of zero:

config.srcLastAddrAdjust = 0; /* no address adjustment needed after last transfer */
config.destLastAddrAdjust = 0; /* no address adjustment needed after last transfer */

The DMA address calculation can be configured to ‘wrap-around’ e.g. if using a ring buffer: I have it disabled as I do not need that functionality:

config.srcModulo = kEDMAModuloDisable; /* no address modulo (no ring buffer) */
config.destModulo = kEDMAModuloDisable; /* no address modulo (no ring buffer) */

The next setting is to specify how many bytes have to be transmitted in a single DMA transfer: I only need to write a single byte to the GPIO port:

config.srcTransferSize = kEDMATransferSize_1Bytes; /* transmitting one byte in each DMA transfer */
config.destTransferSize = kEDMATransferSize_1Bytes; /* transmitting one byte in each DMA transfer */

In the next setting I can specify the ‘minor’ and ‘major’ loop: that way I can ‘nest’ the DMA operations:

eDMA Multiple Loop Iteration

In my case I only need to write a single byte for each DMA request, so the minor loop counter is ‘1’. However, I need to write multiple bytes for the whole DMA operation (all bytes of transmitBuf[]), therefore the majorLoopCount is the total number of bytes:

  config.minorLoopCount = 1; /* one byte transmitted for each request */
  config.majorLoopCount = nofBytes; /* total number of bytes to send */

The next setting is to specify what should happen with the destination address. The destination address will be the GPIO port address, so no need to change this.

  config.destOffset = 0; /* do not increment destination address */

The above settings are all the same for all three DMA channels. What follows are the special settings to be used for each DMA channel.

DMA channel 0 will create the rising edge of the WS2812 DIN signal. If it were executed by the CPU, I would write it like this:

static const uint8_t OneValue = 0x01; /* value to clear or set the port bits */

GPIO_PSOR_REG(PTD_BASE_PTR) = OneValue;

Translated to the DMA descriptor it is this:

  config.srcAddr = (uint32_t)&OneValue; /* Bit set */
  config.destAddr = (uint32_t)&GPIO_PSOR_REG(PTD_BASE_PTR); /* Port Set Output register */
  config.srcOffset = 0; /* do not increment source address */

Next is DMA channel 1 which will write the data bit. In ‘normal’ code it would be this:

GPIO_PDOR_REG(PTD_BASE_PTR) = *transmitBuf; transmitBuf++;

In ‘DMA language’ it is this:

  config.srcAddr = (uint32_t)transmitBuf; /* pointer to data */
  config.destAddr = (uint32_t)&GPIO_PDOR_REG(PTD_BASE_PTR); /* Port Data Output register */
  config.srcOffset = 1; /* increment source address */

Lastly, like DMA channel 0, channel 2 writes a one to a GPIO register, this time to the Port Clear Output register:

  config.srcAddr = (uint32_t)&OneValue; /* Bit set */
  config.destAddr = (uint32_t)&GPIO_PCOR_REG(PTD_BASE_PTR); /* Port Clear Output register */
  config.srcOffset = 0; /* do not increment source address */

Each of the Descriptors is written to the hardware registers with this custom routine:

static void PushDMADescriptor(edma_transfer_config_t *config, edma_chn_state_t *chn, bool enableInt) {
  /* If only one TCD is required, only hardware TCD is required and user
   * is not required to prepare the software TCD memory. */
  edma_software_tcd_t temp[2]; /* make it larger so we can have a 32byte aligned address into it */
  edma_software_tcd_t *tempTCD = STCD_ADDR(temp); /* ensure that we have a 32byte aligned address */

  memset((void*) tempTCD, 0, sizeof(edma_software_tcd_t)); /* initialize temporary descriptor with zeros */
  EDMA_DRV_PrepareDescriptorTransfer(chn, tempTCD, config, enableInt, true); /* prepare and copy descriptor into temporary one */
  EDMA_DRV_PushDescriptorToReg(chn, tempTCD); /* write EDMA registers */
}

Did you notice that temp[2] variable? This is necessary to align the TCD to a 32 byte boundary. If the address of the TCD is not aligned to that boundary, a hard fault will happen :-(. So this routine allocates twice the amount of stack space, and the STCD_ADDR macro will point into that stack space and ensures it is 32byte aligned.
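The alignment trick itself is plain address arithmetic. Below is a minimal, generic sketch of rounding an address up to a 32-byte boundary. The names are hypothetical, and this is not necessarily how the SDK's STCD_ADDR macro is implemented.

```c
#include <stdint.h>

#define TCD_ALIGNMENT 32u /* required alignment of a TCD in bytes */

/* Round addr up to the next multiple of TCD_ALIGNMENT (addr is returned
 * unchanged if it is already aligned). Works because the alignment is a
 * power of two. */
uintptr_t tcd_align_up(uintptr_t addr) {
  return (addr + (TCD_ALIGNMENT - 1u)) & ~(uintptr_t)(TCD_ALIGNMENT - 1u);
}

/* Return a 32-byte aligned pointer inside buf. For an object of 'size'
 * bytes to fit behind the returned pointer, buf must be at least
 * size + TCD_ALIGNMENT - 1 bytes long. */
void *tcd_align_ptr(void *buf) {
  return (void *)tcd_align_up((uintptr_t)buf);
}
```

Rounding up moves the pointer by at most 31 bytes, which is why reserving space for two descriptors is enough to hold one aligned descriptor.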

💡 WARNING: The EDMA_DRV_ConfigLoopTransfer() function in the Kinetis SDK v1.2 might create a hard fault, because it does not do that special alignment.

DMA channel 0 and 1 are configured not to create any interrupts. Only channel 2 is configured with the third parameter to raise an interrupt at the end of the ‘major’ iteration (when all bytes are transmitted):

  PushDMADescriptor(&config, &chnStates[2], true); /* write configuration to DMA channel 2, and enable 'end' interrupt for it */

So I have to add a handler for DMA interrupt on channel 2, otherwise my application will end up in an unhandled interrupt. DMA2_IRQHandler() is the interrupt handler, and EDMA_DRV_IRQHandler() will call the callback EDMA_Callback():

/*! @brief Dma channel 2 ISR */
void DMA2_IRQHandler(void){
   EDMA_DRV_IRQHandler(2U); /* call SDK EDMA IRQ handler, this will call EDMA_Callback() */
}

void EDMA_Callback(void *param, edma_chn_status_t chanStatus) {
  (void)param; /* not used */
  (void)chanStatus; /* not used */
  dmaDone = true; /* set 'done' flag at the end of the major loop */
}

That handler I have to install with

  /* Install callback for eDMA handler on last channel which is channel 2 */
  EDMA_DRV_InstallCallback(&chnStates[NOF_EDMA_CHANNELS-1], EDMA_Callback, NULL);

With all the TCD settings pushed to the DMA channels, it is time to enable all the channels:

  /* enable the DMA channels */
  for (channel=0; channel<NOF_EDMA_CHANNELS; channel++) {
    EDMA_DRV_StartChannel(&chnStates[channel]); /* enable DMA */
  }

Then I reset the ‘done’ flag, start the FTM timer and wait until the transfer is done:

  dmaDone = false; /* reset done flag */
  StartStopFTM(FTM0_IDX, true); /* start FTM timer to fire sequence of DMA transfers */
  do {
    /* wait until transfer is complete */
  } while(!dmaDone);

After all bytes are sent, I stop the FTM timer, disable the channels and release the DMA channels:

  StopFTMDMA(FTM0_IDX); /* stop FTM DMA transfers */
  for (channel=0; channel<NOF_EDMA_CHANNELS; channel++) {
    EDMA_DRV_StopChannel(&chnStates[channel]); /* stop DMA channel */
  }
  /* Release EDMA channel request through DMAMUX, otherwise events might still be latched! */
  for (channel=0; channel<NOF_EDMA_CHANNELS; channel++) {
    res = EDMA_DRV_ReleaseChannel(&chnStates[channel]);
    if (res!=kStatus_EDMA_Success) { /* check error code */
      for(;;); /* ups!?! */
    }
  }

This completes the DMA transfer, and things can start over again with the next transfer.

“Wonderful and Colorful Things”

Time to try things out. The following program writes three WS2812B pixels: green, red and blue:

#include "fsl_device_registers.h"
#include "DMAPixel.h"

#define NEO_NOF_PIXEL       3
#define NEO_NOF_BITS_PIXEL 24
static uint8_t transmitBuf[NEO_NOF_PIXEL*NEO_NOF_BITS_PIXEL] =
    {
        /* pixel 0: */
        1, 1, 1, 1, 1, 1, 1, 1, /* green */
        0, 0, 0, 0, 0, 0, 0, 0, /* red */
        0, 0, 0, 0, 0, 0, 0, 0, /* blue */
        /* pixel 1: */
        0, 0, 0, 0, 0, 0, 0, 0, /* green */
        1, 1, 1, 1, 1, 1, 1, 1, /* red */
        0, 0, 0, 0, 0, 0, 0, 0,  /* blue */
        /* pixel 2: */
        0, 0, 0, 0, 0, 0, 0, 0, /* green */
        0, 0, 0, 0, 0, 0, 0, 0, /* red */
        1, 1, 1, 1, 1, 1, 1, 1  /* blue */
    };

int main(void) {
  uint8_t red, green, blue;

  DMA_Init();
  for (;;) {
    DMA_Transfer(transmitBuf, sizeof(transmitBuf));
  }
  /* Never leave main */
  return 0;
}
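Writing the buffer by hand gets tedious quickly. A small helper could fill it from color values instead. The following is only a sketch: neo_put_byte() and neo_set_pixel() are hypothetical names and not part of the project sources. It uses the green, red, blue order of the table above and sends the most significant bit of each color first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NEO_NOF_BITS_PIXEL 24

/* Write one 8-bit color component, most significant bit first, into eight
 * consecutive buffer bytes. Only bit 'lane' of each byte is changed. */
static void neo_put_byte(uint8_t *dst, unsigned lane, uint8_t value) {
  for (int bit = 7; bit >= 0; bit--, dst++) {
    if (value & (1u << bit)) {
      *dst |= (uint8_t)(1u << lane);
    } else {
      *dst &= (uint8_t)~(1u << lane);
    }
  }
}

/* Store the color of pixel 'index' on 'lane' in WS2812 order: green, red, blue */
void neo_set_pixel(uint8_t *buf, size_t index, unsigned lane,
                   uint8_t red, uint8_t green, uint8_t blue) {
  assert(buf != NULL && lane < 8u);
  uint8_t *p = buf + index * NEO_NOF_BITS_PIXEL;
  neo_put_byte(p, lane, green);
  neo_put_byte(p + 8, lane, red);
  neo_put_byte(p + 16, lane, blue);
}
```

Calling neo_set_pixel(transmitBuf, 0, 0, 0x00, 0xFF, 0x00), then the same for pixel 1 with red 0xFF and pixel 2 with blue 0xFF, produces the same buffer content as the hand-written table.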

Checking with the logic analyzer I can see that it takes 91.1 μs to send the data:

Timing to transmit data for three WS2812
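As a plausibility check for the measured value, the nominal transfer time can be estimated from the number of bits. The sketch below assumes a nominal WS2812 bit period of 1.25 μs (an assumption, not a value measured here), and the function name is made up.

```c
#include <stdint.h>

#define NEO_BITS_PER_PIXEL 24u
#define NEO_BIT_PERIOD_NS  1250u /* assumed nominal WS2812 bit time of 1.25 us */

/* Nominal time in nanoseconds to shift out the data for nofPixels pixels
 * on one lane (the reset/latch pause after the frame is not included) */
uint32_t neo_frame_time_ns(uint32_t nofPixels) {
  return nofPixels * NEO_BITS_PER_PIXEL * NEO_BIT_PERIOD_NS;
}
```

For the three pixels sent here, this gives 3 x 24 x 1250 ns = 90000 ns, or 90 μs, which is in line with the 91.1 μs measured with the logic analyzer.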

The following zooms into the first 8 bits (green) sent:

first 8 green bits

I can see as well the delay between the timer/DMA event and the time until the port bit actually has changed: it is around 0.2 μs:

DMA to GPIO Delay

But the timing for the ‘1’ and ‘0’ bits is within the specification :-):

WS2812 Bit 1 Timing

WS2812 Bit 0 Timing

And voilà, this is what I get on the NeoPixel Matrix: the first three LED’s in Green, Red and Blue :-):

Red, Green and Blue Color Pixels

Summary

I now have FTM with DMA working, and it bangs the bits out of the GPIO port, in one or multiple lanes. I’m using only one lane now, but it works the same way with multiple lanes. With the 256 KByte of RAM on the K64F, the number of WS2812 pixels I can drive is huge: I need 24 bytes per pixel if I’m using a single lane. So for an 8×8 matrix I need 1536 bytes. If I use eight 8×8 boards with 8 lanes (PTD0 to PTD7), I only need 3 bytes per pixel, so all 512 pixels fit into the same 1536 bytes 🙂

💡 I could pack all the 24 bits for pixel into three bytes and then make a multi-stage DMA transfer: unpack the bits and send it to the port. I have not thought that through, but maybe this would be something doable to reduce the amount of RAM needed for a single lane configuration.

This project uses DMA on a Freescale Kinetis device, and I tried my best to explain the approach used here. Still, there are a lot more features and possibilities with DMA. It takes some time to get familiar with DMA, but the capabilities are amazing :-).

I had to use a mixture of Freescale Kinetis SDK API, SDK HAL API and CMSIS register access macros. Freescale is promoting the Kinetis SDK, but this project again confirmed to me that the SDK alone does not cover all the needs of developing embedded applications: I still need CMSIS register access API. On the other side: there are some nice routines in the SDK and especially the HAL layer which makes things easier to use. But again as with everything: it takes time to learn all these things. And I hope that this article series can help you with that learning process.

The project sources are on GitHub here:
https://github.com/ErichStyger/mcuoneclipse/tree/master/Examples/KDS/FRDM-K64F120M/FRDM-K64F_NeoPixel_SDK

So, what could be next? I could describe/develop a ‘graphics’ driver for the WS2812 pixels? Or maybe that is something I leave to Manya? Post comments and let me know what you think :-).

Happy DMAing 🙂

33 thoughts on “Tutorial: Adafruit WS2812B NeoPixels with the Freescale FRDM-K64F Board – Part 5: DMA”

dsgmedia on August 5, 2015 at 08:20 said:

Hi,

A wonderful thing will be to have a DMA bit shift operation.

So, I was thinking that a callback after the second DMA where the current byte is right shifted one bit will take care of the RAM usage, it will be also easy to implement the graphic driver.

Still a lot of interrupts on the system side but let’s say with at 30fps there will be a burst of interrupts every 33ms for as long as the strip length * 30us. So for 100pixels it is around 3ms. So there is enough time for other operations.

And thank you for the DMA explanation. It was a nice refresh as I had to write about 5 years ago the DMA driver for a Freescale controller and I had a lot of fun with the 32byte alignment.

MFG.

- Erich Styger on August 5, 2015 at 09:14 said:
  
  Hi,
  yes, a bitshift operation would be awesome! The other thought I had was to consider using bitbanding (I have used that on Cortex-M3, but not on Cortex-M4). The problem with doing the bit shifting by the CPU is that it will take a considerable amount of CPU instructions: load, shift, store will take time. So while this probably is doable, it will increase the CPU load to reduce the amount of RAM needed. My counter argument to this is: if I need to optimize the amount of memory, then I very likely have lot of WS2812 LEDs. If I have a lot, I better organize them in lanes (say 8 lanes): using my GPIO+DMA approach will then use just one bit for each bit, so using the minimal memory. And it will not load the CPU with shift operations :-).
  And you are welcome about the DMA operation. The eDMA controller is a big and complex thing, it took me a while to learn it. And yes: I have run badly into that 32byte alignment problem too :-(.
  
gasstationwithoutpumps on August 5, 2015 at 08:28 said:

Wouldn’t it be simpler to have the line be handled like PWM and have the DMA on the overflow just transfer to the C0V register the duration of the high pulse (either 350ns or 900ns)? That would only take one channel, not three, though it takes a full byte per bit, not sharable across multiple outputs.

- Erich Styger on August 5, 2015 at 09:01 said:
  
  Good point :-). I did exactly this in my earlier version (see https://mcuoneclipse.com/2014/07/13/first-adafruit-neopixel-blinks-with-the-frdm-board/) on the FRDM-KL25Z. The issue with this approach is that it needs 16bits for each bit, so doubling the memory requirements. I need to write 16 bits to the PWM C0V register (I cannot only write the lower 8bits, it needs a 16bit write to latch the register). Additionally, I’m limited by the PWM pin: only few pins can be routed as timer output/PWM. Using a GPIO is much more universal, and I can have multiple lanes. I have 16 DMA channels, and only using three, and only during sending the bits. Overall, to me the version with GPIO and three DMA channels is more scalable and needs less RAM/memory.
  
  - gasstationwithoutpumps on August 5, 2015 at 16:21 said:
    
    I didn’t realize that the C0V register could not be written with a 1-byte transfer—that does make a big difference. I see that the reference manual does require that all bytes of CnV be written at the same time though.
    
  - gasstationwithoutpumps on August 5, 2015 at 16:42 said:
    
    Looking over your code again, I see that the same trick can be used on the FTM (or the TPM on the KL25Z) to get edge-aligned PWM on any GPIO channel, at a cost of using two timer channels and 2 DMA channels instead of just 1 timer channel and no DMA for the hard-wired PWM selections. Of course the KL25Z only has 4 DMA channels, so this is not very useful as a general PWM solution.
    
Michal on March 3, 2016 at 22:29 said:

Hello Erich,
Could you create an example in the future that transfers data over SPI with DMA (on the KL16 or KL26)? It would be very helpful for an OLED display or an external SPI flash memory 🙂

Tisham Dhar (@whatnick) on June 7, 2016 at 04:14 said:

Hi, I am aiming to migrate this to the K82F for my dice project. Do you have a guideline for the migration path? https://www.hackster.io/whatnick/smart-dice-the-physical-digital-rng-18ee03

- Erich Styger on June 7, 2016 at 08:00 said:
  
  Hi Tisham,
  have a look at https://mcuoneclipse.com/2016/05/22/nxp-flexio-generator-for-the-ws2812b-led-stripe-protocol/
  Erich
  
Matthias L. Jugel (@thinkberg) on July 16, 2016 at 11:18 said:

I ported the example to the K82F (KSDK 2.0) after going back and forth between the old SDK, the hardware and the new SDK. Unfortunately I can't use the FlexIO variant, which I would have preferred, because it needs some changes to the board wiring (next version maybe).

Erich: When the DMA writes to the port registers, it overwrites any settings already there. Do you know if this applies only to the byte we write or to the whole 32 bits?

I currently only need to drive two or maybe four RGB LEDs, so the bits are banged for only a very short time. I was wondering if there is a way to write the GPIO masked, so as not to toggle other bits.

Tutorial: Adafruit WS2812B NeoPixels with the Freescale FRDM-K64F Board – Part 5: DMA

- Erich Styger on July 16, 2016 at 11:57 said:
  
  Hi Matthias,
  the GPIO module has special registers to clear and to set bits: that way only the needed bits are cleared/set. These are 32bit registers, but the other bits are not affected. And normal bit banging will not work as it will be hard to keep the timing.
  Erich
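
  As a rough illustration of the difference described above, here is a small software model of a port's output latch. It is not the real peripheral and not code from the project: the struct and function names are hypothetical and only mimic the behaviour of the data output register (PDOR) and the set/clear registers (PSOR/PCOR).

  ```c
  #include <assert.h>
  #include <stddef.h>
  #include <stdint.h>

  /* Software model of one GPIO port's output latch (not the real hardware). */
  typedef struct {
      uint32_t pdor; /* current output value of all 32 pins */
  } gpio_model_t;

  /* Writing PDOR replaces every output bit of the port. */
  void write_pdor(gpio_model_t *p, uint32_t value)
  {
      assert(p != NULL);
      p->pdor = value;
  }

  /* Writing PSOR sets only the pins whose mask bit is 1. */
  void write_psor(gpio_model_t *p, uint32_t mask)
  {
      assert(p != NULL);
      p->pdor |= mask;
  }

  /* Writing PCOR clears only the pins whose mask bit is 1. */
  void write_pcor(gpio_model_t *p, uint32_t mask)
  {
      assert(p != NULL);
      p->pdor &= ~mask;
  }
  ```

  With an initial output value of 0xF0, setting and then clearing mask 0x01 leaves the upper bits untouched, while a plain data write of 0x01 would wipe them.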
  
  - Matthias L. Jugel (@thinkberg) on July 16, 2016 at 12:00 said:
    
    Oh, you’re right of course. What took me a while to realise, though, was your use of PTB0, which is the 0x1. I learned a lot in the process 🙂 Thanks!
    
  - Matthias L. Jugel (@thinkberg) on July 16, 2016 at 12:54 said:
    
    With the bit banging I was referring to the DMA, not doing it by hand.
    
    Also, regarding the FTM: wouldn’t it be possible to write the duty cycles via DMA to the FTM to create the necessary timing? For example, I could set an FTM to a 4000us period and then change the duty cycle accordingly to get a 0, a 1, or a low level for latching.
    
  - Matthias L. Jugel (@thinkberg) on July 16, 2016 at 13:59 said:
    
    I have to think more about all that, I had not read all comments…
    
Matthias L. Jugel (@thinkberg) on July 16, 2016 at 11:22 said:

I posted the wrong link: https://gist.github.com/thinkberg

Matthias L. Jugel (@thinkberg) on July 16, 2016 at 14:20 said:

One more question:

You allocate the TCDs in one subroutine: https://github.com/ErichStyger/mcuoneclipse/blob/master/Examples/KDS/FRDM-K64F120M/FRDM-K64F_NeoPixel_SDK/Sources/DMAPixel.c#L138

Isn’t this an issue after leaving the function, with potential stack corruption?

- Erich Styger on July 18, 2016 at 13:23 said:
  
  Hi Matthias,
  no, this should not be an issue. That pointer points to a struct, and it is the struct's values that are used in the DMA descriptor, not its address.
  
Matthias L. Jugel (@thinkberg) on July 17, 2016 at 13:01 said:

Well, I’ve updated the gist. One issue remains: after each transfer, it looks like there is a bit stream that sets the first LED to green. I have two LEDs, and even if I transfer just zeros, it will turn green. The Saleae is probably arriving tomorrow, so I can then have a look at what happens there.

- Erich Styger on July 18, 2016 at 08:29 said:
  
  Hi Matthias,
  if the first LED gets some data, most likely the timing is not correct. The logic analyzer will hopefully show you the problem. Additionally, make sure you use a good and fast level shifter to 5V: slow rise/fall times are problematic.
  Erich
  
  - Matthias L. Jugel (@thinkberg) on July 18, 2016 at 13:30 said:
    
    I think power-wise I am safe. However, it looks like there are two issues. First, the pulses seem too long (2.52us). Second, the first two bits of the second transmission look strange: a longer 1.56us pulse, then a low that is longer than usual (0.4us), and then a very short pulse (0.2us). The latter may be related to an unclean reset of the whole procedure.
    
    I guess I will first check the overall timing. Some of the calculations are done underneath in the KSDK 2.0.
    
    Btw, you don’t happen to have a student who would be interested in doing an internship, or even more, working on a secure IoT platform? Coming to Berlin would be awesome too 🙂
    
    - Erich Styger on July 18, 2016 at 14:51 said:
      
      There is right now the semester end/break/exam session. But I can ask around.
      
  - Matthias L. Jugel (@thinkberg) on July 20, 2016 at 12:03 said:
    
    I got it working, but did not find the cause of the glitch. I fixed it for now by disabling the RGB DIN GPIO before EDMA_StartTransfer and enabling it right after. It’s really strange.
    
    - Erich Styger on July 20, 2016 at 16:27 said:
      
      Hi Matthias,
      I have seen something similar (see section “DMA Transfer”, point 10): I had to de-mux the channels. Probably it is related to what you have seen?
      
  - Simon Haines on September 9, 2016 at 09:08 said:
    
    I had this too. The DMA transfer was starting too early because the TCD’s CSR[START] bit was set (not sure what this maps to in the SDK). This was raising the data line before the FlexTimer started triggering transfers, and causing the high bit of the green pulse train to be set. Clearly visible on a logic probe.
    
Simon Haines on September 9, 2016 at 08:58 said:

I recently implemented this technique on a Kinetis MK20FX (which does not have an SDK) for a bank of 8 LEDs, and would like to share my findings for anyone in a similar situation. First off, I’d like to give a big thanks to Erich for this, otherwise I would still be suspending interrupts (for a very long time) and bit-banging the LED data line, which always gave inconsistent results, especially under temperature variations.

Without an SDK there is a lot of tedious tweaking of the DMA channel configuration. For each of the three configured DMA channels, you’ll want to set the CSR[DREQ] bit to disable requests at the end of the major loop, otherwise you’ll only get one DMA major loop (clearing the ERQ[channel] bit at the end of the major loop is too late). Don’t be confused by the CSR[START] bit: you will not want to set this, but rather let the FlexTimer initiate the DMA request. Also, in the DMA IRQ handler for the third channel, you will want to load the CINT[channel] register with your third channel number and write to the CDNE[CADN] bit to clear the ‘done’ status of each DMA channel. If you are using a different timer (in my case, FlexTimer 3), pay attention to the DMAMUX banks and ensure you are multiplexing the FlexTimer sources in the right way. For example, I used FTM3 channels 1-3, which have source identifiers 26-28 in DMAMUX1, which in turn are mapped to DMA channels 16-18.

Finally, my board has multiple GPIO peripherals on the same port as the LED data line, so simply writing a value to the port’s PDOR register could clobber the other peripherals co-existing on the port. My initial approach was to use bitband aliasing, but the DMA engine cannot access that part of memory and produced destination bus errors. As per this example, I also used an 8 * 24 byte buffer (with one byte for each bit to be written to the 8 LEDs), but pre-shifted the values to align with the pin number of the data line. For example, my LED data pin is port C.2, so each byte (which represents a single bit) was shifted left twice. Additionally, the bit was flipped so that the array contained a 0 byte for each 1 bit and a 1 byte for each 0 bit (before shifting). With this in place, I could simply replace the destination address of the second DMA channel with the address of port C’s PCOR register. Now when a 1 byte is transferred, the C.2 pin is cleared but the other peripherals remain untouched. I suppose you don’t have to flip the bit and can use the PSOR destination instead, but I did not test that.
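
The buffer preparation described in the paragraph above might look roughly like the following sketch. This is not the commenter's code: the helper name `build_pcor_bytes` and the `uint32_t` color argument are assumptions for illustration. It produces the 24 bytes for one LED, so a chain of 8 LEDs would call it 8 times to fill the 8 * 24 byte buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BITS_PER_LED 24u

/* Hypothetical helper: fill 24 bytes for one LED, destined for PCOR.
 * A data bit of 0 yields (1 << pin): the pin is cleared at the '0' time.
 * A data bit of 1 yields 0: nothing is cleared, so the pin stays high
 * until the later DMA transfer ends the pulse. */
void build_pcor_bytes(uint32_t color, unsigned pin, uint8_t out[BITS_PER_LED])
{
    assert(pin < 8u); /* one byte per bit, so only pins 0..7 fit */
    for (size_t i = 0; i < BITS_PER_LED; i++) {
        uint32_t bit = (color >> (BITS_PER_LED - 1u - i)) & 1u; /* MSB first */
        out[i] = (uint8_t)((bit ^ 1u) << pin); /* flip, then shift to the pin */
    }
}
```

For pin 2, every 0 bit of the color becomes the byte 0x04 and every 1 bit becomes 0x00, so only the C.2 bit is ever written to PCOR.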

I am happy to share my code but won’t post it in this comment which is already too long. Once again thanks for this, I could never have conceived of this approach–let alone implemented it–without the great help of these articles.

Quinn Halpin on May 13, 2017 at 21:06 said:

Hi, I have been trying to follow your tutorial and I am struggling a lot: I can get the number of pixels I want to turn on, but they don’t turn off, and they are all white even if I set the color to red or blue. Do you have any idea what could cause this? I don’t have the equipment to check the timing. I have tried many things (changing the wires, adding a capacitor in parallel, powering externally). I am very lost, and any help is much appreciated.

- Erich Styger on May 14, 2017 at 07:10 said:
  
  Probably your timing is wrong. You need to hook up a logic analyzer or oscilloscope to the data line to see what is actually transmitted. Then you will know what the problem is.
  