I've done some experiments to try to locate where the delay is, and it does seem that reading the source data is simply slow. If I reduce the SPI bus speed, the delay between bytes (measured in SPI clocks) gets shorter: at 10MHz there are ~24 clocks of idle, but at 1MHz there are only ~4. The DMA controller in the RZ/A1 can apparently read a long from the source in one transaction and feed it to the destination as 4 byte writes, so I'm thinking maybe I can set up the DMA controller to read 2 longs and write them as 8 bytes in one go, so that each DMA transaction fills the SPI controller's FIFO.
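To put rough numbers on the measurements above, here is a small C sketch that converts the idle gap from SPI clocks to time and estimates the effective throughput. The function names and the `bytes_per_gap` parameter are made up for illustration; nothing here touches the RZ/A1 registers. The `bytes_per_gap` model assumes the gap is paid once per DMA transaction and does not grow when the transaction moves more bytes, which I have not verified.

```c
#include <assert.h>
#include <stdint.h>

#define BITS_PER_BYTE 8u

/* Idle gap in nanoseconds, given the gap measured in SPI clocks. */
static double idle_gap_ns(uint32_t spi_hz, uint32_t idle_clocks)
{
    assert(spi_hz > 0);
    return (double)idle_clocks * 1e9 / (double)spi_hz;
}

/*
 * Fraction of bus time spent actually shifting data.
 * bytes_per_gap is hypothetical: how many bytes go out back to back
 * before one idle gap occurs (1 matches the measured behaviour).
 */
static double bus_utilisation(uint32_t idle_clocks, uint32_t bytes_per_gap)
{
    assert(bytes_per_gap > 0);
    double data_clocks = (double)BITS_PER_BYTE * (double)bytes_per_gap;
    return data_clocks / (data_clocks + (double)idle_clocks);
}

/* Effective throughput in bytes per second under the same model. */
static double effective_bytes_per_sec(uint32_t spi_hz, uint32_t idle_clocks,
                                      uint32_t bytes_per_gap)
{
    assert(spi_hz > 0);
    return (double)spi_hz * bus_utilisation(idle_clocks, bytes_per_gap)
           / (double)BITS_PER_BYTE;
}
```

With the approximate figures above, `idle_gap_ns(10000000, 24)` works out to about 2.4 us and `idle_gap_ns(1000000, 4)` to about 4 us, so the gap is a few microseconds at both speeds rather than scaling with the SPI clock. At 10MHz with one gap per byte the bus is only busy 8 of every 32 clocks (25%). If an 8-byte transaction really did pay the same gap only once, `bus_utilisation(24, 8)` would be 64 of 88 clocks, roughly 73%, but that is a best case under the assumption stated above.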