Peripheral access to system memory without going through the processor. The processor interfaces with a DMA controller, which generates an interrupt when the peripheral operation is complete. The operation may be requested by the host or executed asynchronously by the peripheral.

The core problem that DMA is trying to solve is bus management. In the classic example the DMA controller arbitrates ownership of the bus between the processor and peripherals. Really, the notion of a discrete DMA controller is outdated and the real scope of the problem is bus design.

In a bus-mastering system, or burst mode, the peripheral is granted control of the memory bus for some period. Alternatively, in cycle-stealing mode the DMA controller requests the bus from the processor for each word that the peripheral device wants to transfer.
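As a sketch of how that mode choice might look to software, here is a hypothetical memory-mapped DMA controller in C; every register name and bit position below is invented for illustration.

    #include <stdint.h>

    /* Register layout for a hypothetical memory-mapped DMA controller. */
    struct dma_regs {
        volatile uint32_t src;  /* source address */
        volatile uint32_t dst;  /* destination address */
        volatile uint32_t len;  /* transfer length in bytes */
        volatile uint32_t ctrl; /* control/mode register */
    };

    #define DMA_CTRL_START  (1u << 0)
    #define DMA_CTRL_BURST  (1u << 1) /* set: hold the bus for the whole transfer;
                                         clear: cycle-steal one word at a time */
    #define DMA_CTRL_IRQ_EN (1u << 2) /* interrupt the processor on completion */

    void dma_start(struct dma_regs *dma, uint32_t src, uint32_t dst,
                   uint32_t len, int burst)
    {
        dma->src  = src;
        dma->dst  = dst;
        dma->len  = len;
        dma->ctrl = DMA_CTRL_START | DMA_CTRL_IRQ_EN |
                    (burst ? DMA_CTRL_BURST : 0);
    }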

DMA controllers may also provide scatter/gather functionality, where a buffer scattered across physical memory appears contiguous to the device. This reduces the number of transfer setups the processor needs to perform; see the sketch below.

They may also support memory-to-memory operations.
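To make scatter/gather concrete, here is a software model of the descriptor chain such a controller walks; the descriptor layout is invented for illustration (real hardware, and Linux's struct scatterlist, differ in detail).

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    /* A hypothetical scatter/gather descriptor. The controller follows the
     * chain, so a buffer scattered across pages looks contiguous to the
     * device and the processor sets up one transfer instead of many. */
    struct sg_desc {
        const void *addr;      /* address of this chunk (simplified) */
        size_t len;            /* length of this chunk in bytes */
        struct sg_desc *next;  /* next descriptor, or NULL at end of chain */
    };

    /* What the hardware does, modeled in software: stream each chunk in order. */
    size_t sg_copy_to_device(const struct sg_desc *d, uint8_t *dev_buf)
    {
        size_t total = 0;
        for (; d != NULL; d = d->next) {
            memcpy(dev_buf + total, d->addr, d->len);
            total += d->len;
        }
        return total;
    }

    int main(void)
    {
        uint8_t a[8] = "AAAAAAA", b[8] = "BBBBBBB", dev[16];
        struct sg_desc d1 = { b, sizeof b, NULL };
        struct sg_desc d0 = { a, sizeof a, &d1 };
        return sg_copy_to_device(&d0, dev) != sizeof dev; /* expect 16 bytes */
    }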

In PCI there is no physical DMA controller or third-party DMA. Instead each device may become a bus master (i.e. first-party DMA). Multi-master configurations are less deterministic and require arbitration.

Types of addresses

User-space processes each have their own virtual address space, while the kernel must also deal with the processor's physical addresses (for DMA, among other things). Except in small processors, a memory management unit (MMU) performs the translation between virtual and physical addresses.

The kernel technically has its own address space of logical addresses, which are normally just offset from physical addresses by a constant. There are also kernel virtual addresses, a superset that includes logical addresses as well as mappings (such as those returned by vmalloc) with no fixed offset from physical memory.
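A rough userspace illustration of that constant offset, using a made-up PAGE_OFFSET value (the Linux kernel's __pa()/__va() helpers compute essentially this for logical addresses):

    #include <stdint.h>
    #include <stdio.h>

    /* Illustration only: PAGE_OFFSET is a made-up value, not any
     * particular architecture's. */
    #define PAGE_OFFSET 0xffff800000000000ULL

    uint64_t pa(uint64_t logical)  { return logical - PAGE_OFFSET; } /* like __pa() */
    uint64_t va(uint64_t physical) { return physical + PAGE_OFFSET; } /* like __va() */

    int main(void)
    {
        uint64_t phys = 0x80000000ULL; /* start of DDR, say */
        printf("physical 0x%llx <-> logical 0x%llx\n",
               (unsigned long long)phys, (unsigned long long)va(phys));
        return pa(va(phys)) != phys; /* round-trips by construction */
    }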

Why is virtual address space split between user-space and the kernel? The portion of virtual address space used by the kernel includes kernel code and the remainder is used to map physical memory. How do logical addresses relate to this situation?

In a system with 4096-byte pages, a physical memory address consists of 12 bits that define an offset into a page, while the remaining bits specify the page (frame) number.
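The split is easy to compute; the constants below mirror the kernel's PAGE_SHIFT/PAGE_SIZE definitions for 4096-byte pages.

    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_SHIFT 12                 /* 4096-byte pages */
    #define PAGE_SIZE  (1ULL << PAGE_SHIFT)

    int main(void)
    {
        uint64_t phys   = 0x80001a2cULL;
        uint64_t pfn    = phys >> PAGE_SHIFT;      /* page (frame) number */
        uint64_t offset = phys & (PAGE_SIZE - 1);  /* low 12 bits */
        printf("pfn=0x%llx offset=0x%llx\n",
               (unsigned long long)pfn, (unsigned long long)offset);
        return 0;
    }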

The global address space of a processor is defined by its address map. For example, the Zynq UltraScale+ MPSoC (which contains 64-bit ARM Cortex-A53 cores) uses a hierarchy of inclusive address maps.1 That includes a 32-bit (4 GB) map, a 36-bit (64 GB) map and a 40-bit (1 TB) map. DDR low (2 GB) appears in the first map, while DDR high (32 GB) appears in the second. PCIe is divided across all three. This layout exists both because the heterogeneous system includes 32-bit Cortex-R5 cores and to maintain compatibility with 32-bit software written for the earlier 32-bit Zynq-7000.

Bus addresses often match the physical addresses used by the processor, but they can differ, for example when an IOMMU sits between the device and memory.

The ZynqMP uses ARM's IOMMU, called the System MMU (SMMU). The SMMU isolates peripherals to protect system memory and provides translation for devices with limited addressing capability.
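In Linux this distinction shows up directly in the DMA mapping API, which hands drivers a dma_addr_t (a bus address) that only sometimes equals the physical address. A minimal sketch, assuming a struct device *dev from a hypothetical driver:

    #include <linux/dma-mapping.h>
    #include <linux/device.h>
    #include <linux/errno.h>

    static int start_transfer(struct device *dev, void *buf, size_t len)
    {
        dma_addr_t bus_addr;

        /* Declare the device's addressing capability (32-bit here); an
         * IOMMU such as the SMMU can then translate so the device still
         * reaches buffers above 4 GB. */
        if (dma_set_mask_and_coherent(dev, DMA_BIT_MASK(32)))
            return -EIO;

        bus_addr = dma_map_single(dev, buf, len, DMA_TO_DEVICE);
        if (dma_mapping_error(dev, bus_addr))
            return -ENOMEM;

        /* ... program bus_addr (not the kernel virtual or physical
         * address) into the device and wait for its interrupt ... */

        dma_unmap_single(dev, bus_addr, len, DMA_TO_DEVICE);
        return 0;
    }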

The base x86 architecture does not include an IOMMU; one is added by Intel VT-d and AMD-Vi, which are designed to give virtual machines direct access to peripherals like Ethernet controllers, graphics cards and persistent storage.

The ADSP-SC596/8 contain a 64-bit Cortex-A55 core and one or two 32-bit SHARC cores. The Dynamic Memory Controller (DMC) has a 16-bit wide bus to DDR3 memory and can map up to 1 GiB of DDR3 into a region within 0x8000 0000 to 0xC000 0000. The top 1 GiB of address space is reserved. Although the map provides room for 1 GiB (or potentially 2 GiB if the reserved space were used), the ADSP-SC59x is limited to 512 MB of DDR because the DMC only supports a 16-bit bus to the DDR and does not support multiple ranks of DDR modules (it has only one chip select). Additionally, 1 GiB DDR3 modules are very expensive.

Since the A55 sits in what is effectively a 32-bit system, it may make sense to run a 32-bit userspace, trading some performance for lower memory use.

See Heterogeneous Memory Management (HMM) documentation in the kernel for more information.

PCIe

A PCIe link is the physical connection between two devices and consists of lanes, where each lane is a differential signal pair in each direction. So 32 lanes require 128 signals: 2 signals * 2 directions * 32 lanes. Instead of a separate clock signal, the receiver recovers the clock from transitions in the data stream using a PLL; the line encoding guarantees enough transitions.
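A quick check of that arithmetic for the standard link widths:

    #include <stdio.h>

    /* Each lane is one differential pair (2 wires) per direction (x2). */
    unsigned pcie_signals(unsigned lanes)
    {
        return 2u * 2u * lanes;
    }

    int main(void)
    {
        for (unsigned lanes = 1; lanes <= 32; lanes *= 2)
            printf("x%-2u link: %3u signals\n", lanes, pcie_signals(lanes));
        return 0;
    }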


  1. MPSoC address map on the Open Hardware Repository