Sequential DMA bursts improve NIC/RAM usage thanks to the basic NIC hardware optimizations available when performing in-order sequential accesses. This can be further enforced with the IPU DMA locking mechanism which basically prevents any other IP to access the interconnect for a longer time while performing up to 8 sequential DMA bursts. The drawback is a lower availability for short time periods and delayed accesses which may cause problem with latency-sensible systems (typically, the network might suffer from high drop rates). This is even more visible with larger displays requiring even more RAM bandwidth. Issues have been observed on IMX6Q. The setup featured a 60Hz 1024x768 LVDS display just showing a static picture (thus no CPU usage, only background DMA bringing the picture to the display engine). When performing full speed iperf3 uplink tests with the FEC, almost no drop was observed, whereas the drop would raise above 50% when limiting the bandwidth to 1Mb/s (on a 100Mb/s link). The exact same test with the display pipeline disabled would show little to no drop. The LP-DDR3 chip on the module would allow up to ~53MiB each 1/60th of a second, and the display pipeline consume approximately ~10MiB of this bandwidth, and thus be active 20% of the time on each time slot. One particular feature of the IPU DMA controller (IDMAC) is the ability to serialize DMA bursts and to lock further interconnect accesses when doing so. Experimentally, disabling the locking lead to a drop rate from 50% down to 10%. A few more % could be earned by setting the burst number to 1. It seems this huge difference could be explained by a possible hardware conflict between the locking feature and some QoS logic. Indeed, on IMX6Q, the NIC-301 manages priorities and by default will elect ENET's requests (priority 2) above IPU's requests (priority 0). But the QoS seems to only be valid above a certain threshold, which is: 4 consequent DMA bursts in the case of the IPU. It was indeed observed that tweaking the number of bursts to be lowered from 8 to 4 would lead to a significant increase in the Ethernet transfers stability. IOW, it looks like when the display pipeline performs DMA transfers, incoming DMA requests from other master devices on the interconnect are delayed too much (or canceled). I have no clue to explain why on the Ethernet MAC side some uDMA transfers would never reach completion, especially without notification nor any error. All uplink transfers are properly queued at the FEC level and more importantly, the corresponding interrupts are fired upon "proper transmission" and report no error whatsoever (note: there is no actual way to know the uDMA internal controller could not fetch the data, only MAC errors could be reported at this stage). As a solution, we might want to prevent these DMA bursts from being queued together. Maybe the IMX6Q is primarily used for its graphics capabilities, but when the network (and other RAM consuming subsystem) also matter, it may be relevant to apply this workaround in order to help them fetching from RAM more reliably. Signed-off-by: Miquel Raynal <miquel.raynal@xxxxxxxxxxx> --- Hello, This really is an RFC as the bug was also observed on v6.5 but the fix proposed here was written and tested on a v4.14 kernel. I want to discuss the approach and ideally get some feedback from imx6 experts who know the SoC internals before publishing a clean series. There is a lot of guessing in this workaround, besides the experimental measures I managed to do. I would be glad if someone could sched any light or involve knowledgeable people in this conversation. The initial report was there and mainly focused on the network subsystem: https://lore.kernel.org/netdev/18b72fdb-d24a-a416-ffab-3a15b281a6e0@xxxxxxxxxxx/T/#md265d6da81b8fb6b85e3adbb399bcda79dfc761c In this thread I made wrong observations because for speeding up my test cycles, I dropped the support for: DRM, SND, USB as these subsystems seemed totally irrelevant. It actually had a strong impact. In the end, I really think there is something wrong with the locking of IPU DMA bursts when mixed with the QoS of the NIC. Any comments welcome! [I know this deserves a DT binding entry, but we are not there yet] Thanks, Miquèl drivers/gpu/drm/imx/ipuv3-plane.c | 27 ++++++++++++++++++++++++--- drivers/gpu/drm/imx/ipuv3-plane.h | 1 + 2 files changed, 25 insertions(+), 3 deletions(-) diff --git a/drivers/gpu/drm/imx/ipuv3-plane.c b/drivers/gpu/drm/imx/ipuv3-plane.c index cf98596c7ce1..0492df96ac3e 100644 --- a/drivers/gpu/drm/imx/ipuv3-plane.c +++ b/drivers/gpu/drm/imx/ipuv3-plane.c @@ -496,8 +496,8 @@ static int ipu_chan_assign_axi_id(int ipu_chan) } } -static void ipu_calculate_bursts(u32 width, u32 cpp, u32 stride, - u8 *burstsize, u8 *num_bursts) +static void ipu_calculate_bursts(struct ipu_plane *ipu_plane, u32 width, u32 cpp, + u32 stride, u8 *burstsize, u8 *num_bursts) { const unsigned int width_bytes = width * cpp; unsigned int npb, bursts; @@ -509,6 +509,11 @@ static void ipu_calculate_bursts(u32 width, u32 cpp, u32 stride, } *burstsize = npb; + if (ipu_plane->no_sequential_dma_bursts) { + *num_bursts = 1; + return; + } + /* Maximum number of consecutive bursts without overshooting stride */ for (bursts = 8; bursts > 1; bursts /= 2) { if (round_up(width_bytes, npb * cpp * bursts) <= stride) @@ -608,7 +613,7 @@ static void ipu_plane_atomic_update(struct drm_plane *plane, width = drm_rect_width(&state->src) >> 16; height = drm_rect_height(&state->src) >> 16; info = drm_format_info(fb->format->format); - ipu_calculate_bursts(width, info->cpp[0], fb->pitches[0], + ipu_calculate_bursts(ipu_plane, width, info->cpp[0], fb->pitches[0], &burstsize, &num_bursts); ipu_cpmem_zero(ipu_plane->ipu_ch); @@ -730,6 +735,7 @@ struct ipu_plane *ipu_plane_init(struct drm_device *dev, struct ipu_soc *ipu, enum drm_plane_type type) { struct ipu_plane *ipu_plane; + struct device_node *dn; int ret; DRM_DEBUG_KMS("channel %d, dp flow %d, possible_crtcs=0x%x\n", @@ -745,6 +751,21 @@ struct ipu_plane *ipu_plane_init(struct drm_device *dev, struct ipu_soc *ipu, ipu_plane->dma = dma; ipu_plane->dp_flow = dp; + /* + * Sequential DMA bursts improve NIC/RAM usage thanks to the basic NIC + * hardware optimizations available when performing in-order sequential + * accesses. This can be further enforced with the IPU DMA locking + * mechanism which basically prevents any other IP to access the + * interconnect for a longer time while performing up to 8 sequential + * DMA bursts. The drawback is a lower availability for short time + * periods and delayed accesses which may cause problem with + * latency-sensible systems (typically, the network might suffer from + * high drop rates). This is even more visible with larger displays + * requiring even more RAM bandwidth. + */ + dn = of_find_compatible_node(NULL, NULL, "fsl,imx6q-ipu"); + ipu_plane->no_sequential_dma_bursts = of_property_read_bool(dn, "no-sequential-dma-bursts"); + ret = drm_universal_plane_init(dev, &ipu_plane->base, possible_crtcs, &ipu_plane_funcs, ipu_plane_formats, ARRAY_SIZE(ipu_plane_formats), diff --git a/drivers/gpu/drm/imx/ipuv3-plane.h b/drivers/gpu/drm/imx/ipuv3-plane.h index e563ea17a827..743e92d5a940 100644 --- a/drivers/gpu/drm/imx/ipuv3-plane.h +++ b/drivers/gpu/drm/imx/ipuv3-plane.h @@ -25,6 +25,7 @@ struct ipu_plane { int dma; int dp_flow; + bool no_sequential_dma_bursts; bool disabling; }; -- 2.34.1