On 06/11/2023 11:46, Jocelyn Falempe wrote:
On 23/10/2023 10:30, Jocelyn Falempe wrote:
On 20/10/2023 14:06, Thomas Zimmermann wrote:
(cc'ing lkml for feedback)
Hi Jocelyn
Am 19.10.23 um 15:55 schrieb Jocelyn Falempe:
We found a regression in v5.10 on a real-time server, using the
rt-kernel and the mgag200 driver. It's a really specialized
workload, with a <10us latency expectation on isolated cores.
After upgrading to v5.10, the real-time tasks missed their <10us
latency target whenever something printed on the screen (fbcon or printk).
I'd like to hear the opinion of the RT-devs on this patch. Because
AFAIK we never did such a workaround in other drivers. And AFAIK
printk is a PITA anyway.
Most other drivers use DMA, which means this workaround can't apply
to them.
IMHO if that RT system cannot handle differences in framebuffer
caching, it's under-powered. It's just a matter of time until
something else changes and the problem returns. And (honest question)
as it's an x86-64, how do they handle System Management Mode?
I think it's not big news that the Matrox G200 from 1999 is
under-powered.
I was also a bit surprised that flushing the cache would have such
an effect on latency. The tests we are doing can run for 24h with the
workaround without any interrupt taking more than 10us. Without the
workaround, every ~30s an interrupt misses its 10us target.
The regression has been bisected to 2 commits:
0b34d58b6c32 ("drm/mgag200: Enable caching for SHMEM pages")
4862ffaec523 ("drm/mgag200: Move vmap out of commit tail")
The first one changed the system memory framebuffer from Write-Combine
to the default caching.
Before the second commit, the mgag200 driver used to unmap the
framebuffer after each frame, which implicitly does a cache flush.
Both regressions are fixed by the following patch, which forces a
cache flush after each frame, reverting to almost v5.9 behavior.
With that second commit, we essentially never unmap an active
framebuffer console. But with commit
359c6649cd9a ("drm/gem: Implement shadow-plane {begin, end}_fb_access with vmap")
we now again unmap the console framebuffer after the pageflip happened.
So how does the latest kernel behave with respect to the problem?
The regression was found when upgrading the server from v5.4 to v5.14,
so we didn't test with later kernels.
We will test with v6.3 (which should have 359c6649cd9a) and see how
it behaves.
I don't have a clear explanation, but when testing with v6.3, even
forcing Write-Combine doesn't fix the latency issue. So forcing the
cache flush is still needed.
Also, some systems use isolated CPUs to handle RT tasks, but with a
standard kernel (so without CONFIG_PREEMPT_RT).
So I'm wondering if we could use a kernel module parameter for this,
so that users who want to achieve low latency can opt in?
Something like mgag200.force_cache_flush=1 or mgag200.low_latency=1?
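As a rough idea of the opt-in, a minimal sketch of such a parameter
(the name "low_latency" and its description are assumptions, not
actual mgag200 code):

	/* Hypothetical opt-in knob; name and semantics are an
	 * assumption, not merged mgag200 code. */
	static bool low_latency;
	module_param(low_latency, bool, 0444);
	MODULE_PARM_DESC(low_latency,
			 "Trade framebuffer throughput for bounded latency (default: false)");

The driver would then test low_latency wherever it decides whether to
flush (or how to map VRAM).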
Hi,
I now have access to a server that reproduces the issue, and I was
able to test different workarounds.
So it is definitely related to the "Write Combine" mode of the MGA
internal RAM. If I comment out the two lines that enable WC:
https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/mgag200/mgag200_drv.c#L150,
then the latency is <10us (but performance is worse: from 20ms to
87ms to draw a full frame).
I also tried to flush the VRAM using:

    drm_clflush_virt_range(mdev->vram + clip->y1 * fb->pitches[0],
                           drm_rect_height(clip) * fb->pitches[0]);

That lowers the latency to ~20us, but it's not enough.
I tried "sfence", which I thought would flush the CPU's WC buffers,
but it has no effect in practice.
I think I can send a new patch to not map the VRAM as Write-Combine,
either if CONFIG_PREEMPT_RT is set or if a module parameter is set.
What do you think is the best approach?
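A sketch of what the conditional mapping might look like, assuming
the driver picks between a write-combined and an uncached ioremap of
the VRAM (simplified, not the actual mgag200 call sites):

	/* Hypothetical sketch: fall back to an uncached VRAM mapping
	 * on RT kernels. Not the actual mgag200 code. */
	if (IS_ENABLED(CONFIG_PREEMPT_RT))
		mdev->vram = devm_ioremap(dev, start, len);	/* uncached */
	else
		mdev->vram = devm_ioremap_wc(dev, start, len);	/* write-combined */

The module-parameter variant would simply OR its value into the
condition.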
My tests setup:
I either flood the console with "cat /dev/urandom | base64" from the
BMC, or write to fb0 with "while true; do dd if=/dev/urandom
of=/dev/fb0; done",
then I run:
cd /sys/kernel/debug/tracing
echo 0 > tracing_on
echo 950000 > hwlat_detector/width
echo hwlat > current_tracer
echo 10 > tracing_thresh
echo 1 > tracing_on
sleep 300
cat trace
echo 0 > tracing_on
echo nop > current_tracer
Test Results:
V6.6 (~40us latency)
# tracer: hwlat
#
# entries-in-buffer/entries-written: 6/6 #P:56
#
# _-----=> irqs-off/BH-disabled
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / _-=> migrate-disable
# |||| / delay
# TASK-PID CPU# ||||| TIMESTAMP FUNCTION
# | | | ||||| | |
<...>-8795 [001] dn... 613.749741: #1 inner/outer(us): 41/41 ts:1702045267.016266287 count:9976
<...>-8804 [000] dn... 675.201727: #2 inner/outer(us): 40/40 ts:1702045328.469859973 count:6506
<...>-8804 [000] dn... 731.435258: #3 inner/outer(us): 40/41 ts:1702045384.704861049 count:9578
<...>-8804 [000] d.... 787.684533: #4 inner/outer(us): 41/41 ts:1702045440.955603974 count:22591
<...>-8804 [000] d.... 844.303041: #5 inner/outer(us): 41/41 ts:1702045497.575594006 count:33324
<...>-8804 [000] d.... 900.494844: #6 inner/outer(us): 40/40 ts:1702045553.768864888 count:1924
V6.6 + clflush mdev->vram (~20us latency)
# tracer: hwlat
#
# entries-in-buffer/entries-written: 3/3 #P:56
#
# _-----=> irqs-off/BH-disabled
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / _-=> migrate-disable
# |||| / delay
# TASK-PID CPU# ||||| TIMESTAMP FUNCTION
# | | | ||||| | |
<...>-8983 [001] d.... 534.595415: #1 inner/outer(us): 21/20 ts:1702024432.844243126 count:3018
<...>-8983 [000] dn... 758.560453: #2 inner/outer(us): 21/21 ts:1702024656.824367474 count:22238
<...>-8983 [000] dn... 815.117057: #3 inner/outer(us): 21/21 ts:1702024713.373220580 count:26455
V6.6 + no Write Combine for VRAM (<10us latency)
# tracer: hwlat
#
# entries-in-buffer/entries-written: 0/0 #P:56
#
# _-----=> irqs-off/BH-disabled
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / _-=> migrate-disable
# |||| / delay
# TASK-PID CPU# ||||| TIMESTAMP FUNCTION
# | | | ||||| | |
Best regards,
--
Jocelyn