Re: [v2] drm/mgag200: Add a workaround for low-latency

Jocelyn Falempe <jfalempe@xxxxxxxxxx> · Tue, 12 Mar 2024 15:25:08 +0100

On 12/03/2024 13:56, Sui Jingfeng wrote:
Hi,

Interesting patch! I know this patch has already been merged.
While studying it, I have a few questions.

On 2024/2/8 17:51, Jocelyn Falempe wrote:
We found a regression in v5.10 on a real-time server, using the
rt-kernel and the mgag200 driver. It's a really specialized
workload, with a <10us latency expectation on an isolated core.
After v5.10, the real-time tasks missed their <10us latency target
when something prints on the screen (fbcon or printk).

The regression has been bisected to 2 commits:
commit 0b34d58b6c32 ("drm/mgag200: Enable caching for SHMEM pages")
commit 4862ffaec523 ("drm/mgag200: Move vmap out of commit tail")

The first one changed the system memory framebuffer from Write-Combine
to the default caching.
Before the second commit, the mgag200 driver used to unmap the
framebuffer after each frame, which implicitly does a cache flush.

I don't understand why it needs to do a cache flush, or where the
code for it is. I'm asking because I want to study this technique.

Generally speaking, the default page caching mode on the x86-64
platform is cached. I think a cached mapping is the fastest for
software rendering, and the platform guarantees coherency for us, right?

Is it because the write buffer on the x86-64 platform (or CPU) is
implemented on top of the cache? I mean that on ARM (or other) CPUs,
when using Write-Combine, the data has nothing to do with the cache.

Both regressions are fixed by this commit, which restores WC mapping
for the framebuffer in system memory, and adds a cache flush.

So switching back to WC will probably decrease overall performance, I
think. And the cache flush operation should not have an impact. Unless
x86-64's Write-Combine is different from other platforms' Write-Combine?

Yes, this patch is a bit weird. Usually you want your VRAM mapping to be
Write-Combine. Here it also sets the system memory framebuffer as
Write-Combine. On most x86-64 CPUs, Write-Combine uses its own hardware
buffer that is not in L1/L2/L3. So when it copies the framebuffer from
WC system memory to VRAM, it doesn't involve the cache, and has less
impact on latency for other tasks running on other CPUs.
Also, I think the cache flush is important to flush those WC buffers, so
when the next frame comes, it won't have to wait for the buffers to be
copied to the slow VRAM.
When running the latency tests, it's obvious that both are needed.
This is how I understand it, but I may be wrong.
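
To make the copy-then-flush idea concrete, here is a rough userspace
sketch. It is not the driver code: flush_range() and copy_and_flush()
are made-up names, the 64-byte cache line size is an assumption, and
the source buffer here is ordinary cached memory rather than a real WC
mapping, so it only illustrates the flush pattern applied to the copied
rows. On non-SSE2 builds the flush is compiled out and only the line
count is returned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>	/* _mm_clflush(), _mm_mfence() */
#define HAVE_CLFLUSH 1
#else
#define HAVE_CLFLUSH 0
#endif

#define CACHE_LINE 64	/* assumed cache line size, in bytes */

/*
 * Hypothetical helper: flush every cache line that overlaps
 * [addr, addr + len). Returns the number of lines visited.
 */
static size_t flush_range(const void *addr, size_t len)
{
	uintptr_t p = (uintptr_t)addr & ~(uintptr_t)(CACHE_LINE - 1);
	uintptr_t end = (uintptr_t)addr + len;
	size_t lines = 0;

	if (len == 0)
		return 0;
#if HAVE_CLFLUSH
	_mm_mfence();	/* order earlier stores before the flushes */
#endif
	for (; p < end; p += CACHE_LINE) {
#if HAVE_CLFLUSH
		_mm_clflush((const void *)p);
#endif
		lines++;
	}
#if HAVE_CLFLUSH
	_mm_mfence();	/* wait for the flushes to complete */
#endif
	return lines;
}

/*
 * Hypothetical helper: copy the damaged rows [y1, y2) from the
 * system-memory framebuffer (src) to the destination (standing in for
 * VRAM), then flush the source rows that were just read.
 */
static size_t copy_and_flush(uint8_t *dst, const uint8_t *src,
			     size_t pitch, unsigned int y1, unsigned int y2)
{
	size_t off, len;

	assert(y2 >= y1);
	off = (size_t)y1 * pitch;
	len = (size_t)(y2 - y1) * pitch;

	memcpy(dst + off, src + off, len);
	return flush_range(src + off, len);
}
```

For example, copy_and_flush(dst, src, 128, 1, 3) copies rows 1 and 2 of
a 128-byte-pitch buffer and then flushes the cache lines covering those
256 source bytes.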

--

Jocelyn

This is only needed on x86_64, for low-latency workloads,
so the new Kconfig option DRM_MGAG200_IOBURST_WORKAROUND depends on
PREEMPT_RT and X86.

For more context, the whole thread can be found here [1]

Signed-off-by: Jocelyn Falempe <jfalempe@xxxxxxxxxx>
Link: https://lore.kernel.org/dri-devel/20231019135655.313759-1-jfalempe@xxxxxxxxxx/ # 1
Acked-by: Thomas Zimmermann <tzimmermann@xxxxxxx>