On 02.09.24 at 11:32, Thomas Hellström wrote:
On Mon, 2024-09-02 at 08:13 +1000, Dave Airlie wrote:
On Fri, 30 Aug 2024 at 12:32, Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
On Fri, 30 Aug 2024 at 14:08, Dave Airlie <airlied@xxxxxxxxx>
wrote:
The TTM revert is due to some stuttering graphical apps, probably due to longer stalls while prefaulting.
Yeah, trying to pre-fault a PMD worth of pages in one go is just crazy talk.

Now, if it was PMD-aligned and you faulted in a single PMD, that would be different. But just doing vmf_insert_pfn_prot() in a loop is insane.
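The pattern in question is roughly the following (a simplified sketch of the TTM fault path, not the exact code; error handling trimmed):

	/* Simplified sketch of the per-pfn prefault loop. */
	for (i = 0; i < num_prefault; ++i) {
		/*
		 * Each call walks down to the PTE and takes/drops the
		 * last-level page table lock, whether or not the entry
		 * is already populated.
		 */
		ret = vmf_insert_pfn_prot(vma, address, pfn, prot);
		if (unlikely(ret & VM_FAULT_ERROR))
			break;
		address += PAGE_SIZE;
		++pfn;
	}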
The code doesn't even stop when it hits a page that already existed, and it keeps locking and unlocking the last-level page table over and over again.
Honestly, that code is questionable even for the *small* value, much less the "a PMD size" case.
Now, if you have an array of 'struct page *', you can use vm_insert_pages(), and that's reasonably efficient.

And if you have a *contiguous* area of pfns, you can use remap_pfn_range().
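For comparison, the two bulk paths look roughly like this (a sketch only; setup and error handling omitted, variable names made up):

	/*
	 * Batched insert of an array of struct page *: the page table
	 * lock is taken once per page-table page instead of per entry.
	 */
	unsigned long nr = npages;
	err = vm_insert_pages(vma, vma->vm_start, pages, &nr);

	/* Contiguous pfn range: a single call maps the whole thing. */
	err = remap_pfn_range(vma, vma->vm_start, pfn,
			      npages << PAGE_SHIFT, vma->vm_page_prot);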
But that "insert one pfn at a time" that the drm layer does is
complete garbage. You're not speeding anything up, you're just
digging
deeper.
I wonder if there is functionality that could be provided in a common helper by the mm layers, or if there would be too many locking interactions to make it sane.

It seems too fraught with danger for drivers or subsystems to just be doing this in the simplest way, which isn't actually that smart.
Hmm. I see even the "Don't error on prefaults" check was broken at some
point :/.
There have been numerous attempts to address this:
remap_pfn_range() was the most recent attempt, at least in the context of the i915 driver, IIRC by Christoph Hellwig, but it had to be ripped out since it requires the mmap_lock in write mode. Here we hold it only in read mode.
Then there's apply_to_page_range(), used by the igfx functionality of the i915 driver. I don't think we should go that route without turning it into something like vm_insert_pfns() with proper checking. This approach populates all entries of a buffer object.
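For context, that path is roughly the following shape (simplified from i915's remap_io_mapping()/remap_io_sg(); apply_to_page_range() itself does no checking that the PTE is empty or that the pfn is valid, which is the checking referred to above):

	struct remap_pfn {
		struct mm_struct *mm;
		unsigned long pfn;
		pgprot_t prot;
	};

	static int remap_pfn_fn(pte_t *pte, unsigned long addr, void *data)
	{
		struct remap_pfn *r = data;

		/* Special PTEs are not associated with any struct page. */
		set_pte_at(r->mm, addr, pte,
			   pte_mkspecial(pfn_pte(r->pfn++, r->prot)));
		return 0;
	}

	/* The caller then populates the whole range in one walk: */
	err = apply_to_page_range(vma->vm_mm, addr, size, remap_pfn_fn, &r);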
Finally, there's the huge-fault attempt, which had to be ripped out due to the lack of pmd_special and pud_special flags and the resulting clashes with gup_fast.
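The idea there was roughly a single PMD-sized insert from the ->huge_fault() handler (a sketch, not the removed code itself; it only works when the fault address and the backing pfns are PMD-aligned and physically contiguous):

	/*
	 * Without pmd_special there is no way for gup_fast to
	 * recognize these entries, which is what clashed here.
	 */
	if (IS_ALIGNED(vmf->address, PMD_SIZE) &&
	    IS_ALIGNED(PFN_PHYS(pfn), PMD_SIZE))
		return vmf_insert_pfn_pmd(vmf, __pfn_to_pfn_t(pfn, PFN_DEV),
					  vmf->flags & FAULT_FLAG_WRITE);

	return VM_FAULT_FALLBACK;	/* fall back to PTE faults */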
Perhaps a combination of the latter two, if properly implemented, would be the best choice.
I'm not deep enough into the memory management background to judge which approach is best; just one more data point to provide:
The pre-faulting was increased because of virtualization. When KVM/XEN is mapping a BO into a guest, the switching overhead for each fault is so high that mapping a lot of PFNs at the same time becomes beneficial.
Regards,
Christian.
/Thomas
Dave.