On 30.01.25 10:40, Simona Vetter wrote:
On Thu, Jan 30, 2025 at 05:11:49PM +1100, Alistair Popple wrote:
On Wed, Jan 29, 2025 at 12:54:02PM +0100, David Hildenbrand wrote:
We require a writable PTE and only support anonymous folio: we can only
have exactly one PTE pointing at that page, which we can just lookup
using a folio walk, avoiding the rmap walk and the anon VMA lock.
So let's stop doing an rmap walk and perform a folio walk instead, so we
can easily just modify a single PTE and avoid relying on rmap/mapcounts.
We now effectively work on a single PTE instead of multiple PTEs of
a large folio, allowing for conversion of individual PTEs from
non-exclusive to device-exclusive -- note that the other way always
worked on single PTEs.
We can drop the MMU_NOTIFY_EXCLUSIVE MMU notifier call and document why
that is not required: GUP will already take care of the
MMU_NOTIFY_EXCLUSIVE call if required (there is already a device-exclusive
entry) when not finding a present PTE and having to trigger a fault and
ending up in remove_device_exclusive_entry().
I will have to look at this a bit more closely tomorrow but this doesn't seem
right to me. We may be transitioning from a present PTE (ie. a writable
anonymous mapping) to a non-present PTE (ie. a device-exclusive entry) and
therefore any secondary processors (eg. other GPUs, iommus, etc.) will need to
update their copies of the PTE. So I think the notifier call is needed.
I guess this is a question of semantics we want, for multiple gpus do we
require that device-exclusive also excludes other gpus or not. I'm leaning
towards agreeing with you here.
See my reply, it's also relevant for non-device, such as KVM. So it's
the right thing to do.
Note that the PTE is
always writable, and we can always create a writable-device-exclusive
entry.
With this change, device-exclusive is fully compatible with THPs /
large folios. We still require PMD-sized THPs to get PTE-mapped, and
supporting PMD-mapped THP (without the PTE-remapping) is a different
endeavour that might not be worth it at this point.
I'm not sure we actually want hugepages for device exclusive, since it has
an impact on what's allowed and what not. If we only ever do 4k entries
then userspace can assume that as long atomics are separated by a 4k page
there's no issue when both the gpu and cpu hammer on them. If we try to
keep thp entries then suddenly a workload that worked before will result
in endless ping-pong between gpu and cpu because the separate atomic
counters (or whatever) now all sit in the same 2m page.
Agreed. And the conversion + mapping into the device gets trickier.
So going with thp might result in userspace having to spread out atomics
even more, which is just wasting memory and not saving any tlb entries
since often you don't need that many.
tldr; I think not supporting thp entries for device exclusive is a
feature, not a bug.
So, you agree with my "different endeavour that might not be worth it"
statement?
--
Cheers,
David / dhildenb