Re: [PATCH RFC 18/18] drm/asahi: Add the Asahi driver for Apple AGX GPUs

Asahi Lina <lina@xxxxxxxxxxxxx> · Thu, 6 Apr 2023 14:09:21 +0900

Argh. This (and my other reply) was supposed to go to Daniel, but 
Thunderbird... just dropped that recipient? And then my silly brain saw 
all the Cc:s go to To: and figured it was some weird consolidation and 
so I moved everything to Cc: except the only name that started with "Da" 
and... yeah, that wasn't the same person.

Sorry for the confusion... I have no idea why Thunderbird hates Daniel...

On 06/04/2023 13.44, Asahi Lina wrote:
On 05/04/2023 23.37, Daniel Vetter wrote:
On Tue, Mar 07, 2023 at 11:25:43PM +0900, Asahi Lina wrote:
+/// A generic monotonically incrementing ID used to uniquely identify object instances within the
+/// driver.
+pub(crate) struct ID(AtomicU64);
+
+impl ID {
+    /// Create a new ID counter with a given value.
+    fn new(val: u64) -> ID {
+        ID(AtomicU64::new(val))
+    }
+
+    /// Fetch the next unique ID.
+    pub(crate) fn next(&self) -> u64 {
+        self.0.fetch_add(1, Ordering::Relaxed)
+    }
+}

Continuing the theme of me commenting on individual things, I stumbled
over this because I noticed that there's a lot of id based lookups where I
don't expect them, and started chasing.

- For ids use xarray, not atomic counters. Yes I know dma_fence timelines
    gets this wrong, this goes back to an innocent time where we didn't
    allocate more than one timeline per engine, and no one fixed it since
    then. Yes u64 should be big enough for everyone :-/

- Attaching ID spaces to drm_device is also not great. drm is full of
    these mistakes. Much better if their per drm_file and so private to each
    client.

- They shouldn't be used for anything else than uapi id -> kernel object
    lookup at the beginning of ioctl code, and nowhere else. At least from
    skimming it seems like these are used all over the driver codebase,
    which does freak me out. At least on the C side that's a clear indicator
    for a refcount/lockin/data structure model that's not thought out at
    all.

What's going on here, what do I miss?

These aren't UAPI IDs, they are driver-internal IDs (the UAPI IDs do use
xarray and are per-File). Most of them are just for debugging, so that
when I enable full debug spam I have some way to correlate different
things that are happening together (this subset of interleaved log lines
relate to the same submission). Basically just object names that are
easier to read (and less of a security leak) than pointers and
guaranteed not to repeat. You could get rid of most of them and it
wouldn't affect the driver design, it just makes it very hard to see
what's going on with debug logs ^^;

There are only two that are ever used for non-debugging purposes: the VM
ID, and the File ID. Both are per-device global IDs attached to the VMs
(not the UAPI VM objects, but rather the underlyng MMU address space
managers they represent, including the kernel-internal ones) and to
Files themselves. They are used for destroying GEM objects: since the
objects are also device-global across multiple clients, I need a way to
do things like "clean up all mappings for this File" or "clean up all
mappings for this VM". There's an annoying circular reference between
GEM objects and their mappings, which is why this is explicitly coded
out in destroy paths instead of naturally happening via Drop semantics
(without that cleanup code, the circular reference leaks it).

So e.g. when a File does a GEM close or explicitly asks for all mappings
of an object to be removed, it goes out to the (possibly shared) GEM
object and tells it to drop all mappings marked as owned by that unique
File ID. When an explicit "unmap all in VM" op happens, it asks the GEM
object to drop all mappings for that underlying VM ID. Similarly, when a
UAPI VM object is dropped (in the Drop impl, so both explicitly and when
the whole File/xarray is dropped and such), that does an explicit unmap
of a special dummy object it owns which would otherwise leak since it is
not tracked as a GEM object owned by that File and therefore not handled
by GEM closing. And again along the same lines, the allocators in
alloc.rs explicitly destroy the mappings for their backing GEM objects
on Drop. All this is due to that annoying circular reference between VMs
and GEM objects that I'm not sure how to fix.

Note that if I *don't* do this (or forget to do it somewhere) the
consequence is just that we leak memory, and if you try to destroy the
wrong IDs somehow the worst that can happen is you unmap things you
shouldn't and fault the GPU (or, in the kernel or kernel-managed user VM
cases, potentially the firmware). Rust safety guarantees still keep
things from going entirely off the rails within the kernel, since
everything that matters is reference counted (which is why these
reference cycles are possible at all).

This all started when I was looking at the panfrost driver for
reference. It does the same thing except it uses actual pointers to the
owning entities instead of IDs, and pointer comparison (see
panfrost_gem_close). Of course you could try do that in Rust too
(literally storing and comparing raw pointers that aren't owned
references), but then you're introducing a Pin<> requirement on those
objects to make their addresses stable and it feels way more icky and
error-prone than unique IDs (since addresses can be reused). panfrost
only has a single mmu (what I call the raw VM) per File while I have an
arbitrary number, which is why I end up with the extra
distinction/complexity of both File and VM IDs, but the concept is the same.

Some of this is going to be refactored when I implement arbitrary VM
range mapping/unmapping, which would be a good time to improve this...
but is there something particularly wrong/broken about the way I'm doing
it now that I missed? I figured unique u64 IDs would be a pretty safe
way to identify entities and cleanup the mappings when needed.

~~ Lina

~~ Lina