Hi!

Thanks, I think it's great that Imagination is writing an upstream
driver for PowerVR. :)

On Wed, Aug 16, 2023 at 10:25 AM Sarah Walker <sarah.walker@xxxxxxxxxx> wrote:
> Add a GEM implementation based on drm_gem_shmem, and support code for the
> PowerVR GPU MMU. The GPU VA manager is used for address space management.

[...]

> +/**
> + * DOC: Flags for DRM_IOCTL_PVR_CREATE_BO (kernel-only)
> + *
> + * Kernel-only values allowed in &pvr_gem_object->flags. The majority of options
> + * for this field are specified in the UAPI header "pvr_drm.h" with a
> + * DRM_PVR_BO_ prefix. To distinguish these internal options (which must exist
> + * in ranges marked as "reserved" in the UAPI header), we drop the DRM prefix.
> + * The public options should be used directly, DRM prefix and all.
> + *
> + * To avoid potentially confusing gaps in the UAPI options, these kernel-only
> + * options are specified "in reverse", starting at bit 63.
> + *
> + * We use "reserved" to refer to bits defined here and not exposed in the UAPI.
> + * Bits not defined anywhere are "undefined".
> + *
> + * Creation options
> + *    These use the prefix PVR_BO_CREATE_.
> + *
> + *    *There are currently no kernel-only flags in this group.*
> + *
> + * Device mapping options
> + *    These use the prefix PVR_BO_DEVICE_.
> + *
> + *    *There are currently no kernel-only flags in this group.*
> + *
> + * CPU mapping options
> + *    These use the prefix PVR_BO_CPU_.
> + *
> + *    :CACHED: By default, all GEM objects are mapped write-combined on the
> + *       CPU. Set this flag to override this behaviour and map the object
> + *       cached.
> + */
> +#define PVR_BO_CPU_CACHED BIT_ULL(63)
> +
> +#define PVR_BO_FW_NO_CLEAR_ON_RESET BIT_ULL(62)
> +
> +/* Bits 62..3 are undefined. */
> +/* Bits 2..0 are defined in the UAPI. */
> +
> +/* Other utilities. */
> +#define PVR_BO_UNDEFINED_MASK GENMASK_ULL(61, 3)
> +#define PVR_BO_RESERVED_MASK (PVR_BO_UNDEFINED_MASK | GENMASK_ULL(63, 63))

In commit 1a9c568fb559 ("drm/imagination: Rework firmware object
initialisation") in powervr-next, PVR_BO_FW_NO_CLEAR_ON_RESET (bit 62)
was added in the kernel-only flags group, but the mask
PVR_BO_RESERVED_MASK (which is used in pvr_ioctl_create_bo to detect
kernel-only and reserved flags) looks like it wasn't changed to include
bit 62. I think that means it becomes possible for userspace to pass
this bit in via pvr_ioctl_create_bo()?

If my understanding is correct and that was unintentional, it might be
a good idea to change these defines:

#define PVR_BO_UNDEFINED_MASK GENMASK_ULL(61, 3)
#define PVR_BO_RESERVED_MASK (PVR_BO_UNDEFINED_MASK | GENMASK_ULL(63, 63))

into something like this to avoid similar mishaps in the future:

/* first bit that is not used for UAPI BO options */
#define PVR_BO_FIRST_RESERVED_BIT 3
#define PVR_BO_UNDEFINED_MASK GENMASK_ULL(61, PVR_BO_FIRST_RESERVED_BIT)
#define PVR_BO_RESERVED_MASK GENMASK_ULL(63, PVR_BO_FIRST_RESERVED_BIT)

> +
> +/*
> + * All firmware-mapped memory uses (mostly) the same flags. Specifically,
> + * firmware-mapped memory should be:
> + * * Read/write on the device,
> + * * Read/write on the CPU, and
> + * * Write-combined on the CPU.
> + *
> + * The only variation is in caching on the device.
> + */
> +#define PVR_BO_FW_FLAGS_DEVICE_CACHED (ULL(0))
> +#define PVR_BO_FW_FLAGS_DEVICE_UNCACHED DRM_PVR_BO_DEVICE_BYPASS_CACHE

[...]

> +/**
> + * pvr_page_table_l2_entry_raw_set() - Write a valid entry into a raw level 2
> + * page table.
> + * @entry: Target raw level 2 page table entry.
> + * @child_table_dma_addr: DMA address of the level 1 page table to be
> + * associated with @entry.
> + *
> + * When calling this function, @child_table_dma_addr must be a valid DMA
> + * address and a multiple of %ROGUE_MMUCTRL_PC_DATA_PD_BASE_ALIGNSIZE.
> + */
> +static void
> +pvr_page_table_l2_entry_raw_set(struct pvr_page_table_l2_entry_raw *entry,
> +                                dma_addr_t child_table_dma_addr)
> +{
> +        child_table_dma_addr >>= ROGUE_MMUCTRL_PC_DATA_PD_BASE_ALIGNSHIFT;
> +
> +        entry->val =
> +                PVR_PAGE_TABLE_FIELD_PREP(2, PC, VALID, true) |
> +                PVR_PAGE_TABLE_FIELD_PREP(2, PC, ENTRY_PENDING, false) |
> +                PVR_PAGE_TABLE_FIELD_PREP(2, PC, PD_BASE, child_table_dma_addr);
> +}

For this function and others that manipulate page table entries, please
use some kernel helper that ensures that the store can't tear (at least
WRITE_ONCE() - that can still tear on 32-bit, but I see the driver
depends on ARM64, so that's not a problem).
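For illustration, roughly what I have in mind for this particular
function - just an untested sketch, reusing the field macros from the
patch:

static void
pvr_page_table_l2_entry_raw_set(struct pvr_page_table_l2_entry_raw *entry,
                                dma_addr_t child_table_dma_addr)
{
        child_table_dma_addr >>= ROGUE_MMUCTRL_PC_DATA_PD_BASE_ALIGNSHIFT;

        /*
         * WRITE_ONCE() forces the compiler to emit a single 64-bit
         * store on arm64, so a concurrent GPU page table walk can
         * never observe a half-written (torn) entry.
         */
        WRITE_ONCE(entry->val,
                   PVR_PAGE_TABLE_FIELD_PREP(2, PC, VALID, true) |
                   PVR_PAGE_TABLE_FIELD_PREP(2, PC, ENTRY_PENDING, false) |
                   PVR_PAGE_TABLE_FIELD_PREP(2, PC, PD_BASE,
                                             child_table_dma_addr));
}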
[...]

> +/**
> + * pvr_page_table_l2_insert() - Insert an entry referring to a level 1 page
> + * table into a level 2 page table.
> + * @op_ctx: Target MMU op context pointing at the entry to insert the L1 page
> + * table into.
> + * @child_table: Target level 1 page table to be referenced by the new entry.
> + *
> + * It is the caller's responsibility to ensure @op_ctx.curr_page points to a
> + * valid L2 entry.
> + */
> +static void
> +pvr_page_table_l2_insert(struct pvr_mmu_op_context *op_ctx,
> +                         struct pvr_page_table_l1 *child_table)
> +{
> +        struct pvr_page_table_l2 *l2_table =
> +                &op_ctx->mmu_ctx->page_table_l2;
> +        struct pvr_page_table_l2_entry_raw *entry_raw =
> +                pvr_page_table_l2_get_entry_raw(l2_table,
> +                                                op_ctx->curr_page.l2_idx);
> +
> +        pvr_page_table_l2_entry_raw_set(entry_raw,
> +                                        child_table->backing_page.dma_addr);

Can you maybe add comments in functions that set page table entries to
document who is responsible for using a memory barrier (like wmb()) to
ensure that the creation of a page table entry is ordered after the
thing it points to is fully initialized? Otherwise the GPU could end up
concurrently walking into a page table and observing its old contents
from before it was zero-initialized.
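Something along these lines, for example - just a sketch of the kind of
comment/barrier placement I mean; whether dma_wmb() is the right
barrier here (as opposed to wmb(), or a barrier placed in the caller)
depends on how the page table memory is mapped, so treat that detail as
an assumption on my part:

        /*
         * Ensure the zero-initialization of @child_table is visible
         * to the device before the L2 entry pointing at it is made
         * valid, so a concurrent GPU walk can't descend into the
         * table and observe its stale, pre-initialization contents.
         */
        dma_wmb();

        pvr_page_table_l2_entry_raw_set(entry_raw,
                                        child_table->backing_page.dma_addr);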
> +
> +        child_table->parent = l2_table;
> +        child_table->parent_idx = op_ctx->curr_page.l2_idx;
> +        l2_table->entries[op_ctx->curr_page.l2_idx] = child_table;
> +        ++l2_table->entry_count;
> +        op_ctx->curr_page.l1_table = child_table;
> +}

[...]

> +/**
> + * pvr_page_table_l1_get_or_insert() - Retrieves (optionally inserting if
> + * necessary) a level 1 page table from the specified level 2 page table entry.
> + * @op_ctx: Target MMU op context.
> + * @should_insert: [IN] Specifies whether new page tables should be inserted
> + * when empty page table entries are encountered during traversal.
> + *
> + * Return:
> + *  * 0 on success, or
> + *
> + *    If @should_insert is %false:
> + *     * -%ENXIO if a level 1 page table would have been inserted.
> + *
> + *    If @should_insert is %true:
> + *     * Any error encountered while inserting the level 1 page table.
> + */
> +static int
> +pvr_page_table_l1_get_or_insert(struct pvr_mmu_op_context *op_ctx,
> +                                bool should_insert)
> +{
> +        struct pvr_page_table_l2 *l2_table =
> +                &op_ctx->mmu_ctx->page_table_l2;
> +        struct pvr_page_table_l1 *table;
> +        int err;
> +
> +        if (pvr_page_table_l2_entry_is_valid(l2_table,
> +                                             op_ctx->curr_page.l2_idx)) {
> +                op_ctx->curr_page.l1_table =
> +                        l2_table->entries[op_ctx->curr_page.l2_idx];
> +                return 0;
> +        }
> +
> +        if (!should_insert)
> +                return -ENXIO;
> +
> +        /* Take a prealloced table. */
> +        table = op_ctx->l1_free_tables;
> +        if (!table)
> +                return -ENOMEM;
> +
> +        err = pvr_page_table_l1_init(table, op_ctx->mmu_ctx->pvr_dev);

I think when we have a preallocated table here, it was allocated in
pvr_page_table_l1_alloc(), which already called
pvr_page_table_l1_init()? So it looks to me like this second
pvr_page_table_l1_init() call will allocate another page and leak the
old allocation.

> +        if (err)
> +                return err;
> +
> +        /* Pop */
> +        op_ctx->l1_free_tables = table->next_free;
> +        table->next_free = NULL;
> +
> +        pvr_page_table_l2_insert(op_ctx, table);
> +
> +        return 0;
> +}

[...]

> +/**
> + * pvr_mmu_op_context_create() - Create an MMU op context.
> + * @ctx: MMU context associated with owning VM context.
> + * @sgt: Scatter gather table containing pages pinned for use by this context.
> + * @sgt_offset: Start offset of the requested device-virtual memory mapping.
> + * @size: Size in bytes of the requested device-virtual memory mapping. For an
> + * unmapping, this should be zero so that no page tables are allocated.
> + *
> + * Returns:
> + *  * Newly created MMU op context object on success, or
> + *  * -%ENOMEM if no memory is available,
> + *  * Any error code returned by pvr_page_table_l2_init().
> + */
> +struct pvr_mmu_op_context *
> +pvr_mmu_op_context_create(struct pvr_mmu_context *ctx, struct sg_table *sgt,
> +                          u64 sgt_offset, u64 size)
> +{
> +        int err;
> +
> +        struct pvr_mmu_op_context *op_ctx =
> +                kzalloc(sizeof(*op_ctx), GFP_KERNEL);
> +
> +        if (!op_ctx)
> +                return ERR_PTR(-ENOMEM);
> +
> +        op_ctx->mmu_ctx = ctx;
> +        op_ctx->map.sgt = sgt;
> +        op_ctx->map.sgt_offset = sgt_offset;
> +        op_ctx->sync_level_required = PVR_MMU_SYNC_LEVEL_NONE;
> +
> +        if (size) {
> +                const u32 l1_start_idx = pvr_page_table_l2_idx(sgt_offset);
> +                const u32 l1_end_idx = pvr_page_table_l2_idx(sgt_offset + size);
> +                const u32 l1_count = l1_end_idx - l1_start_idx + 1;
> +                const u32 l0_start_idx = pvr_page_table_l1_idx(sgt_offset);
> +                const u32 l0_end_idx = pvr_page_table_l1_idx(sgt_offset + size);
> +                const u32 l0_count = l0_end_idx - l0_start_idx + 1;

Shouldn't the page table indices be calculated from the device_addr
(which is not currently passed in by pvr_vm_map())? As far as I can
tell, sgt_offset doesn't have anything to do with the device address
at which this mapping will be inserted in the page tables? (I put a
small sketch of what I mean at the bottom of this mail.)

> +
> +                /*
> +                 * Alloc and push page table entries until we have enough of
> +                 * each type, ending with linked lists of l0 and l1 entries in
> +                 * reverse order.
> +                 */
> +                for (int i = 0; i < l1_count; i++) {
> +                        struct pvr_page_table_l1 *l1_tmp =
> +                                pvr_page_table_l1_alloc(ctx);
> +
> +                        err = PTR_ERR_OR_ZERO(l1_tmp);
> +                        if (err)
> +                                goto err_cleanup;
> +
> +                        l1_tmp->next_free = op_ctx->l1_free_tables;
> +                        op_ctx->l1_free_tables = l1_tmp;
> +                }
> +
> +                for (int i = 0; i < l0_count; i++) {
> +                        struct pvr_page_table_l0 *l0_tmp =
> +                                pvr_page_table_l0_alloc(ctx);
> +
> +                        err = PTR_ERR_OR_ZERO(l0_tmp);
> +                        if (err)
> +                                goto err_cleanup;
> +
> +                        l0_tmp->next_free = op_ctx->l0_free_tables;
> +                        op_ctx->l0_free_tables = l0_tmp;
> +                }
> +        }
> +
> +        return op_ctx;
> +
> +err_cleanup:
> +        pvr_mmu_op_context_destroy(op_ctx);
> +
> +        return ERR_PTR(err);
> +}
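Regarding the device_addr question above: to illustrate, I would have
expected the index calculation to look something like the following.
This is hypothetical and untested - it assumes a device_addr parameter
gets added to pvr_mmu_op_context_create() and threaded through from
pvr_vm_map(), which is not what the patch currently does:

        if (size) {
                /* Indices derived from the device-virtual address of
                 * the mapping, not from the offset into the SGT.
                 */
                const u32 l1_start_idx = pvr_page_table_l2_idx(device_addr);
                const u32 l1_end_idx = pvr_page_table_l2_idx(device_addr + size);
                const u32 l1_count = l1_end_idx - l1_start_idx + 1;
                const u32 l0_start_idx = pvr_page_table_l1_idx(device_addr);
                const u32 l0_end_idx = pvr_page_table_l1_idx(device_addr + size);
                const u32 l0_count = l0_end_idx - l0_start_idx + 1;
                ...
        }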