On Wed, Aug 16, 2023 at 10:25 AM Sarah Walker <sarah.walker@xxxxxxxxxx> wrote: > Add a GEM implementation based on drm_gem_shmem, and support code for the > PowerVR GPU MMU. The GPU VA manager is used for address space management. [...] > +/** > + * pvr_mmu_flush() - Request flush of all MMU caches. > + * @pvr_dev: Target PowerVR device. > + * > + * This function must be called following any possible change to the MMU page > + * tables. > + * > + * Returns: > + * * 0 on success, or > + * * Any error encountered while submitting the flush command via the KCCB. > + */ > +int > +pvr_mmu_flush(struct pvr_device *pvr_dev) > +{ > + /* TODO: implement */ > + return -ENODEV; > +} pvr_mmu_flush() being an operation that can fail looks dodgy to me. Especially given that a later patch implements pvr_mmu_flush() such that it looks like it can hit a transient failure on the path: pvr_mmu_flush pvr_kccb_send_cmd pvr_kccb_send_cmd_powered pvr_kccb_reserve_slot_sync pvr_kccb_reserve_slot_sync() even looks like it could transiently fail without even once calling pvr_kccb_try_reserve_slot() if it gets preempted until RESERVE_SLOT_TIMEOUT time has passed between the first and the second read of "jiffies". (Which is probably not very likely to happen by chance, but Android devices are typically configured with full kernel preemption, so if an attacker causes pvr_mmu_flush() to run in a process with a SCHED_IDLE scheduling policy and deliberately preempts the process at the right point, they might be able to achieve this.) A more robust retry pattern might be to do a fixed number of retry iterations with sleeps in between instead of retrying until a fixed amount of time has passed; though I still wouldn't want to rely on that for making sure that a TLB flush happens. In my opinion, any error path of pvr_mmu_flush() should guarantee that by the time it returns, the address space is no longer used by the GPU and can never be used by the GPU again. [...] > +/** > + * struct pvr_mmu_op_context - context holding data for individual > + * device-virtual mapping operations. Intended for use with a VM bind operation. > + */ > +struct pvr_mmu_op_context { > + /** @mmu_ctx: The MMU context associated with the owning VM context. */ > + struct pvr_mmu_context *mmu_ctx; > + > + /** @map: Data specifically for map operations. */ > + struct { > + /** > + * @sgt: Scatter gather table containing pages pinned for use by > + * this context - these are currently pinned when initialising > + * the VM bind operation. > + */ > + struct sg_table *sgt; > + > + /** @sgt_offset: Start address of the device-virtual mapping. */ > + u64 sgt_offset; > + } map; > + > + /** > + * @l1_free_tables: Preallocated l1 page table objects for use by this > + * context when creating a page mapping. Linked list created during > + * initialisation. Also used to collect page table objects freed by an > + * unmap. > + */ > + struct pvr_page_table_l1 *l1_free_tables; > + > + /** > + * @l0_free_tables: Preallocated l0 page table objects for use by this > + * context when creating a page mapping. Linked list created during > + * initialisation. Also used to collect page table objects freed by an > + * unmap. > + */ > + struct pvr_page_table_l0 *l0_free_tables; The free page table lists are shared between page table allocation and freeing within one operation, and they have last-in-first-out (stack) behavior, which means that when a pvr_vm_map() invocation does pvr_vm_gpuva_unmap() invocations followed by pvr_vm_gpuva_map() invocations, it can end up immediately reusing the freed page tables within the same operation, right? Since pvr_mmu_flush() only happens in pvr_mmu_op_context_destroy() at the end of pvr_vm_map(), that means a concurrent GPU page table walk could walk down to the old address where a page table used to be mapped, and then observe the page table entries that were created at the new address to which the page table was moved? Like: GPU AP === == load L2 PTE for VA A load L1 PTE for VA A, it contains a reference to L0 table T1 pvr_page_table_l1_remove() removes L0 table T1 for VA A pvr_page_table_l1_remove() removes L0 table T2 for VA B pvr_page_table_l1_insert() inserts L0 table T2 at VA A pvr_page_table_l1_insert() inserts L0 table T1 at VA B pvr_page_table_l0_insert() inserts PTE into T1 for VA B load L0 PTE for VA A from L0 table T1 And since page tables also don't seem to be cache-flushed when they are put on these freelists, this could maybe also happen the other way around: The GPU walks down into a page table that was moved to a new address, but observes PTEs from the old address at which the page table was previously mapped? > + > + /** > + * @curr_page - A reference to a single physical page as indexed by > + * the page table structure. > + */ > + struct pvr_page_table_ptr curr_page; > + > + /** > + * @sync_level_required: The maximum level of the page table tree > + * structure which has (possibly) been modified since it was last > + * flushed to the device. > + * > + * This field should only be set with pvr_mmu_op_context_require_sync() > + * or indirectly by pvr_mmu_op_context_sync_partial(). > + */ > + enum pvr_mmu_sync_level sync_level_required; > +}; [...] > +/** > + * pvr_mmu_unmap() - Unmap pages from a memory context. > + * @op_ctx: Target MMU op context. > + * @device_addr: First device-virtual address to unmap. > + * @size: Size in bytes to unmap. > + * > + * The total amount of device-virtual memory unmapped is > + * @nr_pages * %PVR_DEVICE_PAGE_SIZE. > + * > + * Returns: > + * * 0 on success, or > + * * Any error code returned by pvr_page_table_ptr_init(), or > + * * Any error code returned by pvr_page_table_ptr_unmap(). > + */ > +int pvr_mmu_unmap(struct pvr_mmu_op_context *op_ctx, u64 device_addr, u64 size) > +{ > + int err = pvr_mmu_op_context_set_curr_page(op_ctx, device_addr, false); > + > + if (err) > + return err; > + > + return pvr_mmu_op_context_unmap_curr_page(op_ctx, > + size >> PVR_DEVICE_PAGE_SHIFT); > +} I think we can get here in the middle of this call path: pvr_ioctl_vm_unmap pvr_vm_unmap pvr_mmu_op_context_create mutex_lock(&vm_ctx->lock); drm_gpuva_sm_unmap __drm_gpuva_sm_unmap loop: op_unmap_cb [conditional] pvr_vm_gpuva_unmap pvr_mmu_unmap [WE ARE HERE] pvr_mmu_op_context_set_curr_page pvr_mmu_op_context_sync [CACHE FLUSH] pvr_mmu_op_context_load_tables pvr_mmu_op_context_unmap_curr_page pvr_page_destroy [conditional] loop: pvr_mmu_op_context_next_page pvr_mmu_op_context_sync_partial [CACHE FLUSH] pvr_mmu_op_context_load_tables pvr_page_destroy pvr_page_table_l0_remove [REMOVES PTE] pvr_page_table_l1_remove [conditional] pvr_mmu_op_context_require_sync drm_gpuva_unmap drm_gpuva_unlink kfree(op->unmap.va) pvr_gem_object_put [FREES PAGES] mutex_unlock(&vm_ctx->lock); pvr_mmu_op_context_destroy pvr_mmu_op_context_sync [CACHE FLUSH] pvr_mmu_flush [FLUSHES MMU] loop: pvr_page_table_l0_free loop: pvr_page_table_l1_free >From what I can tell, we can get from `pvr_page_table_l0_remove` (where the GPU PTE is cleared) to `pvr_gem_object_put` (where the page referenced by the PTE can be freed) without going through a page table cache flush or a MMU flush; I think we need both (unless the pvr_gem_object_put() is somehow deferred until pvr_mmu_op_context_destroy() is reached).