On Thu, Feb 27, 2025 at 09:51:15AM -0700, Cavitt, Jonathan wrote:
> Some responses below. If I skip over anything, just assume that I'm
> taking the request into consideration and that it will be fixed for
> version 2 of this patch series.
>
> -----Original Message-----
> From: Brost, Matthew <matthew.brost@xxxxxxxxx>
> Sent: Thursday, February 27, 2025 12:25 AM
> To: Cavitt, Jonathan <jonathan.cavitt@xxxxxxxxx>
> Cc: intel-xe@xxxxxxxxxxxxxxxxxxxxx; Gupta, saurabhg <saurabhg.gupta@xxxxxxxxx>; Zuo, Alex <alex.zuo@xxxxxxxxx>; joonas.lahtinen@xxxxxxxxxxxxxxx; Zhang, Jianxun <jianxun.zhang@xxxxxxxxx>; dri-devel@xxxxxxxxxxxxxxxxxxxxx
> Subject: Re: [PATCH 6/6] drm/xe/xe_vm: Implement xe_vm_get_property_ioctl
>
> > On Wed, Feb 26, 2025 at 10:55:56PM +0000, Jonathan Cavitt wrote:
> > > Add support for userspace to get various properties from a specified
> > > VM. The currently supported properties are:
> > >
> > > - The number of engine resets the VM has observed
> > > - The number of exec queue bans the VM has observed, up to the last
> > >   50 relevant ones, and how many of those were caused by faults.
> > >
> > > The latter request also includes information on the exec queue bans,
> > > such as the ID of the banned exec queue, whether the ban was caused
> > > by a pagefault or not, and the address and address type of the
> > > associated fault (if one exists).
> > >
> > > Signed-off-by: Jonathan Cavitt <jonathan.cavitt@xxxxxxxxx>
> > > Suggested-by: Matthew Brost <matthew.brost@xxxxxxxxx>
> > > ---
> [...]
> > > +
> > > +struct drm_xe_ban {
> > > +	/** @exec_queue_id: ID of banned exec queue */
> > > +	__u32 exec_queue_id;
> >
> > I don't think we can reliably associate a page fault with an
> > exec_queue_id at the moment, given my above statement about having to
> > capture all state at the time of the page fault. Maybe we could with
> > some tricks between the page fault and the IOMMU CAT error G2H?
> > Regardless, let's ask the UMD we are targeting [1] if this information
> > would be helpful. It would seemingly have to be vendor-specific
> > information, not part of the generic Vk information.
> >
> > Additionally, it might be good to ask what other vendor-specific
> > information, if any, we'd need here based on what the current page
> > fault interface supports.
> >
> > [1] https://registry.khronos.org/vulkan/specs/latest/man/html/VK_EXT_device_fault.html
>
> The original request was something along the lines of having a mirror of
> DRM_IOCTL_I915_GET_RESET_STATS on XeKMD. Those reset stats contain
> information on the "context" ID, which maps to the exec queue ID on
> XeKMD.
>
> Even if we can't reasonably blame a pagefault on a particular exec
> queue, in order to match the request correctly, this information needs
> to be returned.
>
> The i915 reset stats also contain information on the number of observed
> engine resets, so that needs to be returned as well.
>
> @joonas.lahtinen@xxxxxxxxxxxxxxx can provide more details. Or maybe
> @Mistat, Tomasz.
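For reference, the i915 interface being mirrored is, as best I recall
from include/uapi/drm/i915_drm.h, roughly the following, so a strict
mirror of it would cover the context ID and the reset counts but carry
nothing fault-related:

	struct drm_i915_reset_stats {
		__u32 ctx_id;
		__u32 flags;
		/* All resets since boot/module reload, for all contexts */
		__u32 reset_count;
		/* Number of batches lost when active in GPU, for this context */
		__u32 batch_active;
		/* Number of batches lost pending for execution, for this context */
		__u32 batch_pending;
		__u32 pad;
	};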
You haven't really answered my question here, or below where you say "see
above". We need a UMD use case posted with any uAPI changes before
merging them. I know the above Vk extension is going to be implemented on
top of this series, but it is very unclear where the number-of-resets
requirement / UMD use case is coming from, which makes it impossible to
review.

Again, I suggest focusing on the Vk use case first, or talking to our UMD
partners to figure out exactly why something similar to
DRM_IOCTL_I915_GET_RESET_STATS is required in Xe. I have made similar
comments on VLK-69424.

Matt

> > > +	/** @faulted: Whether or not the ban has an associated pagefault. 0 is no, 1 is yes */
> > > +	__u32 faulted;
> > > +	/** @address: Address of the fault, if relevant */
> > > +	__u64 address;
> > > +	/** @address_type: enum drm_xe_fault_address_type, if relevant */
> > > +	__u32 address_type;
> >
> > We likely need a fault_size field to support VkDeviceSize
> > addressPrecision, as defined here [2]. I believe we can extract this
> > information from pagefault.fault_level.
> >
> > [2] https://registry.khronos.org/vulkan/specs/latest/man/html/VkDeviceFaultAddressInfoEXT.html
>
> I can add this field as a prototype, though it will always return SZ_4K
> until we can have a longer discussion on how to map between the
> fault_level and the fault_size.
>
> > > +	/** @pad: MBZ */
> > > +	__u32 pad;
> > > +	/** @reserved: MBZ */
> > > +	__u64 reserved[3];
> > > +};
> > > +
> > > +struct drm_xe_faults {
> > > +	/** @num_faults: Number of faults observed on the VM */
> > > +	__u32 num_faults;
> > > +	/** @num_bans: Number of bans observed on the VM */
> > > +	__u32 num_bans;
> >
> > I don't think num_bans and num_faults really provide any benefit for
> > supporting [1]. The requirement for [1] is device faults, nothing
> > more. With that in mind, I'd lean toward an array of a single
> > structure (returned in drm_xe_vm_get_property.data; the number of
> > faults can be inferred from the returned size) reporting all faults,
> > with each entry containing all the fault information. If another use
> > case arises for reporting all banned queues, we can add a property
> > for that.
>
> I'm fairly certain the full ban list was directly requested, but I can
> break it into a third query at least.
>
> Also, the abstraction is done here because that's how copy_from_user
> has historically been used. I'd rather not experiment with trying to
> copy_from_user a structure array and bungling it, but I guess I can
> give it a try at least...
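Returning a flat array shouldn't need any copy_from_user tricks; the data
flows kernel to user here, so it is a single copy_to_user of the whole
array. A rough sketch of the size-query-then-copy pattern (untested; the
num_bans count and the bans array are stand-in names for however the VM
ends up tracking its ban list):

	struct drm_xe_vm_get_property *args = data;
	u32 size = num_bans * sizeof(struct drm_xe_ban);

	if (!args->size) {
		/* First call: report the required buffer size. */
		args->size = size;
		return 0;
	}

	if (args->size != size)
		return -EINVAL;

	/* Second call: copy the array straight into the user buffer. */
	if (copy_to_user(u64_to_user_ptr(args->data), bans, size))
		return -EFAULT;

	return 0;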
> > > +	/** @reserved: MBZ */
> > > +	__u64 reserved[2];
> > > +	/** @list: Dynamic sized array of drm_xe_ban bans */
> > > +	struct drm_xe_ban list[];
> >
> > list[0] would be the preferred way.
>
> That is not how dynamic arrays are handled for
> struct drm_xe_query_engines,
> struct drm_xe_query_mem_regions,
> struct drm_xe_query_config,
> struct drm_xe_query_gt_list,
> struct drm_xe_query_topology_mask,
> struct drm_xe_oa_unit,
> or
> struct drm_xe_query_oa_units.
>
> > > +};
> > > +
> > > +/**
> > > + * struct drm_xe_vm_get_property - Input of &DRM_IOCTL_XE_VM_GET_PROPERTY
> > > + *
> > > + * The user provides a VM ID and a property to query to this ioctl,
> > > + * and the ioctl returns the size of the return value. Calling the
> > > + * ioctl again with memory reserved for the data will save the
> > > + * requested property data to the data pointer.
> > > + *
> > > + * The valid properties are:
> > > + * - %DRM_XE_VM_GET_PROPERTY_FAULTS: Property is a drm_xe_faults struct of dynamic size
> > > + * - %DRM_XE_VM_GET_PROPERTY_NUM_RESETS: Property is a scalar
> >
> > We need to consider where the number of resets requirement is coming
> > from. As far as I know, the Vk extension [1] we are targeting does
> > not need this information. I'm unsure about the compute UMD
> > requirements at the moment, so let's focus on supporting the Vk
> > extension first.
> >
> > Any uAPI must also have a UMD component, so focusing on one issue at
> > a time makes sense.
>
> See first reply.
> -Jonathan Cavitt
>
> > > + */
> > > +struct drm_xe_vm_get_property {
> > > +	/** @extensions: Pointer to the first extension struct, if any */
> > > +	__u64 extensions;
> > > +
> > > +	/** @vm_id: The ID of the VM to query the properties of */
> > > +	__u32 vm_id;
> > > +
> > > +#define DRM_XE_VM_GET_PROPERTY_FAULTS		0
> > > +#define DRM_XE_VM_GET_PROPERTY_NUM_RESETS	1
> > > +	/** @property: The property to get */
> > > +	__u32 property;
> > > +
> > > +	/** @size: Size of returned property @data */
> > > +	__u32 size;
> > > +
> > > +	/** @pad: MBZ */
> > > +	__u32 pad;
> > > +
> > > +	/** @reserved: MBZ */
> > > +	__u64 reserved[2];
> >
> > I'd put the reserved bits at the end.
> >
> > > +
> > > +	/** @data: Pointer storing return data */
> > > +	__u64 data;
> >
> > 	union {
> > 		__u64 data;
> > 		__u64 ptr;
> > 	};
> >
> > We would simply return 'data' for properties that fit in a u64 and
> > perform the size=0, return size process for properties that require a
> > user allocation.
> >
> > This may not be relevant at the moment if we drop
> > DRM_XE_VM_GET_PROPERTY_NUM_RESETS.
> >
> > Matt
> >
> > > +};
> > > +
> > > /**
> > >  * struct drm_xe_exec_queue_create - Input of &DRM_IOCTL_XE_EXEC_QUEUE_CREATE
> > >  *
> > > --
> > > 2.43.0
> > >
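As an aside, the two-call flow described in the kernel-doc above would
look roughly like this from userspace (untested sketch; error handling
omitted):

	struct drm_xe_vm_get_property args = {
		.vm_id = vm_id,
		.property = DRM_XE_VM_GET_PROPERTY_FAULTS,
	};

	/* First call, args.size == 0: kernel reports the required size. */
	ioctl(fd, DRM_IOCTL_XE_VM_GET_PROPERTY, &args);

	void *buf = malloc(args.size);
	args.data = (__u64)(uintptr_t)buf;

	/* Second call: kernel writes the property data into buf. */
	ioctl(fd, DRM_IOCTL_XE_VM_GET_PROPERTY, &args);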