Re: [PATCH 6/6] drm/xe/xe_vm: Implement xe_vm_get_property_ioctl

On Thu, Feb 27, 2025 at 09:51:15AM -0700, Cavitt, Jonathan wrote:
> Some responses below.  If I skip over anything, just assume that I'm taking the request
> into consideration and that it will be fixed for version 2 of this patch series.
> 
> -----Original Message-----
> From: Brost, Matthew <matthew.brost@xxxxxxxxx> 
> Sent: Thursday, February 27, 2025 12:25 AM
> To: Cavitt, Jonathan <jonathan.cavitt@xxxxxxxxx>
> Cc: intel-xe@xxxxxxxxxxxxxxxxxxxxx; Gupta, saurabhg <saurabhg.gupta@xxxxxxxxx>; Zuo, Alex <alex.zuo@xxxxxxxxx>; joonas.lahtinen@xxxxxxxxxxxxxxx; Zhang, Jianxun <jianxun.zhang@xxxxxxxxx>; dri-devel@xxxxxxxxxxxxxxxxxxxxx
> Subject: Re: [PATCH 6/6] drm/xe/xe_vm: Implement xe_vm_get_property_ioctl
> > 
> > On Wed, Feb 26, 2025 at 10:55:56PM +0000, Jonathan Cavitt wrote:
> > > Add support for userspace to get various properties from a specified VM.
> > > The currently supported properties are:
> > > 
> > > - The number of engine resets the VM has observed
> > > - The number of exec queue bans the VM has observed, up to the last 50
> > >   relevant ones, and how many of those were caused by faults.
> > > 
> > > The latter request also includes information on the exec queue bans,
> > > such as the ID of the banned exec queue, whether the ban was caused by a
> > > pagefault or not, and the address and address type of the associated
> > > fault (if one exists).
> > > 
> > 
> > > Signed-off-by: Jonathan Cavitt <jonathan.cavitt@xxxxxxxxx>
> > > Suggested-by: Matthew Brost <matthew.brost@xxxxxxxxx>
> > > ---
> [...]
> > 
> > > +
> > > +struct drm_xe_ban {
> > > +	/** @exec_queue_id: ID of banned exec queue */
> > > +	__u32 exec_queue_id;
> > 
> > I don't think we can reliably associate a page fault with an
> > exec_queue_id at the moment, given my above statement about having to
> > capture all state at the time of the page fault. Maybe we could with
> > some tricks between the page fault and the IOMMU CAT error G2H?
> > Regardless, let's ask the UMD we are targeting [1] if this information
> > would be helpful. It would seemingly have to be vendor-specific
> > information, not part of the generic Vk information.
> > 
> > Additionally, it might be good to ask what other vendor-specific
> > information, if any, we'd need here based on what the current page fault
> > interface supports.
> > 
> > [1] https://registry.khronos.org/vulkan/specs/latest/man/html/VK_EXT_device_fault.html
> 
> The original request was something along the lines of having a mirror of the
> DRM_IOCTL_I915_GET_RESET_STATS on XeKMD.  Those reset stats contain
> information on the "context" ID, which maps to the exec queue ID on XeKMD.
> 
> Even if we can't reasonably blame a pagefault on a particular exec queue, in
> order to match the request correctly, this information needs to be returned.
> 
> The I915 reset stats also contain information on the number of observed engine
> resets, so that needs to be returned as well.
> 
> @joonas.lahtinen@xxxxxxxxxxxxxxx can provide more details.  Or maybe
> @Mistat, Tomasz.
> 

You haven't really answered my question here, or below where you say
"see first reply". We need a UMD use case posted alongside any uAPI
changes before merging them. I know the above Vk extension is going to
be implemented on top of this series, but it is very unclear where the
number-of-resets requirement / UMD use case is coming from, which makes
this impossible to review.

Again, I suggest focusing on the Vk use case first, or talking to our
UMD partners to figure out exactly why something similar to
DRM_IOCTL_I915_GET_RESET_STATS is required in Xe. I have made similar
comments on VLK-69424.

Matt

> > 
> > > +	/** @faulted: Whether or not the ban has an associated pagefault.  0 is no, 1 is yes */
> > > +	__u32 faulted;
> > > +	/** @address: Address of the fault, if relevant */
> > > +	__u64 address;
> > > +	/** @address_type: enum drm_xe_fault_address_type, if relevant */
> > > +	__u32 address_type;
> > 
> > We likely need a fault_size field to support VkDeviceSize
> > addressPrecision; as defined here [2]. I believe we can extract this
> > information from pagefault.fault_level.
> > 
> > [2] https://registry.khronos.org/vulkan/specs/latest/man/html/VkDeviceFaultAddressInfoEXT.html
> 
> I can add this field as a prototype, though it will always return SZ_4K until we
> can have a longer discussion on how to map between the fault_level and the
> fault_size.
> 
> > 
> > > +	/** @pad: MBZ */
> > > +	__u32 pad;
> > > +	/** @reserved: MBZ */
> > > +	__u64 reserved[3];
> > > +};
> > > +
> > > +struct drm_xe_faults {
> > > +	/** @num_faults: Number of faults observed on the VM */
> > > +	__u32 num_faults;
> > > +	/** @num_bans: Number of bans observed on the VM */
> > > +	__u32 num_bans;
> > 
> > I don't think num_bans and num_faults really provide any benefit for
> > supporting [1]. The requirement for [1] is device faults, nothing more.
> > With that in mind, I'd lean toward an array of a single structure
> > (returned in drm_xe_vm_get_property.data, number of faults can be
> > inferred from the returned size) reporting all faults, with each entry
> > containing all the fault information. If another use case arises for
> > reporting all banned queues, we can add a property for that.
> 
> I'm fairly certain the full ban list was directly requested, but I can break
> it into a third query at least.
> 
> Also, the abstraction is done here because that's how copy_from_user
> has historically been used.  I'd rather not experiment with trying to
> copy_from_user a structure array and bungling it, but I guess I can give
> it a try at least...
> 
> > 
> > > +	/** @reserved: MBZ */
> > > +	__u64 reserved[2];
> > > +	/** @list: Dynamic sized array of drm_xe_ban bans */
> > > +	struct drm_xe_ban list[];
> > 
> > list[0] would be the preferred way.
> 
> That is not how dynamic arrays are handled for
> struct drm_xe_query_engines,
> struct drm_xe_query_mem_regions,
> struct drm_xe_query_config,
> struct drm_xe_query_gt_list,
> struct drm_xe_query_topology_mask,
> struct drm_xe_oa_unit,
> or
> struct drm_xe_query_oa_units
> 
> > 
> > > +};
> > > +
> > > +/**
> > > + * struct drm_xe_vm_get_property - Input of &DRM_IOCTL_XE_VM_GET_PROPERTY
> > > + *
> > > + * The user provides a VM ID and a property to query to this ioctl,
> > > + * and the ioctl returns the size of the return value.  Calling the
> > > + * ioctl again with memory reserved for the data will save the
> > > + * requested property data to the data pointer.
> > > + *
> > > + * The valid properties are:
> > > + *  - %DRM_XE_VM_GET_PROPERTY_FAULTS : Property is a drm_xe_faults struct of dynamic size
> > > + *  - %DRM_XE_VM_GET_PROPERTY_NUM_RESETS: Property is a scalar
> > 
> > We need to consider where the number of resets requirement is coming
> > from. As far as I know, the Vk extension [1] we are targeting does not
> > need this information. I'm unsure about the compute UMD requirements at
> > the moment, so let's focus on supporting the Vk extension first.
> > 
> > Any uAPI must also have a UMD component, so focusing on one issue at a
> > time makes sense.
> 
> See first reply.
> -Jonathan Cavitt
> 
> > 
> > > + */
> > > +struct drm_xe_vm_get_property {
> > > +	/** @extensions: Pointer to the first extension struct, if any */
> > > +	__u64 extensions;
> > > +
> > > +	/** @vm_id: The ID of the VM to query the properties of */
> > > +	__u32 vm_id;
> > > +
> > > +#define DRM_XE_VM_GET_PROPERTY_FAULTS		0
> > > +#define DRM_XE_VM_GET_PROPERTY_NUM_RESETS	1
> > > +	/** @property: The property to get */
> > > +	__u32 property;
> > > +
> > > +	/** @size: Size of returned property @data */
> > > +	__u32 size;
> > > +
> > > +	/** @pad: MBZ */
> > > +	__u32 pad;
> > > +
> > > +	/** @reserved: MBZ */
> > > +	__u64 reserved[2];
> > 
> > I'd put the reserved bits at the end.
> > 
> > > +
> > > +	/** @data: Pointer storing return data */
> > > +	__u64 data;
> > 
> > union {
> > 	__u64 data;
> > 	__u64 ptr;
> > };
> > 
> > We would simply return 'data' for properties that fit in a u64 and
> > perform the size=0, return size process for properties that require a
> > user allocation.
> > 
> > This may not be relevant at the moment if we drop
> > DRM_XE_VM_GET_PROPERTY_NUM_RESET.
> > 
> > Matt
> > 
> > > +};
> > > +
> > >  /**
> > >   * struct drm_xe_exec_queue_create - Input of &DRM_IOCTL_XE_EXEC_QUEUE_CREATE
> > >   *
> > > -- 
> > > 2.43.0
> > > 
> > 


