On Mon, 21 May 2018 15:49:40 +0100
Jean-Philippe Brucker <jean-philippe.brucker@xxxxxxx> wrote:

> On 18/05/18 19:04, Jacob Pan wrote:
> >> +struct iopf_context {
> >> +	struct device		*dev;
> >> +	struct iommu_fault_event	evt;
> >> +	struct list_head	head;
> >> +};
> >> +
> >> +struct iopf_group {
> >> +	struct iopf_context	last_fault;
> >> +	struct list_head	faults;
> >> +	struct work_struct	work;
> >> +};
> >> +
> > All the page requests in the group should belong to the same device;
> > perhaps the struct device tracking should be in iopf_group instead
> > of iopf_context?
> 
> Right, this is a leftover from when we kept a global list of partial
> faults. Since the list is now per-device, I can move the dev pointer
> (I think I should also rename iopf_context to iopf_fault, if that's
> all right).
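
Sounds good to me. Just so we mean the same thing, I guess the result
would look something like the below (rough sketch only, the exact
field placement is my guess):

	/* One entry per page request; dev moves to the group */
	struct iopf_fault {
		struct iommu_fault_event	evt;
		struct list_head		head;
	};

	struct iopf_group {
		struct iopf_fault	last_fault;
		struct list_head	faults;
		struct work_struct	work;
		struct device		*dev;
	};
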
> >> +static int iopf_complete(struct device *dev,
> >> +			 struct iommu_fault_event *evt,
> >> +			 enum page_response_code status)
> >> +{
> >> +	struct page_response_msg resp = {
> >> +		.addr			= evt->addr,
> >> +		.pasid			= evt->pasid,
> >> +		.pasid_present		= evt->pasid_valid,
> >> +		.page_req_group_id	= evt->page_req_group_id,
> >> +		.private_data		= evt->iommu_private,
> >> +		.resp_code		= status,
> >> +	};
> >> +
> >> +	return iommu_page_response(dev, &resp);
> >> +}
> >> +
> >> +static enum page_response_code
> >> +iopf_handle_single(struct iopf_context *fault)
> >> +{
> >> +	/* TODO */
> >> +	return -ENODEV;
> >> +}
> >> +
> >> +static void iopf_handle_group(struct work_struct *work)
> >> +{
> >> +	struct iopf_group *group;
> >> +	struct iopf_context *fault, *next;
> >> +	enum page_response_code status = IOMMU_PAGE_RESP_SUCCESS;
> >> +
> >> +	group = container_of(work, struct iopf_group, work);
> >> +
> >> +	list_for_each_entry_safe(fault, next, &group->faults, head) {
> >> +		struct iommu_fault_event *evt = &fault->evt;
> >> +		/*
> >> +		 * Errors are sticky: don't handle subsequent faults in the
> >> +		 * group if there is an error.
> >> +		 */
> > There might be a performance benefit for certain devices to continue
> > in spite of an error, in that the ATS retry may fault less. Perhaps
> > a policy decision for a later enhancement.
> 
> Yes, I think we should leave it to future work. My current reasoning
> is that subsequent requests are more likely to fail as well and
> reporting the error is more urgent, but we need real workloads to
> confirm this.
> 
> >> +		if (status == IOMMU_PAGE_RESP_SUCCESS)
> >> +			status = iopf_handle_single(fault);
> >> +
> >> +		if (!evt->last_req)
> >> +			kfree(fault);
> >> +	}
> >> +
> >> +	iopf_complete(group->last_fault.dev, &group->last_fault.evt,
> >> +		      status);
> >> +	kfree(group);
> >> +}
> >> +
> >> +/**
> >> + * iommu_queue_iopf - IO Page Fault handler
> >> + * @evt: fault event
> >> + * @cookie: struct device, passed to
> >> + *	iommu_register_device_fault_handler.
> >> + *
> >> + * Add a fault to the device workqueue, to be handled by mm.
> >> + */
> >> +int iommu_queue_iopf(struct iommu_fault_event *evt, void *cookie)
> >> +{
> >> +	struct iopf_group *group;
> >> +	struct iopf_context *fault, *next;
> >> +	struct iopf_device_param *iopf_param;
> >> +
> >> +	struct device *dev = cookie;
> >> +	struct iommu_param *param = dev->iommu_param;
> >> +
> >> +	if (WARN_ON(!mutex_is_locked(&param->lock)))
> >> +		return -EINVAL;
> >> +
> >> +	if (evt->type != IOMMU_FAULT_PAGE_REQ)
> >> +		/* Not a recoverable page fault */
> >> +		return IOMMU_PAGE_RESP_CONTINUE;
> >> +
> >> +	/*
> >> +	 * As long as we're holding param->lock, the queue can't be
> >> +	 * unlinked from the device and therefore cannot disappear.
> >> +	 */
> >> +	iopf_param = param->iopf_param;
> >> +	if (!iopf_param)
> >> +		return -ENODEV;
> >> +
> >> +	if (!evt->last_req) {
> >> +		fault = kzalloc(sizeof(*fault), GFP_KERNEL);
> >> +		if (!fault)
> >> +			return -ENOMEM;
> >> +
> >> +		fault->evt = *evt;
> >> +		fault->dev = dev;
> >> +
> >> +		/* Non-last request of a group. Postpone until the last one */
> >> +		list_add(&fault->head, &iopf_param->partial);
> >> +
> >> +		return IOMMU_PAGE_RESP_HANDLED;
> >> +	}
> >> +
> >> +	group = kzalloc(sizeof(*group), GFP_KERNEL);
> >> +	if (!group)
> >> +		return -ENOMEM;
> >> +
> >> +	group->last_fault.evt = *evt;
> >> +	group->last_fault.dev = dev;
> >> +	INIT_LIST_HEAD(&group->faults);
> >> +	list_add(&group->last_fault.head, &group->faults);
> >> +	INIT_WORK(&group->work, iopf_handle_group);
> >> +
> >> +	/* See if we have partial faults for this group */
> >> +	list_for_each_entry_safe(fault, next, &iopf_param->partial, head) {
> >> +		if (fault->evt.page_req_group_id == evt->page_req_group_id)
> > If all 9 bits of the group index are used, there can be lots of
> > PRGs. For a future enhancement, maybe we can have a per-group
> > partial list ready to go when the LPIG comes in? It would mean less
> > searching.
> 
> Yes, allocating the PRG from the start could be more efficient. I
> think it's slightly more complicated code, so I'd rather see
> performance numbers before implementing it.
> 
> >> +			/* Insert *before* the last fault */
> >> +			list_move(&fault->head, &group->faults);
> >> +	}
> >> +
> > If you already sorted the group list with the last fault at the end,
> > why do you need a separate entry to track the last fault?
> 
> Not sure I understand your question, sorry. Do you mean why the
> iopf_group.last_fault? Just to avoid one more kzalloc.

Kind of :) What I thought was: why can't the last fault naturally be
the last entry in your group fault list? Then there would be no need
for special treatment in the allocation of the last fault. Just my
preference.
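
To illustrate, something like the below is what I had in mind (rough
sketch on top of this patch, untested; "partial" and "next" would be
local iopf_fault pointers, and it costs one extra kzalloc per group):

	fault = kzalloc(sizeof(*fault), GFP_KERNEL);
	if (!fault)
		return -ENOMEM;
	fault->evt = *evt;

	if (!evt->last_req) {
		/* Non-last request of a group. Postpone until the last one */
		list_add(&fault->head, &iopf_param->partial);
		return IOMMU_PAGE_RESP_HANDLED;
	}

	group = kzalloc(sizeof(*group), GFP_KERNEL);
	if (!group) {
		kfree(fault);
		return -ENOMEM;
	}
	group->dev = dev;
	INIT_LIST_HEAD(&group->faults);
	INIT_WORK(&group->work, iopf_handle_group);

	/* Move this group's partial faults over to the group list */
	list_for_each_entry_safe(partial, next, &iopf_param->partial, head)
		if (partial->evt.page_req_group_id == evt->page_req_group_id)
			list_move_tail(&partial->head, &group->faults);

	/* The last fault is simply the tail of the list */
	list_add_tail(&fault->head, &group->faults);
	/* ... then queue group->work as before */
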
> >> +
> >> +	queue->flush(queue->flush_arg, dev);
> >> +
> >> +	/*
> >> +	 * No need to clear the partial list. All PRGs containing the
> >> +	 * PASID that needs to be decommissioned are whole (the device
> >> +	 * driver made sure of it before this function was called). They
> >> +	 * have been submitted to the queue by the above flush().
> >> +	 */
> > So you are saying the device driver needs to make sure the LPIG PRQ
> > is submitted in the flush call above, such that the partial list is
> > cleared?
> 
> Not exactly, it's the IOMMU driver that makes sure all LPIGs in its
> queues are submitted by the above flush call. In more detail, the
> flow is:
> 
> * Either the device driver calls unbind()/sva_device_shutdown(), or
>   the process exits.
> * If the device driver called, then it already told the device to
>   stop using the PASID. Otherwise we use the mm_exit() callback to
>   tell the device driver to stop using the PASID.
> * In either case, when receiving a stop request from the driver, the
>   device sends the LPIGs to the IOMMU queue.
> * Then, the flush call above ensures that the IOMMU reports the LPIG
>   with iommu_report_device_fault().
> * While submitting all LPIGs for this PASID to the work queue,
>   iommu_queue_iopf() also picked up all partial faults, so the
>   partial list is clean.
> 
> Maybe I should improve this comment?

Thanks for explaining. LPIG submission is done by the device
asynchronously w.r.t. the driver stopping/decommissioning the PASID.
So if we were to use this flow on VT-d, which does not stall the page
request queue, then we should use the IOMMU model specific flush()
callback to ensure the LPIG is received? There could be a queue full
condition and retry. I am just trying to understand how and where we
can make sure the LPIG is received and all groups are whole.

> > What if there are device failures where the device needs a reset
> > (either whole function level or PASID level)? Should there still be
> > a need to clear the partial list for all associated PASIDs/groups?
> 
> I guess the simplest way out, when resetting the device, is for the
> device driver to call iommu_sva_device_shutdown(). Both the IOMMU
> driver's sva_device_shutdown() and remove_device() ops should call
> iopf_queue_remove_device(), which clears the partial list.

Yes, I think that should work for a functional reset.

> Thanks,
> Jean

[Jacob Pan]