Re: [LSF/MM ATTEND] HMM (heterogeneous memory manager) and GPU

David Woodhouse <dwmw2@xxxxxxxxxxxxx> · Wed, 03 Feb 2016 11:35:50 +0000

On Wed, 2016-02-03 at 13:07 +0200, Oded Gabbay wrote:
> > Another, perhaps trivial, question.
> > When there is an address fault, who handles it ? the SVM driver, or
> > each device driver ?
> >
> > In other words, is the model the same as (AMD) IOMMU where it binds
> > amd_iommu driver to the IOMMU H/W, and that driver (amd_iommu/v2) is
> > the only one which handles the PPR events ?
> >
> > If that is the case, then with SVM, how will the device driver be made
> > aware of faults, if the SVM driver won't notify him about them,
> > because it has already severed the connection between PASID and
> > process ?

In the ideal case, there's no need for the device driver to get
involved at all. When a page isn't found in the page tables, the IOMMU
code calls handle_mm_fault() and either populates the page and sends a
a 'success' response, or sends an 'invalid fault' response back.

To account for broken hardware, we *have* added a callback into the
device driver when these faults happen. Ideally it should never be
used, of course.

In the case where the process has gone away, the PASID is still
assigned and we still hold mm_count on the MM, just not mm_users. This
callback into the device driver still occurs if a fault happens during
process exit between the exit_mm() and exit_files() stage.

> And another question, if I may, aren't you afraid of "false positive"
> prints to dmesg ? I mean, I'm pretty sure page faults / pasid faults
> errors will be logged somewhere, probably to dmesg. Aren't you
> concerned of the users seeing those errors and thinking they may have
> a bug, while actually the errors were only caused by process
> termination ?

If that's the case, it's easy enough to silence them. We are already
explicitly testing for the 'defunct mm' case in our fault handler, to
prevent us from faulting more pages into an obsolescent MM after its
mm_users reaches zero and its page tables are supposed to have been
torn down. That's the 'if(!atomic_inc_not_zere(&svm->mm->mm_users))
goto bad_req;' part.

> Or in that case you say that the application is broken, because if it
> still had something running in the H/W, it should not have closed
> itself ?

That's also true but it's still nice to avoid confusion. Even if only
to disambiguate cause and effect — we don't want people to see PASID
faults which were caused by the process crashing, and to think that
they might be involved in *causing* that process to crash...

-- 
David Woodhouse                            Open Source Technology Centre
David.Woodhouse@xxxxxxxxx                              Intel Corporation

Attachment:
smime.p7s

Description: S/MIME cryptographic signature