Re: [LSF/MM ATTEND] HMM (heterogeneous memory manager) and GPU

On Wed, Feb 3, 2016 at 1:35 PM, David Woodhouse <dwmw2@xxxxxxxxxxxxx> wrote:
> On Wed, 2016-02-03 at 13:07 +0200, Oded Gabbay wrote:
>> > Another, perhaps trivial, question.
>> > When there is an address fault, who handles it ? the SVM driver, or
>> > each device driver ?
>> >
>> > In other words, is the model the same as (AMD) IOMMU where it binds
>> > amd_iommu driver to the IOMMU H/W, and that driver (amd_iommu/v2) is
>> > the only one which handles the PPR events ?
>> >
>> > If that is the case, then with SVM, how will the device driver be made
>> > aware of faults, if the SVM driver won't notify it about them,
>> > because it has already severed the connection between PASID and
>> > process ?
>
> In the ideal case, there's no need for the device driver to get
> involved at all. When a page isn't found in the page tables, the IOMMU
> code calls handle_mm_fault() and either populates the page and sends
> a 'success' response, or sends an 'invalid fault' response back.
>
> To account for broken hardware, we *have* added a callback into the
> device driver when these faults happen. Ideally it should never be
> used, of course.
>
> In the case where the process has gone away, the PASID is still
> assigned and we still hold mm_count on the MM, just not mm_users. This
> callback into the device driver still occurs if a fault happens during
> process exit between the exit_mm() and exit_files() stage.
>
>> And another question, if I may, aren't you afraid of "false positive"
>> prints to dmesg ? I mean, I'm pretty sure page faults / pasid faults
>> errors will be logged somewhere, probably to dmesg. Aren't you
>> concerned of the users seeing those errors and thinking they may have
>> a bug, while actually the errors were only caused by process
>> termination ?
>
> If that's the case, it's easy enough to silence them. We are already
> explicitly testing for the 'defunct mm' case in our fault handler, to
> prevent us from faulting more pages into an obsolescent MM after its
> mm_users reaches zero and its page tables are supposed to have been
> torn down. That's the 'if (!atomic_inc_not_zero(&svm->mm->mm_users))
> goto bad_req;' part.
>
>> Or in that case you say that the application is broken, because if it
>> still had something running in the H/W, it should not have closed
>> itself ?
>
> That's also true but it's still nice to avoid confusion. Even if only
> to disambiguate cause and effect — we don't want people to see PASID
> faults which were caused by the process crashing, and to think that
> they might be involved in *causing* that process to crash...

Yes, that's why in our model, we aim to kill all running waves
*before* the amd_iommu_v2 driver unbinds the PASID.

>
> --
> David Woodhouse                            Open Source Technology Centre
> David.Woodhouse@xxxxxxxxx                              Intel Corporation
>


It seems you have most of your bases covered. I'll stop harassing you now :)
But in all seriousness, it's interesting to see the different
approaches taken to handling essentially the same type of H/W (IOMMU).

Thanks for your patience in answering my questions.

Oded

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .


