Re: [PATCH v2 01/40] iommu: Introduce Shared Virtual Addressing API

Christian König <christian.koenig@xxxxxxx> · Wed, 12 Sep 2018 14:56:00 +0200

Am 12.09.2018 um 14:40 schrieb Jean-Philippe Brucker:
On 08/09/2018 08:29, Christian König wrote:
Yes, exactly. I just need a PASID which is never used by the OS for a
process and we can easily give that back when the last FD reference is
closed.
Alright, iommu-sva can get its PASID from this external allocator as
well, as long as it has an interface similar to idr. Where would it go,
drivers/base/, mm/, kernel/...?

Good question, my initial instinct was to put it under drivers/pci.

But AFAIKS now you are supporting SVA implementations which are not 
based on PCI.

So drivers/base sounds like a good place to me.

The process dies, iommu-sva is notified and calls the mm_exit()
function passed by the device driver to iommu_sva_device_init(). In
mm_exit() the device driver needs to clear any reference to the
PASID in hardware and in its own structures. When the device driver
returns from mm_exit(), it effectively tells the core that it has
finished using the PASID, and iommu-sva can reuse the PASID for
another process. mm_exit() is allowed to block, so the device
driver has time to clean up and flush the queues.

If the device driver finishes using the PASID before the process
exits, it just calls unbind().
Exactly that's what Michal Hocko is probably going to not like at all.

Can we have a different approach where each driver is informed by the
mm_exit(), but needs to explicitly call unbind() before a PASID is
reused?
It's awful from the IOMMU driver perspective. In addition to "enabled"
and "disabled" PASID states, you add "disabled but DMA still running
normally". Between that new state and "disabled", the IOMMU will be
flooded by translation faults (non-recoverable ones), which it needs to
ignore instead of reporting to the kernel. Not all IOMMUs can deal with
this in hardware (SMMU and VT-d can quiesce translation faults
per-PASID, but I don't think AMD IOMMU can.) Some drivers will have to
filter fault events themselves, depending on the PASID state.

Puh, yeah that is probably true.

Ok let us skip that for a moment, we just need to invest more work in 
killing DMA operations quickly when the process address space is teared 
down.

During that teardown transition it would be ideal if that PASID only
points to a dummy root page directory with only invalid entries.

I guess this can be vendor specific, In VT-d I plan to mark PASID
entry not present and disable fault reporting while draining remaining
activities.
Sounds good to me.

Point is at least in the case where the process was killed by the OOM
killer we should not block in mm_exit().

Instead operations issued by the process to a device driver which uses
SVA needs to be terminated as soon as possible to make sure that the OOM
killer can advance.
I don't see how we're preventing the OOM killer from advancing, so I'm
looking for a stronger argument that justifies adding this complexity to
IOMMU drivers. Time limit of the release MMU notifier, locking
requirement, a concrete example where things break, even a comment
somewhere in mm/ would do...

In my tests I can't manage to disturb the OOM killer, but I could be
missing some special case. Even if the mm_exit() callback
(unrealistically) sleeps for 60 seconds,

Well you are *COMPLETELY* under estimating this. A compute task with a 
huge wave launch can take multiple minutes to tear down.

That's why I'm so concerned about that, but to be honest I think that 
just the hardware needs to become better and we need to be able to block 
dead tasks from spawning threads again.

Regards,
Christian.

  the OOM killer isn't affected
because oom_reap_task_mm() wipes the victim's address space in another
thread, either before or while the release notifier is running.

Thanks,
Jean