Re: [RFC] Heterogeneous memory management (mirror process address space on a device mmu).

Jerome Glisse <j.glisse@xxxxxxxxx> · Wed, 7 May 2014 08:39:49 -0400

On Wed, May 07, 2014 at 05:14:52PM +1000, Benjamin Herrenschmidt wrote:
> On Tue, 2014-05-06 at 12:18 -0400, Jerome Glisse wrote:
> > 
> > I do understand that i was pointing out that if i move to, tlb which i
> > am fine with, i will still need to sleep there. That's all i wanted to
> > stress, i did not wanted force using mmu_notifier, i am fine with them
> > becoming atomic as long as i have a place where i can intercept cpu
> > page table update and propagate them to device mmu.
> 
> Your MMU notifier can maintain a map of "dirty" PTEs and you do the
> actual synchronization in the subsequent flush_tlb_* , you need to add
> hooks there but it's much less painful than in the notifiers.

Well, getting the dirty info back from the GPU also requires sleeping. Maybe
I should explain how it is supposed to work. A GPU has several command buffers
and executes the instructions inside each command buffer in sequential order.
To update the GPU mmu you need to schedule a command into one of those command
buffers, but when you do so you do not know how many commands are in front of
yours or how long it will take the GPU to get to your command.

Yes, the GPUs this patchset targets have preemption, but it is not as flexible
as CPU preemption: there is no kernel thread running and scheduling, all the
scheduling is done in hardware. So preemption is more limited than on the
CPU.

That is why any update to, or information retrieval from, the GPU needs to go
through some command buffer, and no matter how high the priority of the command
buffer for mmu updates is, it can still take a long time (think flushing
thousands of GPU threads and saving their context).
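
To make the ordering problem concrete, here is a minimal userspace sketch of
one command buffer modelled as a FIFO ring. Everything in it (struct cmd_ring,
ring_submit, the per-command "cost" in ticks) is a hypothetical illustration
and not any real driver API; it only shows that the wait for an mmu update
depends on whatever is already queued in front of it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one GPU command buffer; not a real driver API. */
#define RING_SIZE 64

enum cmd_kind { CMD_DRAW, CMD_MMU_UPDATE };

struct cmd {
    enum cmd_kind kind;
    unsigned cost;          /* made-up execution time, arbitrary ticks */
};

struct cmd_ring {
    struct cmd slots[RING_SIZE];
    size_t head;            /* next command the GPU will execute */
    size_t tail;            /* next free slot on the CPU side */
};

static void ring_init(struct cmd_ring *r)
{
    r->head = r->tail = 0;
}

/* Number of commands queued and not yet executed. */
static size_t ring_pending(const struct cmd_ring *r)
{
    return r->tail - r->head;
}

/* Queue a command; returns its sequence number, or (size_t)-1 if full. */
static size_t ring_submit(struct cmd_ring *r, enum cmd_kind kind, unsigned cost)
{
    if (ring_pending(r) == RING_SIZE)
        return (size_t)-1;
    size_t seq = r->tail++;
    r->slots[seq % RING_SIZE] = (struct cmd){ kind, cost };
    return seq;
}

/*
 * Ticks until command 'seq' has completed: the GPU executes strictly in
 * order, so this is the cost of every command from head up to and
 * including 'seq'. The submitter cannot bound this in advance.
 */
static unsigned ring_ticks_until_done(const struct cmd_ring *r, size_t seq)
{
    unsigned ticks = 0;

    assert(seq - r->head < ring_pending(r));   /* seq must still be queued */
    for (size_t i = r->head; i != seq + 1; i++)
        ticks += r->slots[i % RING_SIZE].cost;
    return ticks;
}
```

In this model an mmu update that itself costs one tick, queued behind two
draws costing 100 and 250 ticks, completes only after 351 ticks. That is the
reason the caller has to be able to sleep.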

> 
> *However* Linus, even then we can't sleep. We do things like
> ptep_clear_flush() that need the PTL and have the synchronous flush
> semantics.
> 
> Sure, today we wait, possibly for a long time, with IPIs, but we do not
> sleep. Jerome would have to operate within a similar context. No sleep
> for you :)
> 
> Cheers,
> Ben.

So for the ptep_clear_flush my idea is to have a special lru for pages that
are in use by the GPU. This would keep page reclamation from calling
try_to_unmap on them, and thus avoid the ptep_clear_flush. I would also block
ksm for such pages, so that is another user that would not do
ptep_clear_flush. I would need to fix remap_file_pages, either by adding some
callback there or by refactoring the unmap and tlb flushing.

Finally, for page migration I see several solutions: forbid it (easy for me
but likely not what we want), have special code inside the migrate code to
handle pages in use by a device, or have special code inside try_to_unmap to
handle it.

I think these are all the current users of ptep_clear_flush and its
derivatives that flush the tlb while holding a spinlock.

Note that for a special lru, or even for special handling of pages in use by
a device, I need a new page flag. Would this be acceptable?

For the special lru I was thinking of doing it per device, as each device is
unlikely to constantly address all the pages it has mapped anyway. A simple
lru list would do, probably with some helper for device drivers to mark pages
accessed so that frequently used pages are not reclaimed.

But a global list is fine as well, and it simplifies the case where different
devices use the same pages.
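
A rough userspace sketch of the per-device lru with a mark-accessed helper
could look like the following. All names here (struct dev_page, struct
dev_lru, dev_lru_mark_accessed and so on) are hypothetical; real kernel code
would presumably use struct list_head, struct page and proper locking, none of
which is modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a page tracked by a device; not struct page. */
struct dev_page {
    struct dev_page *prev, *next;
    unsigned long pfn;
};

/*
 * Per-device lru as a circular list with a sentinel node:
 * head.next is the most recently used page, head.prev the coldest.
 */
struct dev_lru {
    struct dev_page head;
};

static void dev_lru_init(struct dev_lru *lru)
{
    lru->head.prev = lru->head.next = &lru->head;
}

/* Detach a page from whatever list it is on. */
static void dev_lru_unlink(struct dev_page *p)
{
    p->prev->next = p->next;
    p->next->prev = p->prev;
    p->prev = p->next = p;
}

/* Insert a page at the hot end of the list. */
static void dev_lru_add(struct dev_lru *lru, struct dev_page *p)
{
    p->next = lru->head.next;
    p->prev = &lru->head;
    lru->head.next->prev = p;
    lru->head.next = p;
}

/* Helper a driver would call when the device touches a page. */
static void dev_lru_mark_accessed(struct dev_lru *lru, struct dev_page *p)
{
    dev_lru_unlink(p);
    dev_lru_add(lru, p);
}

/* Pick a reclaim candidate: the coldest page, or NULL if the lru is empty. */
static struct dev_page *dev_lru_pop_coldest(struct dev_lru *lru)
{
    struct dev_page *p = lru->head.prev;

    if (p == &lru->head)
        return NULL;
    dev_lru_unlink(p);
    return p;
}
```

With pages a, b and c added in that order, a is the coldest. After the driver
marks a accessed, b becomes the first reclaim candidate.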

Cheers,
Jérôme Glisse
