On Thu, Apr 23, 2015 at 09:38:15AM -0500, Christoph Lameter wrote:
> On Thu, 23 Apr 2015, Benjamin Herrenschmidt wrote:
[...]
> > You have something in memory, whether you got it via malloc, mmap'ing a file,
> > shmem with some other application, ... and you want to work on it with the
> > co-processor that is residing in your address space. Even better, pass a pointer
> > to it to some library you don't control which might itself want to use the
> > coprocessor ....
>
> Yes that works already. What's new about this? This seems to have been
> solved on the Intel platform f.e.

No, this has not been solved properly. Today's solution is to do an
explicit copy, again and again, and when complex data structures are
involved (lists, trees, ...) this is extremely tedious and hard to
debug. So today's solutions often restrict themselves to easy things
like matrix multiplication. But if you provide a unified address space
then you make things a lot easier for many more use cases (see the
sketch further down). That's a fact, and OpenCL 2.0, an industry
standard, is proof that a unified address space is one of the features
most requested by GPGPU users. You might not care, but the rest of the
world does.

> > What you propose can simply not provide that natural usage model with any
> > efficiency.
>
> There is no efficiency anymore if the OS can create random events in a
> computational stream that is highly optimized for data exchange of
> multiple threads at defined time intervals. If transparency or the natural
> usage model can avoid this then ok but what I see here proposed is some
> behind-the-scenes model that may severely degrade performance. And this
> does seem to go way beyond CAPI. At least the way I so far thought about
> this was as a method for cache coherency at the cache line level and about a
> way to simplify the coordination of page tables and TLBs across multiple
> divergent architectures.

Again, you are restricting yourself to your use case. Many HPC
workloads do not have stringent time constraints or synchronization
points.

>
> I think these two things need to be separated. The shift-the-memory-back-
> and-forth approach should be separate and if someone wants to use the
> thing then it should also work on other platforms like ARM and Intel.

What IBM does with its platform is its choice; it cannot force ARM,
Intel or AMD to do the same. Each of them might have a different view
on what their most important target is. For instance, I highly doubt
ARM cares about any of this.

>
> CAPI needs to be implemented as a way to potentially improve the existing
> communication paths between devices and the main processor. F.e the
> existing Infiniband MMU synchronization issues and RDMA registration
> problems could be addressed with this. The existing mechanisms for GPU
> communication could become much cleaner and easier to handle. This is all
> good but independent of any "transparent" memory implementation.

No, a transparent memory implementation is a prerequisite for
leveraging cache coherency. If an address of a given process does not
mean the same thing on a device as it does on the CPU, then cache
coherency becomes a lot harder, because you need to track several
addresses for the same physical backing storage. An N (virtual) to
1 (physical) mapping is hard. The same address, on the other hand,
makes it a lot easier to keep cache coherency distributed across device
and CPU, because they will all agree on what physical memory is backing
each address of a given process. 1 (virtual) to 1 (physical) is easier.
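To make the explicit-copy problem concrete, here is a rough C sketch.
device_submit() is only a placeholder for whatever mechanism actually
launches work on the co-processor; the second half is essentially what
OpenCL 2.0 shared virtual memory, or a unified address space in
general, lets you write:

#include <stddef.h>

/* Without a shared address space: the device cannot follow host
 * pointers, so a linked structure must be flattened into a plain
 * buffer (pointers turned into indices), copied to the device, and
 * copied again after every update.
 */
struct node { struct node *next; float payload; };

struct flat_node { int next; float payload; };	/* index instead of pointer */

size_t flatten(const struct node *head, struct flat_node *out, size_t max)
{
	size_t i = 0;

	for (; head && i < max; head = head->next, i++) {
		out[i].payload = head->payload;
		out[i].next = head->next ? (int)(i + 1) : -1;
	}
	return i;	/* must be redone every time the list changes */
}

/* With a unified address space: the device resolves the same virtual
 * addresses as the CPU, so the very same pointer can be handed over
 * directly, even to a library you do not control.
 */
extern void device_submit(struct node *head);	/* placeholder launch call */

void process_on_device(struct node *head)
{
	device_submit(head);	/* no flattening, no copy, no re-sync */
}

The first half is what people end up writing today for every list or
tree they want to offload; the second half is the whole point of a
unified address space.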
> > It might not be *your* model based on *your* application but that doesn't mean
> > it's not there, and isn't relevant.
>
> Sadly this is the way that an entire industry does its thing.

Again, no, you are wrong: the HPC industry is not only about latency.
Only time-critical applications care about latency; everyone else cares
about throughput, and such applications can run for days, weeks or
months before producing any usable/meaningful result. Many of them do
not care one bit about latency because they perform independent
computations.

Take a company rendering a movie, for instance: they want to render
millions of frames as fast as possible, but each frame can be rendered
independently. The only shared data is the input geometry, textures and
lighting, and those are constant; the rendering of one frame does not
depend on the rendering of the previous one (leaving post-processing
like motion blur aside).

The same applies if you do some data mining. You might want to find all
occurrences of a specific sequence in a large data pool. You can slice
your data pool, run an independent job per slice, and only aggregate
the results of the jobs at the end, or as they finish (a rough sketch
of this slice-and-aggregate pattern is appended below).

I will not go on and on about all the things that do not care about
latency; I am just trying to open your eyes to the world that exists
out there.

Cheers,
Jérôme
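The appended sketch of the slice-and-aggregate pattern, using plain
pthreads; the byte value 0x42, the pool size and NWORKERS are made up
for illustration, they stand in for the real sequence search:

#include <pthread.h>
#include <stddef.h>
#include <stdio.h>

#define NWORKERS 4

struct slice { const unsigned char *data; size_t len; size_t hits; };

/* Each worker scans its own slice independently; no worker ever waits
 * on another, so the latency of any single job does not matter, only
 * the aggregate throughput does.
 */
static void *count_slice(void *arg)
{
	struct slice *s = arg;

	s->hits = 0;
	for (size_t i = 0; i < s->len; i++)
		if (s->data[i] == 0x42)		/* stand-in "sequence" */
			s->hits++;
	return NULL;
}

int main(void)
{
	static unsigned char pool[1 << 20];	/* stand-in for the data pool */
	pthread_t tid[NWORKERS];
	struct slice sl[NWORKERS];
	size_t chunk = sizeof(pool) / NWORKERS, total = 0;

	for (int i = 0; i < NWORKERS; i++) {
		sl[i].data = pool + i * chunk;
		sl[i].len = chunk;
		pthread_create(&tid[i], NULL, count_slice, &sl[i]);
	}
	for (int i = 0; i < NWORKERS; i++) {
		pthread_join(tid[i], NULL);
		total += sl[i].hits;		/* aggregate only at the end */
	}
	printf("%zu hits\n", total);
	return 0;
}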