On 17-05-12 04:43 AM, Christian König wrote:
> Am 12.05.2017 um 10:37 schrieb zhoucm1:
>>
>> If the SDMA is faster, even if they wait for it to finish, the time is
>> shorter than with the CPU, isn't it? Of course, the precondition is
>> that the SDMA is exclusive. They can reserve an SDMA engine for PT
>> updating.
>>
> No, if I understood Felix's numbers correctly, the setup and wait time
> for SDMA is a bit (but not much) longer than doing it with the CPU.

I'm skeptical of claims that SDMA is faster. Even when you use SDMA to
write the page table, the CPU still has to do about the same amount of
work writing PTEs into the SDMA IBs. SDMA can only save CPU time in
certain cases:

* Copying PTEs from the GART table if they are on the same GPU (not
  possible on Vega10 due to different MTYPE bits)
* Generating PTEs for contiguous VRAM BOs

At least for system memory BOs, writing the PTEs directly to
write-combining VRAM should be faster than writing them to cached
system memory IBs first and then kicking off an SDMA transfer and
waiting for its completion.

> What would really help is to fix the KFD design and work with async
> page table updates there as well.

That problem goes much higher up the stack than just KFD. It would
affect the memory management interfaces in the HSA runtime and HCC. The
basic idea is to make the GPU behave very similarly to a CPU and to have
multi-threaded code where some threads run on the CPU and others on the
GPU almost seamlessly. You allocate memory and then use the same pointer
in your CPU and GPU threads. Exposing the messiness of asynchronous page
table updates all the way up to the application would destroy that
programming model.

In this model, latency matters most. The longer it takes to kick off a
parallel GPU processing job, the less efficient the scaling you get from
the GPU's parallel processing capabilities. Exposing asynchronous memory
management up the stack would allow the application to hide the latency
in some cases (if it can do other useful things in the meantime), but it
doesn't make the latency disappear.

An application that wants to hide memory management latency can already
do this, even with the existing programming model, by separating memory
management and processing into separate threads.

Regards,
  Felix
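
P.S.: To illustrate the last point, here is a rough sketch of what
separating memory management and processing into two threads could look
like. map_buffer_for_gpu() and launch_kernel() are placeholder names,
not a real HSA runtime or HCC API, and launch_kernel() is assumed to
block until the kernel finishes:

// Rough sketch only. map_buffer_for_gpu() and launch_kernel() are
// hypothetical placeholders for whatever the runtime provides;
// launch_kernel() is assumed to be synchronous.
#include <cstddef>
#include <future>
#include <utility>
#include <vector>

void map_buffer_for_gpu(void *ptr, size_t size); // allocation/registration/PT update
void launch_kernel(void *ptr);                   // GPU work consuming the buffer

void process(const std::vector<std::pair<void *, size_t>> &buffers)
{
    if (buffers.empty())
        return;

    // Map the first buffer up front; every later mapping overlaps with
    // the kernel running on the previous buffer.
    map_buffer_for_gpu(buffers[0].first, buffers[0].second);

    std::future<void> prefetch;
    for (size_t i = 0; i < buffers.size(); ++i) {
        if (i + 1 < buffers.size())
            prefetch = std::async(std::launch::async, map_buffer_for_gpu,
                                  buffers[i + 1].first, buffers[i + 1].second);

        launch_kernel(buffers[i].first); // next mapping's latency is hidden here

        if (prefetch.valid())
            prefetch.wait();             // buffers[i+1] is mapped before the next pass
    }
}

The mapping latency is still there; it is just overlapped with useful
GPU work, which is something an application can do today without any
change to the programming model.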