On 2018-03-12 03:37 PM, Daniel Vetter wrote:
> On Mon, Mar 12, 2018 at 7:17 PM, Felix Kuehling <felix.kuehling at amd.com> wrote:
>> On 2018-03-07 03:34 PM, Felix Kuehling wrote:
>>>> Again stop worrying about ioctl overhead, this isn't Windows. If you
>>>> can show the overhead as being a problem then address it, but I
>>>> think it's premature worrying about it at this stage.
>>> I'd like syscall overhead to be small. But with recent kernel page table
>>> isolation, NUMA systems and lots of GPUs, I think this may not be
>>> negligible. For example we're working with some Intel NUMA systems and 8
>>> GPUs for HPC or deep learning applications. I'll be measuring the
>>> overhead on such systems and get back with results in a few days. I want
>>> to have an API that can scale to such applications.
>> I ran some tests on a 2-socket Xeon E5-2680 v4 with 56 CPU threads and 8
>> Vega10 GPUs. The kernel was 4.16-rc1 based with KPTI enabled and a
>> kernel config based on a standard Ubuntu kernel. No debug options were
>> enabled. My test application measures KFD memory management API
>> performance for allocating, mapping, unmapping and freeing 1000 buffers
>> of different sizes (4K, 16K, 64K, 256K) and memory types (VRAM and
>> system memory). The impact of ioctl overhead depended on whether the
>> page table update was done by CPU or SDMA.
>>
>> I averaged 10 runs of the application and also calculated the standard
>> deviation to see if my results were just random noise.
>>
>> With SDMA using a single ioctl was about 5% faster for mapping and 10%
>> faster for unmapping. The standard deviation was 2.5% and 7.5% respectively.
>>
>> With CPU a single ioctl was 2.5% faster for mapping, 18% faster for
>> unmapping. Standard deviation was 0.2% and 3% respectively.
> btw for statistics student's t-distribution is usually the measure to
> tell "is this the same distribution or not".
> Works much more robustly
> if you're dealing with odd shapes of your measured distributions,
> which can happen easily (e.g. if it bifurcates into a fast vs.
> slowpath or similar stuff).
>
> Also for my understanding: This was 1 ioctl to map 1 buffer on 8 gpus
> vs. 8 ioctl to map 1 buffer on 1 of the 8 gpus?

The task is the same in both cases: map one buffer on all 8 GPUs. In one
case it uses 9 ioctls (1 map call per GPU and 1 call to synchronize with
SDMA and flush GPU TLBs). In the other case it's 1 ioctl doing all those
things.

> Do we have benchmarks that show overall impact? I'm assuming that your
> workloads won't spend all day long mapping/unmapping stuff, but also
> will do some computing :-)

I don't. This was done with a micro benchmark. In real applications the
impact is going to be much smaller. I tested one application that I know
does a lot of memory mappings mixed in between computations (lulesh-cl
from https://github.com/AMDComputeLibraries/ComputeApps/). But it only
maps on one GPU, so the impact was minimal (maybe 1%) and probably not
statistically significant.

>
> Can you also give numbers without KPTI? Afaiui AMD mostly doesn't need
> it, and Intel will eventually fix it too, so this overhead should
> disappear again. Just want to get a full picture here.

Before I got time on the Intel system I ran less rigorous experiments on
an AMD Threadripper with KPTI off and KPTI forced on. I don't have exact
numbers from those tests. With KPTI off the ioctl overhead was not
measurable. With KPTI on it was about the same or slightly bigger than
on the Intel system.

Regards,
  Felix

> -Daniel
>
>> For unmapping the difference was bigger than mapping because unmapping
>> is faster to begin with, so the system call overhead is bigger in
>> proportion. Mapping of a single buffer to 8 GPUs takes about 220us with
>> SDMA or 190us with CPU with only minor dependence on buffer size and
>> memory type. Unmapping takes about 35us with SDMA or 13us with CPU.
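
The point about proportional overhead can be made concrete with a little
arithmetic on the figures quoted above. The sketch below assumes (this is
not stated in the thread) that each quoted time is the multi-ioctl
baseline and that "N% faster" means N% of that baseline was saved; the
helper name saved_us is only illustrative.

```python
# Figures quoted in the thread: approximate time in microseconds to map or
# unmap one buffer on 8 GPUs, and the reported single-ioctl speedup.
cases = {
    ("map", "SDMA"): (220.0, 0.05),
    ("map", "CPU"): (190.0, 0.025),
    ("unmap", "SDMA"): (35.0, 0.10),
    ("unmap", "CPU"): (13.0, 0.18),
}


def saved_us(total_us, speedup):
    # ASSUMPTION: total_us is the multi-ioctl baseline and speedup is the
    # fraction of that baseline saved by the single-ioctl path.
    return total_us * speedup


for (op, method), (total_us, speedup) in cases.items():
    print(f"{op:5s} {method:4s}: about {saved_us(total_us, speedup):.2f} us saved "
          f"out of {total_us:.0f} us")
```

Under those assumptions the absolute saving stays within a few
microseconds in every case, while the total time differs by more than an
order of magnitude between mapping and unmapping, which is why the
percentage looks larger for unmapping.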
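
On the statistics question raised in this thread, one way to compare two
sets of benchmark runs is Welch's unequal-variance t-test, a variant of
Student's t-test. Choosing Welch's form is an assumption here, not
something the thread specifies. The sketch below uses only the Python
standard library; the function name welch_t is illustrative and the
sample values are hypothetical placeholders, not the measurements
reported in this thread.

```python
import math
import statistics


def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance t-test on two samples."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    # Sample variances (n - 1 denominator), each divided by its sample size.
    var_a = statistics.variance(a) / len(a)
    var_b = statistics.variance(b) / len(b)
    t = (mean_a - mean_b) / math.sqrt(var_a + var_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (var_a + var_b) ** 2 / (
        var_a ** 2 / (len(a) - 1) + var_b ** 2 / (len(b) - 1)
    )
    return t, df


# HYPOTHETICAL per-run mapping times in microseconds (not measured data).
multi_ioctl = [231.0, 228.5, 235.2, 229.9, 233.1, 230.4, 236.0, 227.8, 232.6, 234.3]
single_ioctl = [220.1, 218.7, 223.9, 219.5, 221.8, 217.6, 224.4, 220.9, 222.3, 219.0]

t, df = welch_t(multi_ioctl, single_ioctl)
print(f"t = {t:.2f}, df = {df:.1f}")
```

The resulting t and df would then be checked against a t-distribution
table or a statistics library to get a p-value. With 10 runs per
configuration, as in the tests described above, df falls between 9 and 18.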