The following set of patches improves the performance of the blit-copy functions for Radeon GPUs based on R600, R700, Evergreen and NI ASICs. The improvement rests on two changes: using tiled mode access (which, for copying bo's, can be used regardless of whether the content is tiled or not), and segmenting the memory block being copied into rectangles whose edge ratio is between 1:1 and 1:2. This maximizes the number of PCIe transactions that use the maximum payload size (typically 128 bytes) and also creates a memory access pattern that is more favorable for both VRAM and host DRAM than what's currently in the kernel.

To come up with the new blit-copy code, I did a lot of PCIe traffic analysis with a bus analyzer and also had many discussions with Alex, trying to explain what's going on (thanks to Alex for his time).

At the end of this note are the results of some benchmarks that I ran with various GPUs (all in the same host: Intel i7 CPU, X58 chipset, three DRAM channels). To run the tests on your machine, load the radeon module with the 'benchmark=1 pcie_gen2=1' parameters.

The most significant improvement is in the upstream (VRAM to GART) direction, because that's where the PCIe transactions were fragmented and where the memory access pattern created a lot of backpressure from the host. It is also interesting that high-end devices (e.g. Cayman) show the least improvement and were the worst to begin with. This is because high-end devices copy more tiles in parallel, which in turn can create bank conflicts in host memory and cause the host to do lots of bank-close/precharge/bank-open cycles.

As an added "bonus", I also did some code cleanup and consolidated the repeated code into a common function, so r600 and evergreen/NI parts now share the blit-copy code. I also expanded the benchmark coverage: the module now takes a benchmark parameter value between 1 and 8, and each value runs a different benchmark.
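To illustrate the rectangle segmentation mentioned above, here is a minimal sketch in plain C. It is not the code from the patches: the names (blit_rect, pick_rect, count_rects), the MAX_EDGE limit and the restriction to power-of-two edges are all assumptions made for illustration. It splits a linear block of copy units into rectangles whose edge ratio is always 1:1 or 2:1.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical cap on a rectangle edge, in units; not taken from the patches. */
#define MAX_EDGE 8192u

struct blit_rect {
	uint32_t w;
	uint32_t h;
};

/*
 * Pick the largest power-of-two rectangle with w == h or w == 2 * h
 * whose area does not exceed 'units' (and whose edges respect MAX_EDGE).
 */
static struct blit_rect pick_rect(uint64_t units)
{
	struct blit_rect r = { 1, 1 };

	assert(units > 0);

	/* Alternately double width and height while the area still fits. */
	for (;;) {
		if (r.w == r.h) {
			if (r.w >= MAX_EDGE || (uint64_t)r.w * 2 * r.h > units)
				break;
			r.w *= 2;
		} else {
			if ((uint64_t)r.w * r.h * 2 > units)
				break;
			r.h *= 2;
		}
	}
	return r;
}

/* Count how many rectangles are needed to cover 'units' completely. */
static unsigned int count_rects(uint64_t units)
{
	unsigned int n = 0;

	while (units > 0) {
		struct blit_rect r = pick_rect(units);

		units -= (uint64_t)r.w * r.h;
		n++;
	}
	return n;
}
```

A caller would loop the same way count_rects() does, issuing one blit per rectangle and advancing the source and destination offsets by w * h units each time. What a "unit" is (a pixel, a tile, a GPU page) is left open here and would depend on the surface format actually used for the copy.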
For details, see the commit log messages and the code. I have been running with these patches for a few months (rebasing them to drm-core-next as the public git progressed), and I used them in a system setup that does *many* copies of this kind (and does them frequently); I have not seen any instabilities introduced by these patches. I also verified the correctness of the copy using the test=1 parameter on each GPU that I had, and the test passed.

I would welcome some feedback, and if you run the benchmarks with the new blit code, I would very much like to hear what kind of improvement you are seeing.

BENCHMARK RESULTS:
==================

1) VRAM to GTT
==============

Card (ASIC)     VRAM          Before  After
-------------------------------------------
5570 (Redwood)  DDR3 1600MHz     454   3912
6450 (Caicos)   DDR5 3200MHz    3718   5090
6570 (Turks)    DDR3 1800MHz     484   4144
5450 (Cedar)    DDR3 1600MHz    3679   5090
5450 (Cedar)    DDR2 800MHz     2695   4639
E4690 (RV730)   DDR3 1400MHz     485   4969
E6760 (Turks)   DDR5 3200MHz     474   4177
V5700 (RV730)   DDR3 ????MHz     488   4297
2260 (RV620)    DDR2 ????MHz     494   3093
6870 (Barts)    DDR5 4200MHz     475   1113
6970 (Cayman)   DDR5 4200MHz     473    710

2) GTT to VRAM
==============

Card (ASIC)     VRAM          Before  After
-------------------------------------------
5570 (Redwood)  DDR3 1600MHz    3158   3360
6450 (Caicos)   DDR5 3200MHz    2995   3393
6570 (Turks)    DDR3 1800MHz    3039   3339
5450 (Cedar)    DDR3 1600MHz    3246   3404
5450 (Cedar)    DDR2 800MHz     2614   3371
E4690 (RV730)   DDR3 1400MHz    3084   3426
E6760 (Turks)   DDR5 3200MHz    2443   2570
V5700 (RV730)   DDR3 ????MHz    3187   3506
2260 (RV620)    DDR2 ????MHz     584   3246
6870 (Barts)    DDR5 4200MHz    2472   2601
6970 (Cayman)   DDR5 4200MHz    2460   2737