Re: Slow memory access when using OpenCL without X11

Hi Lauri,

Thanks for your persistence. Seeing that this is reproducible on several boards with up-to-date BIOS is really helpful and gives me some confidence that it's more than a weird vendor or board-specific corner case and that we should be able to reproduce it. Yong is going to start looking into this problem.

Regards,
  Felix

On 3/14/2019 12:41 PM, Lauri Ehrenpreis wrote:
Yes, it affects this a bit, but it doesn't bring the speed up to the "normal" level. I got the best results with "profile_peak": the memcpy speed on the CPU is then roughly 40% of what it is without OpenCL initialization:

 echo "profile_peak" > /sys/class/drm/card0/device/power_dpm_force_performance_level
./cl_slow_test 1 5
got 1 platforms 1 devices
speed 3710.360352 avg 3710.360352 mbytes/s
speed 3713.660400 avg 3712.010254 mbytes/s
speed 3797.630859 avg 3740.550537 mbytes/s
speed 3708.004883 avg 3732.414062 mbytes/s
speed 3796.403076 avg 3745.211914 mbytes/s

Without calling clCreateContext:
./cl_slow_test 0 5
speed 7299.201660 avg 7299.201660 mbytes/s
speed 9298.841797 avg 8299.021484 mbytes/s
speed 9360.181641 avg 8652.742188 mbytes/s
speed 9004.759766 avg 8740.746094 mbytes/s
speed 9414.607422 avg 8875.518555 mbytes/s
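
For reference, the ratio between the two runs can be worked out from the final "avg" values. This is a minimal sketch, assuming the output format shown above ("speed X avg Y mbytes/s"); last_avg and slowdown_ratio are hypothetical helper names and are not part of cl_slow_test.

```shell
#!/bin/sh
# Hypothetical helper: print the last "avg" value from cl_slow_test-style
# output read on stdin (lines of the form "speed X avg Y mbytes/s").
last_avg() {
    awk '$1 == "speed" && $3 == "avg" { v = $4 } END { if (v != "") print v }'
}

# Hypothetical helper: ratio of two averages, rounded to two decimals.
slowdown_ratio() {
    awk -v slow="$1" -v fast="$2" 'BEGIN { printf "%.2f\n", slow / fast }'
}

# Usage (assumes ./cl_slow_test is in the current directory):
#   with_cl=$(./cl_slow_test 1 5 | last_avg)
#   without_cl=$(./cl_slow_test 0 5 | last_avg)
#   slowdown_ratio "$with_cl" "$without_cl"
```

With the final averages above (3745.211914 and 8875.518555) the ratio comes out at 0.42.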

--
Lauri

On Thu, Mar 14, 2019 at 5:46 PM Ernst Sjöstrand <ernstp@xxxxxxxxx> wrote:
Does
echo high > /sys/class/drm/card0/device/power_dpm_force_performance_level
or setting cpu scaling governor to performance affect it at all?
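
For the second suggestion, here is a minimal sketch of switching every core to the performance governor through the standard cpufreq sysfs files. The function name and the optional root argument (there only so the function can be pointed at a scratch directory) are assumptions for illustration; writing to the real files needs root.

```shell
#!/bin/sh
# Hypothetical helper: write "performance" to every per-CPU scaling_governor
# file under the given root (default: the real sysfs location).
set_performance_governor() {
    root="${1:-/sys/devices/system/cpu}"
    for gov in "$root"/cpu[0-9]*/cpufreq/scaling_governor; do
        # Skip the unexpanded pattern when no cpufreq files exist.
        [ -e "$gov" ] || continue
        echo performance > "$gov"
    done
}

# Usage (as root):  set_performance_governor
# Check the result: cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
```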

Regards
//Ernst

On Thu, 14 Mar 2019 at 14:31, Lauri Ehrenpreis <laurioma@xxxxxxxxx> wrote:
>
> I have now also tried these two boards:
> https://www.asrock.com/MB/AMD/Fatal1ty%20B450%20Gaming-ITXac/index.asp
> https://www.msi.com/Motherboard/B450I-GAMING-PLUS-AC
>
> Both are using the latest BIOS, Ubuntu 18.10, and kernel https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.0.2/
>
> There are some differences in dmesg (the ASRock board shows an amdgpu assert) but otherwise the results are exactly the same.
> In a desktop environment cl_slow_test runs fast; over an ssh terminal it doesn't. If I move the mouse, it starts running fast in the terminal as well.
>
> So one can't use OpenCL without a monitor and a desktop environment running, and this happens with 2 different chipsets (B350 & B450), the latest BIOS from 3 different vendors, the latest kernel and the latest ROCm. This doesn't look like an edge case with an unusual setup to me.
>
> Attached dmesg, dmidecode, and clinfo from both boards.
>
> --
> Lauri
>
> On Wed, Mar 13, 2019 at 10:15 PM Lauri Ehrenpreis <laurioma@xxxxxxxxx> wrote:
>>
>> To reproduce, only the tiny cl_slow_test.cpp attached to the first e-mail is needed.
>>
>> System information is as follows:
>> CPU: Ryzen5 2400G
>> Main board: Gigabyte AMD B450 AORUS mini itx: https://www.gigabyte.com/Motherboard/B450-I-AORUS-PRO-WIFI-rev-10#kf
>> BIOS: F5 8.47 MB 2019/01/25 (latest)
>> Kernel: https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.0/  (amd64)
>> OS: Ubuntu 18.04 LTS
>> rocm-opencl-dev installation:
>> wget -qO - http://repo.radeon.com/rocm/apt/debian/rocm.gpg.key | sudo apt-key add -
>> echo 'deb [arch=amd64] http://repo.radeon.com/rocm/apt/debian/ xenial main' | sudo tee /etc/apt/sources.list.d/rocm.list
>> sudo apt install rocm-opencl-dev
>>
>> Also exactly the same issue happens with this board: https://www.gigabyte.com/Motherboard/GA-AB350-Gaming-3-rev-1x#kf
>>
>> I have MSI and ASRock mini-ITX boards ready as well. So far I haven't got amdgpu & OpenCL working there, but I'll try again tomorrow.
>>
>> --
>> Lauri
>>
>>
>> On Wed, Mar 13, 2019 at 8:51 PM Kuehling, Felix <Felix.Kuehling@xxxxxxx> wrote:
>>>
>>> Hi Lauri,
>>>
>>> I still think the SMU is doing something funny, but rocm-smi isn't
>>> showing enough information to really see what's going on.
>>>
>>> On APUs the SMU firmware is embedded in the system BIOS. Unlike discrete
>>> GPUs, the SMU firmware is not loaded by the driver. You could try
>>> updating your system BIOS to the latest version available from your main
>>> board vendor and see if that makes a difference. It may include a newer
>>> version of the SMU firmware, potentially with a fix.
>>>
>>> If that doesn't help, we'd have to reproduce the problem in house to see
>>> what's happening, which may require the same main board and BIOS version
>>> you're using. We can ask our SMU firmware team if they've ever
>>> encountered your type of problem. But I don't want to give you too much
>>> hope. It's a tricky problem involving HW, firmware and multiple driver
>>> components in a fairly unusual configuration.
>>>
>>> Regards,
>>>    Felix
>>>
>>> On 2019-03-13 7:28 a.m., Lauri Ehrenpreis wrote:
>>> > What I observe is that moving the mouse made the memory speed go up
>>> > and also it made mclk=1200Mhz in rocm-smi output.
>>> > However if I force mclk to 1200Mhz myself then memory speed is still
>>> > slow.
>>> >
>>> > So rocm-smi output when memory speed went fast due to mouse movement:
>>> > rocm-smi
>>> > ========================        ROCm System Management Interface        ========================
>>> > ================================================================================================
>>> > GPU   Temp   AvgPwr   SCLK    MCLK    PCLK      Fan     Perf    PwrCap   SCLK OD   MCLK OD  GPU%
>>> > GPU[0] : WARNING: Empty SysFS value: pclk
>>> > GPU[0] : WARNING: Unable to read /sys/class/drm/card0/device/gpu_busy_percent
>>> > 0     44.0c  N/A      400Mhz  1200Mhz N/A       0%      manual  N/A      0%        0%       N/A
>>> > ================================================================================================
>>> > ========================               End of ROCm SMI Log              ========================
>>> >
>>> > And rocm-smi output when I forced memclk=1200MHz myself:
>>> > rocm-smi --setmclk 2
>>> > rocm-smi
>>> > ========================        ROCm System Management Interface        ========================
>>> > ================================================================================================
>>> > GPU   Temp   AvgPwr   SCLK    MCLK    PCLK      Fan     Perf    PwrCap   SCLK OD   MCLK OD  GPU%
>>> > GPU[0] : WARNING: Empty SysFS value: pclk
>>> > GPU[0] : WARNING: Unable to read /sys/class/drm/card0/device/gpu_busy_percent
>>> > 0     39.0c  N/A      400Mhz  1200Mhz N/A       0%      manual  N/A      0%        0%       N/A
>>> > ================================================================================================
>>> > ========================               End of ROCm SMI Log              ========================
>>> >
>>> > So only difference is that temperature shows 44c when memory speed was
>>> > fast and 39c when it was slow. But mclk was 1200MHz and sclk was
>>> > 400MHz in both cases.
>>> > Can it be that rocm-smi just has a bug in reporting and mclk was not
>>> > actually 1200MHz when I forced it with rocm-smi --setmclk 2 ?
>>> > That would explain the different behaviour..
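
One way to cross-check what rocm-smi prints is to read the amdgpu sysfs file pp_dpm_mclk directly; the driver marks the currently selected level with a trailing "*". This is a minimal sketch, assuming card0 is the APU; active_mclk is a hypothetical helper name. It reads the driver's own view of the clock, so it can only rule out a display problem in rocm-smi, not a mismatch between the driver and the SMU.

```shell
#!/bin/sh
# Hypothetical helper: print the memory clock level that amdgpu marks as
# active (the line ending in "*") in a pp_dpm_mclk-style file.
active_mclk() {
    file="${1:-/sys/class/drm/card0/device/pp_dpm_mclk}"
    # Lines look like "2: 1200Mhz *"; print the level index and the clock.
    awk '$NF == "*" { sub(":", "", $1); print $1, $2 }' "$file"
}

# Usage: active_mclk
```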
>>> >
>>> > If so, is there a programmatic way to really guarantee the
>>> > high-speed mclk? Basically I want my program to do something similar
>>> > to what happens when I move the mouse in the desktop env, and this way
>>> > guarantee the normal memory speed each time the program starts.
>>> >
>>> > --
>>> > Lauri
>>> >
>>> >
>>> > On Tue, Mar 12, 2019 at 11:36 PM Deucher, Alexander <Alexander.Deucher@xxxxxxx> wrote:
>>> >
>>> >     Forcing the sclk and mclk high may impact the CPU frequency since
>>> >     they share TDP.
>>> >
>>> >     Alex
>>> >     ------------------------------------------------------------------------
>>> >     *From:* amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> on behalf of Lauri Ehrenpreis <laurioma@xxxxxxxxx>
>>> >     *Sent:* Tuesday, March 12, 2019 5:31 PM
>>> >     *To:* Kuehling, Felix
>>> >     *Cc:* Tom St Denis; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>> >     *Subject:* Re: Slow memory access when using OpenCL without X11
>>> >
>>> >     However it's not only related to mclk and sclk. I tried this:
>>> >     rocm-smi  --setsclk 2
>>> >     rocm-smi  --setmclk 3
>>> >     rocm-smi
>>> >     ========================        ROCm System Management Interface        ========================
>>> >     ================================================================================================
>>> >     GPU   Temp   AvgPwr   SCLK    MCLK    PCLK          Fan     Perf    PwrCap   SCLK OD  MCLK OD  GPU%
>>> >     GPU[0] : WARNING: Empty SysFS value: pclk
>>> >     GPU[0] : WARNING: Unable to read /sys/class/drm/card0/device/gpu_busy_percent
>>> >     0     34.0c  N/A      1240Mhz 1333Mhz N/A           0%      manual  N/A      0%       0%       N/A
>>> >     ================================================================================================
>>> >     ========================               End of ROCm SMI Log              ========================
>>> >
>>> >     ./cl_slow_test 1
>>> >     got 1 platforms 1 devices
>>> >     speed 3919.777100 avg 3919.777100 mbytes/s
>>> >     speed 3809.373291 avg 3864.575195 mbytes/s
>>> >     speed 585.796814 avg 2771.649170 mbytes/s
>>> >     speed 188.721848 avg 2125.917236 mbytes/s
>>> >     speed 188.916367 avg 1738.517090 mbytes/s
>>> >
>>> >     So despite forcing max sclk and mclk the memory speed is still slow..
>>> >
>>> >     --
>>> >     Lauri
>>> >
>>> >
>>> >     On Tue, Mar 12, 2019 at 11:21 PM Lauri Ehrenpreis <laurioma@xxxxxxxxx> wrote:
>>> >
>>> >         In the case when memory is slow, rocm-smi outputs this:
>>> >         ========================        ROCm System Management Interface        ========================
>>> >         ================================================================================================
>>> >         GPU   Temp   AvgPwr   SCLK    MCLK    PCLK          Fan     Perf    PwrCap   SCLK OD  MCLK OD  GPU%
>>> >         GPU[0] : WARNING: Empty SysFS value: pclk
>>> >         GPU[0] : WARNING: Unable to read /sys/class/drm/card0/device/gpu_busy_percent
>>> >         0     30.0c  N/A      400Mhz  933Mhz  N/A           0%      auto    N/A      0%       0%       N/A
>>> >         ================================================================================================
>>> >         ========================               End of ROCm SMI Log              ========================
>>> >
>>> >         The normal memory speed case gives the following:
>>> >         ========================        ROCm System Management Interface        ========================
>>> >         ================================================================================================
>>> >         GPU   Temp   AvgPwr   SCLK    MCLK    PCLK          Fan     Perf    PwrCap   SCLK OD  MCLK OD  GPU%
>>> >         GPU[0] : WARNING: Empty SysFS value: pclk
>>> >         GPU[0] : WARNING: Unable to read /sys/class/drm/card0/device/gpu_busy_percent
>>> >         0     35.0c  N/A      400Mhz  1200Mhz N/A           0%      auto    N/A      0%       0%       N/A
>>> >         ================================================================================================
>>> >         ========================               End of ROCm SMI Log              ========================
>>> >
>>> >         So there is a difference in MCLK - can this cause such a huge
>>> >         slowdown?
>>> >
>>> >         --
>>> >         Lauri
>>> >
>>> >         On Tue, Mar 12, 2019 at 6:39 PM Kuehling, Felix <Felix.Kuehling@xxxxxxx> wrote:
>>> >
>>> >             [adding the list back]
>>> >
>>> >             I'd suspect a problem related to memory clock. This is an APU where
>>> >             system memory is shared with the CPU, so if the SMU changes memory
>>> >             clocks that would affect CPU memory access performance. If the problem
>>> >             only occurs when OpenCL is running, then the compute power profile could
>>> >             have an effect here.
>>> >
>>> >             Lauri, can you monitor the clocks during your tests using rocm-smi?
>>> >
>>> >             Regards,
>>> >                Felix
>>> >
>>> >             On 2019-03-11 1:15 p.m., Tom St Denis wrote:
>>> >             > Hi Lauri,
>>> >             >
>>> >             > I don't have ROCm installed locally (not on that team at AMD) but I
>>> >             > can rope in some of the KFD folk and see what they say :-).
>>> >             >
>>> >             > (in the meantime I should look into installing the ROCm stack on my
>>> >             > Ubuntu disk for experimentation...).
>>> >             >
>>> >             > Only other thing that comes to mind is some sort of stutter due to
>>> >             > power/clock gating (or gfx off/etc).  But that typically affects the
>>> >             > display/gpu side not the CPU side.
>>> >             >
>>> >             > Felix:  Any known issues with Raven and ROCm interacting over memory
>>> >             > bus performance?
>>> >             >
>>> >             > Tom
>>> >             >
>>> >             > On Mon, Mar 11, 2019 at 12:56 PM Lauri Ehrenpreis <laurioma@xxxxxxxxx> wrote:
>>> >             >
>>> >             >     Hi!
>>> >             >
>>> >             >     The 100x memory slowdown is hard to believe indeed. I attached the
>>> >             >     test program to my first e-mail; it depends only on the
>>> >             >     rocm-opencl-dev package. Would you mind compiling it and checking
>>> >             >     if it slows down memory for you as well?
>>> >             >
>>> >             >     steps:
>>> >             >     1) g++ cl_slow_test.cpp -o cl_slow_test -I /opt/rocm/opencl/include/ -L /opt/rocm/opencl/lib/x86_64/ -lOpenCL
>>> >             >     2) log out from the desktop env and disconnect HDMI/DisplayPort etc.
>>> >             >     3) log in over ssh
>>> >             >     4) run the program: ./cl_slow_test 1
>>> >             >
>>> >             >     For me it reproduced even without step 2, but less
>>> >             >     reliably. Moving the mouse, for example, could make the memory speed
>>> >             >     fast again.
>>> >             >
>>> >             >     --
>>> >             >     Lauri
>>> >             >
>>> >             >
>>> >             >
>>> >             >     On Mon, Mar 11, 2019 at 6:33 PM Tom St Denis <tstdenis82@xxxxxxxxx> wrote:
>>> >             >
>>> >             >         Hi Lauri,
>>> >             >
>>> >             >         There's really no connection between the two other than they
>>> >             >         run in the same package.  I too run a 2400G (as my
>>> >             >         workstation) and I got the same ~6.6GB/sec transfer rate but
>>> >             >         without a CL app running ...  The only logical reason is your
>>> >             >         CL app is bottlenecking the APU's memory bus but you claim
>>> >             >         "simply opening a context is enough" so something else is
>>> >             >         going on.
>>> >             >
>>> >             >         Your last reply though says "with it running in the
>>> >             >         background" so it's entirely possible the CPU isn't busy but
>>> >             >         the package memory controller (shared between both the CPU and
>>> >             >         GPU) is busy.  For instance running xonotic in a 1080p window
>>> >             >         on my 4K display reduced the memory test to 5.8GB/sec and
>>> >             >         that's hardly a heavy memory bound GPU app.
>>> >             >
>>> >             >         The only other possible connection is the GPU is generating so
>>> >             >         much heat that it's throttling the package which is also
>>> >             >         unlikely if you have a proper HSF attached (I use the ones
>>> >             >         that came in the retail boxes).
>>> >             >
>>> >             >         Cheers,
>>> >             >         Tom
>>> >             >
>>> >
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
