On 02/16/2017 03:00 PM, Bridgman, John wrote:
> Any objections to authorizing Oded to post the kfdtest binary he is using to some public place (if not there already) so others (like Andres) can test changes which touch on amdkfd?
>
> We should check it for embarrassing symbols but otherwise it should be OK.

Someone was up late for a deadline? lol

>
> That said, since we are getting perilously close to actually sending dGPU support changes upstream, we will need (IMO) to maintain a sanitized source repo for kfdtest as well... sharing the binary just gets us started.
>

Hi John,

Yes, this is the sort of thing I've been referring to for some time now. We definitely need some kind of centralized mechanism to test/validate kfd stuff, so if you can get this out that would be great!

A binary would be a start; I am sure we can make do, and it's certainly better than nothing. However, source, much like what happened with UMR, would of course be ideal.

I suggest it would perhaps be good to arrange some kind of IRC meeting regarding kfd, since it seems there is a bit of fragmented effort here. I have my own ioctl()s locally for pinning for my own project, which I am not sure are suitable to just upstream, as AMD has its own take, so what should we do? I have heard so much about dGPU support for a couple of years now but have only seen bits thrown over the wall. Can we get a more serious incremental approach happening ASAP?

I created #amdkfd on freenode some time ago, where a couple of interested academics and users hang out.

Kind Regards,
Edward.

> Thanks,
> John
>
>> -----Original Message-----
>> From: Oded Gabbay [mailto:oded.gabbay at gmail.com]
>> Sent: Friday, February 10, 2017 12:57 PM
>> To: Andres Rodriguez
>> Cc: Kuehling, Felix; Bridgman, John; amd-gfx at lists.freedesktop.org; Deucher, Alexander; Jay Cornwall
>> Subject: Re: Change queue/pipe split between amdkfd and amdgpu
>>
>> I don't have a repo, nor do I have the source code.
>> It is a tool that we developed inside AMD (when I was working there), and
>> after I left AMD I got permission to use the binary for regression testing.
>>
>> Oded
>>
>> On Fri, Feb 10, 2017 at 6:33 PM, Andres Rodriguez <andresx7 at gmail.com> wrote:
>>> Hey Oded,
>>>
>>> Where can I find a repo with kfdtest?
>>>
>>> I tried looking here but couldn't find it:
>>>
>>> https://cgit.freedesktop.org/~gabbayo/
>>>
>>> -Andres
>>>
>>>
>>>
>>> On 2017-02-10 05:35 AM, Oded Gabbay wrote:
>>>>
>>>> So the warning in dmesg is gone of course, but the test (that I
>>>> mentioned in the previous email) still fails, and this time it caused
>>>> the kernel to crash. In addition, now other tests fail as well, e.g.
>>>> KFDEventTest.SignalEvent
>>>>
>>>> I honestly suggest taking some time to debug this patch-set on an
>>>> actual Kaveri machine and then re-sending the patches.
>>>>
>>>> Thanks,
>>>> Oded
>>>>
>>>> log of crash from KFDQMTest.CreateMultipleCpQueues:
>>>>
>>>> [ 160.900137] kfd: qcm fence wait loop timeout expired
>>>> [ 160.900143] kfd: the cp might be in an unrecoverable state due to an unsuccessful queues preemption
>>>> [ 160.916765] show_signal_msg: 36 callbacks suppressed
>>>> [ 160.916771] kfdtest[2498]: segfault at 100007f8a ip 00007f8ae932ee5d sp 00007ffc52219cd0 error 4 in libhsakmt-1.so.0.0.1[7f8ae932b000+8000]
>>>> [ 163.152229] kfd: qcm fence wait loop timeout expired
>>>> [ 163.152250] BUG: unable to handle kernel NULL pointer dereference at 000000000000005a
>>>> [ 163.152299] IP: kfd_get_process_device_data+0x6/0x30 [amdkfd]
>>>> [ 163.152323] PGD 2333aa067
>>>> [ 163.152323] PUD 230f64067
>>>> [ 163.152335] PMD 0
>>>>
>>>> [ 163.152364] Oops: 0000 [#1] SMP
>>>> [ 163.152379] Modules linked in: joydev edac_mce_amd edac_core input_leds kvm_amd snd_hda_codec_realtek kvm irqbypass snd_hda_codec_generic snd_hda_codec_hdmi snd_hda_intel snd_hda_codec crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_core snd_hwdep pcbc snd_pcm aesni_intel snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq aes_x86_64 crypto_simd snd_seq_device glue_helper cryptd snd_timer snd fam15h_power k10temp soundcore i2c_piix4 shpchp tpm_infineon mac_hid parport_pc ppdev nfsd auth_rpcgss nfs_acl lockd lp grace sunrpc parport autofs4 hid_logitech_hidpp hid_logitech_dj hid_generic usbhid hid uas usb_storage amdkfd amd_iommu_v2 radeon i2c_algo_bit ttm drm_kms_helper syscopyarea ahci sysfillrect sysimgblt libahci fb_sys_fops drm r8169 mii fjes video
>>>> [ 163.152668] CPU: 3 PID: 2498 Comm: kfdtest Not tainted 4.10.0-rc5+ #3
>>>> [ 163.152695] Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./F2A88XM-D3H, BIOS F5 01/09/2014
>>>> [ 163.152735] task: ffff995e73d16580 task.stack: ffffb41144458000
>>>> [ 163.152764] RIP: 0010:kfd_get_process_device_data+0x6/0x30 [amdkfd]
>>>> [ 163.152790] RSP: 0018:ffffb4114445bab0 EFLAGS: 00010246
>>>> [ 163.152812] RAX: ffffffffffffffea RBX: ffff995e75909c00 RCX: 0000000000000000
>>>> [ 163.152841] RDX: 0000000000000000 RSI: ffffffffffffffea RDI: ffff995e75909600
>>>> [ 163.152869] RBP: ffffb4114445bae0 R08: 00000000000252a5 R09: 0000000000000414
>>>> [ 163.152898] R10: 0000000000000000 R11: ffffffffb412d38d R12: 00000000ffffffc2
>>>> [ 163.152926] R13: 0000000000000000 R14: ffff995e75909ca8 R15: ffff995e75909c00
>>>> [ 163.152956] FS: 00007f8ae975e740(0000) GS:ffff995e7ed80000(0000) knlGS:0000000000000000
>>>> [ 163.152988] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> [ 163.153012] CR2: 000000000000005a CR3: 00000002216ab000 CR4: 00000000000406e0
>>>> [ 163.153041] Call Trace:
>>>> [ 163.153059] ? destroy_queues_cpsch+0x166/0x190 [amdkfd]
>>>> [ 163.153086] execute_queues_cpsch+0x2e/0xc0 [amdkfd]
>>>> [ 163.153113] destroy_queue_cpsch+0xbd/0x140 [amdkfd]
>>>> [ 163.153139] pqm_destroy_queue+0x111/0x1d0 [amdkfd]
>>>> [ 163.153164] pqm_uninit+0x3f/0xb0 [amdkfd]
>>>> [ 163.153186] kfd_unbind_process_from_device+0x51/0xd0 [amdkfd]
>>>> [ 163.153214] iommu_pasid_shutdown_callback+0x20/0x30 [amdkfd]
>>>> [ 163.153239] mn_release+0x37/0x70 [amd_iommu_v2]
>>>> [ 163.153261] __mmu_notifier_release+0x44/0xc0
>>>> [ 163.153281] exit_mmap+0x15a/0x170
>>>> [ 163.153297] ? __wake_up+0x44/0x50
>>>> [ 163.153314] ? exit_robust_list+0x5c/0x110
>>>> [ 163.153333] mmput+0x57/0x140
>>>> [ 163.153347] do_exit+0x26b/0xb30
>>>> [ 163.153362] do_group_exit+0x43/0xb0
>>>> [ 163.153379] get_signal+0x293/0x620
>>>> [ 163.153396] do_signal+0x37/0x760
>>>> [ 163.153411] ? print_vma_addr+0x82/0x100
>>>> [ 163.153429] ? vprintk_default+0x29/0x50
>>>> [ 163.153447] ? bad_area+0x46/0x50
>>>> [ 163.153463] ? __do_page_fault+0x3c7/0x4e0
>>>> [ 163.153481] exit_to_usermode_loop+0x76/0xb0
>>>> [ 163.153500] prepare_exit_to_usermode+0x2f/0x40
>>>> [ 163.153521] retint_user+0x8/0x10
>>>> [ 163.153536] RIP: 0033:0x7f8ae932ee5d
>>>> [ 163.153551] RSP: 002b:00007ffc52219cd0 EFLAGS: 00010202
>>>> [ 163.153573] RAX: 0000000000000003 RBX: 0000000100007f8a RCX: 00007ffc52219d00
>>>> [ 163.153602] RDX: 00007f8ae9534220 RSI: 00007f8ae8b5eb28 RDI: 0000000100007f8a
>>>> [ 163.153630] RBP: 00007ffc52219d20 R08: 0000000001cc1890 R09: 0000000000000000
>>>> [ 163.153659] R10: 0000000000000027 R11: 00007f8ae932ee10 R12: 0000000001cc52a0
>>>> [ 163.153687] R13: 00007ffc5221a200 R14: 0000000000000021 R15: 0000000000000000
>>>> [ 163.153716] Code: e0 04 00 00 48 3b 91 f0 03 00 00 74 01 c3 55 48 89 e5 e8 2e f9 ff ff 5d c3 66 90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 <48> 8b 46 70 48 83 c6 70 48 89 e5 48 39 f0 74 16 48 3b 78 10 75
>>>> [ 163.153818] RIP: kfd_get_process_device_data+0x6/0x30 [amdkfd] RSP: ffffb4114445bab0
>>>> [ 163.153848] CR2: 000000000000005a
>>>> [ 163.160389] ---[ end trace f6a8177c7119c1f5 ]---
>>>> [ 163.160390] Fixing recursive fault but reboot is needed!
>>>>
>>>> On Thu, Feb 9, 2017 at 10:38 PM, Andres Rodriguez <andresx7 at gmail.com> wrote:
>>>>>
>>>>> Hey Oded,
>>>>>
>>>>> Sorry to be a nuisance, but if you have everything still setup could
>>>>> you give this fix a quick go?
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
>>>>> index 5321d18..9f70ee0 100644
>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
>>>>> @@ -667,7 +667,7 @@ static int set_sched_resources(struct device_queue_manager *dqm)
>>>>>                 /* This situation may be hit in the future if a new HW
>>>>>                  * generation exposes more than 64 queues. If so, the
>>>>>                  * definition of res.queue_mask needs updating */
>>>>> -               if (WARN_ON(i > sizeof(res.queue_mask))) {
>>>>> +               if (WARN_ON(i > (sizeof(res.queue_mask)*8))) {
>>>>>                         pr_err("Invalid queue enabled by amdgpu: %d\n", i);
>>>>>                         break;
>>>>>                 }
>>>>>
>>>>> John/Felix,
>>>>>
>>>>> Any chance I could borrow a carrizo/kaveri for a few days? Or maybe
>>>>> you could help me run some final tests on this patch series?
>>>>>
>>>>> - Andres
>>>>>
>>>>>
>>>>>
>>>>> On 2017-02-09 03:11 PM, Oded Gabbay wrote:
>>>>>>
>>>>>> Andres,
>>>>>>
>>>>>> I tried your patches on Kaveri with airlied's drm-next branch.
>>>>>> I used radeon+amdkfd
>>>>>>
>>>>>> The following test failed: KFDQMTest.CreateMultipleCpQueues
>>>>>> However, I can't debug it because I don't have the sources of kfdtest.
>>>>>>
>>>>>> In dmesg, I saw the following warning during boot:
>>>>>> WARNING: CPU: 0 PID: 150 at drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c:670 start_cpsch+0xc5/0x220 [amdkfd]
>>>>>> [ 4.393796] Modules linked in: hid_logitech_hidpp hid_logitech_dj hid_generic usbhid hid uas usb_storage amdkfd amd_iommu_v2 radeon(+) i2c_algo_bit ttm drm_kms_helper syscopyarea ahci sysfillrect sysimgblt libahci fb_sys_fops drm r8169 mii fjes video
>>>>>> [ 4.393811] CPU: 0 PID: 150 Comm: systemd-udevd Not tainted 4.10.0-rc5+ #1
>>>>>> [ 4.393811] Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./F2A88XM-D3H, BIOS F5 01/09/2014
>>>>>> [ 4.393812] Call Trace:
>>>>>> [ 4.393818] dump_stack+0x63/0x90
>>>>>> [ 4.393822] __warn+0xcb/0xf0
>>>>>> [ 4.393823] warn_slowpath_null+0x1d/0x20
>>>>>> [ 4.393830] start_cpsch+0xc5/0x220 [amdkfd]
>>>>>> [ 4.393836] ? initialize_cpsch+0xa0/0xb0 [amdkfd]
>>>>>> [ 4.393841] kgd2kfd_device_init+0x375/0x490 [amdkfd]
>>>>>> [ 4.393883] radeon_kfd_device_init+0xaf/0xd0 [radeon]
>>>>>> [ 4.393911] radeon_driver_load_kms+0x11e/0x1f0 [radeon]
>>>>>> [ 4.393933] drm_dev_register+0x14a/0x200 [drm]
>>>>>> [ 4.393946] drm_get_pci_dev+0x9d/0x160 [drm]
>>>>>> [ 4.393974] radeon_pci_probe+0xb8/0xe0 [radeon]
>>>>>> [ 4.393976] local_pci_probe+0x45/0xa0
>>>>>> [ 4.393978] pci_device_probe+0x103/0x150
>>>>>> [ 4.393981] driver_probe_device+0x2bf/0x460
>>>>>> [ 4.393982] __driver_attach+0xdf/0xf0
>>>>>> [ 4.393984] ? driver_probe_device+0x460/0x460
>>>>>> [ 4.393985] bus_for_each_dev+0x6c/0xc0
>>>>>> [ 4.393987] driver_attach+0x1e/0x20
>>>>>> [ 4.393988] bus_add_driver+0x1fd/0x270
>>>>>> [ 4.393989] ? 0xffffffffc05c8000
>>>>>> [ 4.393991] driver_register+0x60/0xe0
>>>>>> [ 4.393992] ? 0xffffffffc05c8000
>>>>>> [ 4.393993] __pci_register_driver+0x4c/0x50
>>>>>> [ 4.394007] drm_pci_init+0xeb/0x100 [drm]
>>>>>> [ 4.394008] ? 0xffffffffc05c8000
>>>>>> [ 4.394031] radeon_init+0x98/0xb6 [radeon]
>>>>>> [ 4.394034] do_one_initcall+0x53/0x1a0
>>>>>> [ 4.394037] ? __vunmap+0x81/0xd0
>>>>>> [ 4.394039] ? kmem_cache_alloc_trace+0x152/0x1c0
>>>>>> [ 4.394041] ? vfree+0x2e/0x70
>>>>>> [ 4.394044] do_init_module+0x5f/0x1ff
>>>>>> [ 4.394046] load_module+0x24cc/0x29f0
>>>>>> [ 4.394047] ? __symbol_put+0x60/0x60
>>>>>> [ 4.394050] ? security_kernel_post_read_file+0x6b/0x80
>>>>>> [ 4.394052] SYSC_finit_module+0xdf/0x110
>>>>>> [ 4.394054] SyS_finit_module+0xe/0x10
>>>>>> [ 4.394056] entry_SYSCALL_64_fastpath+0x1e/0xad
>>>>>> [ 4.394058] RIP: 0033:0x7f9cda77c8e9
>>>>>> [ 4.394059] RSP: 002b:00007ffe195d3378 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
>>>>>> [ 4.394060] RAX: ffffffffffffffda RBX: 00007f9cdb8dda7e RCX: 00007f9cda77c8e9
>>>>>> [ 4.394061] RDX: 0000000000000000 RSI: 00007f9cdac7ce2a RDI: 0000000000000013
>>>>>> [ 4.394062] RBP: 00007ffe195d2450 R08: 0000000000000000 R09: 0000000000000000
>>>>>> [ 4.394063] R10: 0000000000000013 R11: 0000000000000246 R12: 00007ffe195d245a
>>>>>> [ 4.394063] R13: 00007ffe195d1378 R14: 0000563f70cc93b0 R15: 0000563f70cba4d0
>>>>>> [ 4.394091] ---[ end trace 9c5af17304d998bb ]---
>>>>>> [ 4.394092] Invalid queue enabled by amdgpu: 9
>>>>>>
>>>>>> I suggest you get a Kaveri/Carrizo machine to debug these issues.
>>>>>>
>>>>>> Until that, I don't think we should merge this patch-set.
>>>>>>
>>>>>> Oded
>>>>>>
>>>>>> On Wed, Feb 8, 2017 at 9:47 PM, Andres Rodriguez <andresx7 at gmail.com> wrote:
>>>>>>>
>>>>>>> Thank you Oded.
>>>>>>>
>>>>>>> - Andres
>>>>>>>
>>>>>>>
>>>>>>> On 2017-02-08 02:32 PM, Oded Gabbay wrote:
>>>>>>>>
>>>>>>>> On Wed, Feb 8, 2017 at 6:23 PM, Andres Rodriguez <andresx7 at gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> Hey Felix,
>>>>>>>>>
>>>>>>>>> Thanks for the pointer to the ROCm mqd commit. I like that the
>>>>>>>>> workarounds are easy to spot. I'll add that to a new patch
>>>>>>>>> series I'm working on for some bug-fixes for perf being lower on
>>>>>>>>> pipes other than pipe 0.
>>>>>>>>>
>>>>>>>>> I haven't tested this yet on kaveri/carrizo. I'm hoping someone
>>>>>>>>> with the HW will be able to give it a go. I put in a few small
>>>>>>>>> hacks to get KFD to boot but do nothing on polaris10.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Andres
>>>>>>>>>
>>>>>>>>> On 2017-02-06 03:20 PM, Felix Kuehling wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Andres,
>>>>>>>>>>
>>>>>>>>>> Thank you for tackling this task. It's more involved than I
>>>>>>>>>> expected, mostly because I didn't have much awareness of the
>>>>>>>>>> MQD management in amdgpu.
>>>>>>>>>>
>>>>>>>>>> I made one comment in a separate message about the unified MQD
>>>>>>>>>> commit function, if you want to bring that more in line with
>>>>>>>>>> our latest ROCm release on github.
>>>>>>>>>>
>>>>>>>>>> Also, were you able to test the upstream KFD with your changes
>>>>>>>>>> on a Kaveri or Carrizo?
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Felix
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 17-02-03 11:51 PM, Andres Rodriguez wrote:
>>>>>>>>>>>
>>>>>>>>>>> The current queue/pipe split policy is for amdgpu to take the
>>>>>>>>>>> first pipe of MEC0 and leave the rest for amdkfd to use. This
>>>>>>>>>>> policy is taken as an assumption in a few areas of the
>>>>>>>>>>> implementation.
>>>>>>>>>>>
>>>>>>>>>>> This patch series aims to allow for flexible/tunable
>>>>>>>>>>> queue/pipe split policies between kgd and kfd. It also updates
>>>>>>>>>>> the queue/pipe split policy to one that allows better compute
>>>>>>>>>>> app concurrency for both drivers.
>>>>>>>>>>>
>>>>>>>>>>> In the process some duplicate code and hardcoded constants
>>>>>>>>>>> were removed.
>>>>>>>>>>>
>>>>>>>>>>> Any suggestions or feedback on improvements welcome.
>>>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> amd-gfx mailing list
>>>>>>>>> amd-gfx at lists.freedesktop.org
>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>>>>
>>>>>>>> Hi Andres,
>>>>>>>> I will try to find sometime to test it on my Kaveri machine.
>>>>>>>>
>>>>>>>> Oded
>>>>>>>
>>>>>>>
>>>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>
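A side note for anyone following the one-line fix Andres posted above: the point is that sizeof() counts bytes, not bits. Assuming res.queue_mask is a 64-bit bitmask (as the quoted comment about "more than 64 queues" implies), the old check rejected any queue index above 8, which is exactly why Oded's boot log shows "Invalid queue enabled by amdgpu: 9". The small user-space C sketch below uses illustrative names only (it is not the kernel code) to show the difference:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
            uint64_t queue_mask = 0;  /* stand-in for res.queue_mask: one bit per queue */
            unsigned int i = 9;       /* the queue index reported in the dmesg log */

            /* Old check: sizeof() yields 8 (bytes), so 9 > 8 fires a spurious warning. */
            if (i > sizeof(queue_mask))
                    printf("queue %u rejected by the old check\n", i);

            /* Fixed check: sizeof() * 8 yields 64 (bits), so queue 9 is accepted. */
            if (i > sizeof(queue_mask) * 8) {
                    printf("queue %u rejected by the fixed check\n", i);
            } else {
                    queue_mask |= 1ULL << i;  /* enable queue 9 in the mask */
                    printf("queue %u enabled, queue_mask = 0x%llx\n",
                           i, (unsigned long long)queue_mask);
            }

            return 0;
    }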