RE: [PATCH 2/2] drm/amdkfd: use cache GTT buffer for PQ and wb pool

"Liu, Monk" <Monk.Liu@xxxxxxx> · Thu, 14 Nov 2024 08:59:46 +0000

[AMD Official Use Only - AMD Internal Distribution Only]

For WPTR polling mem we didn't find it is stalled data, but for ring buffer we are pretty sure,

We use "0" or "ffffffff" to memset() HIQ before we fulfill the QUERY_STATUS pm4, and we can see when issue hit CP do get the either 0 or "ffffffff" .

Monk Liu | Cloud GPU & Virtualization | AMD

-----Original Message-----
From: Lazar, Lijo <Lijo.Lazar@xxxxxxx>
Sent: Thursday, November 14, 2024 3:13 PM
To: Liu, Monk <Monk.Liu@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx>; Christian König <ckoenig.leichtzumerken@xxxxxxxxx>; Alex Deucher <alexdeucher@xxxxxxxxx>; Zhao, Victor <Victor.Zhao@xxxxxxx>
Cc: amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Yang, Philip <Philip.Yang@xxxxxxx>; Kuehling, Felix <Felix.Kuehling@xxxxxxx>
Subject: Re: [PATCH 2/2] drm/amdkfd: use cache GTT buffer for PQ and wb pool

On 11/14/2024 12:29 PM, Liu, Monk wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
>
>>> QEMU and their virtual drivers uses a virtual PCI MMIO BAR to map buffers from the host into the guest. Those buffers can be both point to the FB BAR as well as system memory.
>
> Well, you are talking about virt-io GL use case? Or mDev solution ? because those conventional device like QXL, NIC, soundcard are purely software emulated and no way to point to the FB BAR ...
>
> I can understand that Host make the final decision on the memory if it is FB can be leveraged by guest as some system memory, but brutely treat all system memory as cache no matter what guest do is ugly ...
>
> Back to our topic, I propagated you and Lijo's statement to customer and they didn't buy it, rational as:
>
> 1) they don't think let host honor the real mapping is correct as they don't have anything like virt-io GL solution in use...
> 2) they cannot suffer risk to revert that KVM patch on their server.
>
> So the thing is they will live with a hotfix patch for now, but they are also asking:
>
> Why cannot AMD make all ring buffer (managed by KMD/KFD) to the same type which is cache, as they know that GFX ring from amdgpu is cache type.
>
> Besides, both customer and CVS team will also keep investigate why
> after the map is honored by guest there is still coherence issue (type
> is USWC, mfence is inserted between doorbell and wptr_polling updates)
>

If this is truly mfence-not-working-as-expected, wondering how any user app which relies on a similar mechanism can reliably work on their systems.

Had already mentioned this before -

Even by the assumption that mfence fixed ordering but data stayed in WC buffer, there is one other hole in this theory. With DOORBELL_MODE=1, any write pointer update is fetched from memory. That write pointer is also maintained in WC type memory. If CP is correctly able to see WPTR update, that means that part of write got updated in system memory.

Anyway, please keep us posted.

Thanks,
Lijo

> Anyway, Lijo, Christian, thanks for your explain and insight !
>
> Monk Liu | Cloud GPU & Virtualization | AMD
>
>
> -----Original Message-----
> From: Koenig, Christian <Christian.Koenig@xxxxxxx>
> Sent: Tuesday, November 12, 2024 5:41 PM
> To: Liu, Monk <Monk.Liu@xxxxxxx>; Christian König
> <ckoenig.leichtzumerken@xxxxxxxxx>; Lazar, Lijo <Lijo.Lazar@xxxxxxx>;
> Alex Deucher <alexdeucher@xxxxxxxxx>; Zhao, Victor
> <Victor.Zhao@xxxxxxx>
> Cc: amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Yang, Philip <Philip.Yang@xxxxxxx>;
> Kuehling, Felix <Felix.Kuehling@xxxxxxx>
> Subject: Re: [PATCH 2/2] drm/amdkfd: use cache GTT buffer for PQ and
> wb pool
>
> Hi Monk,
>
>> The memory we are talking about is pure guest system memory not FB BAR right?
>
> No, that works a bit differently.
>
> QEMU and their virtual drivers uses a virtual PCI MMIO BAR to map buffers from the host into the guest. Those buffers can be both point to the FB BAR as well as system memory.
>
> Only the host knows if the buffer is backed by MMIO (FB) or system memory, so only the host can setup the tables to have the correct caching and WC attributes.
>
> The KVM patch completely breaks that because they suddenly say that not the host setups the caching and WC attributes but the guest.
>
> Because of that this KVM patch not only breaks amdgpu, but also other drivers which do the same thing. The general approach of the patch seems to be broken.
>
>>           if (kq->dev->kfd->device_info.doorbell_size == 8) {
>>                   *kq->wptr64_kernel = kq->pending_wptr64;
>> +               mb();
>>                   write_kernel_doorbell64(kq->queue->properties.doorbell_ptr,
>>                                           kq->pending_wptr64);
>>           } else {
>>                   *kq->wptr_kernel = kq->pending_wptr;
>> +               mb();
>>                   write_kernel_doorbell(kq->queue->properties.doorbell_ptr,
>>                                           kq->pending_wptr);
>
> We should probably move that into
> write_kernel_doorbell()/write_kernel_doorbell64() but in general looks correct to me as well.
>
> Regards,
> Christian.
>
> Am 12.11.24 um 04:50 schrieb Liu, Monk:
>> [AMD Official Use Only - AMD Internal Distribution Only]
>>
>> Hi Lijo
>>
>> Thanks for your quote to the SPEC, yes I read that as well, supposedly WC storage buffer shall be flushed to memory with mfence, but it just doesn't work after applied that KVM patch which honor the WC attributes from guest.
>>
>> This means the WC storage buffer from a KVM guest couldn’t be flushed
>> to memory even with mb inserted... I don't know why. (is there any
>> inconsistencies between nested page tables and host page tables?)
>>
>> Hi Christian
>>
>>>> The guest sees driver exposed memory as "emulated" MMIO BAR of a PCIe device, that's just how QEMU is designed. Because of this the guest maps the memory USWC or uncached. But in reality the memory can as well be GTT memory which is cached.
>> The memory we are talking about is pure guest system memory not FB BAR right? Therefore, I don't see why in reality it has to be cached, you see that only guest vCPU is accessing this buffer.
>>
>>
>> BTW: we should aways insert that MB() in our KFD part:
>>
>>>>           if (kq->dev->kfd->device_info.doorbell_size == 8) {
>>>>                   *kq->wptr64_kernel = kq->pending_wptr64;
>>>> +               mb();
>>>>                   write_kernel_doorbell64(kq->queue->properties.doorbell_ptr,
>>>>                                           kq->pending_wptr64);
>>>>           } else {
>>>>                   *kq->wptr_kernel = kq->pending_wptr;
>>>> +               mb();
>>>>                   write_kernel_doorbell(kq->queue->properties.doorbell_ptr,
>>>>                                           kq->pending_wptr);
>>>>           }
>> Thanks
>>
>> Monk Liu | Cloud GPU & Virtualization | AMD
>>
>>
>>
>> -----Original Message-----
>> From: Christian König <ckoenig.leichtzumerken@xxxxxxxxx>
>> Sent: Monday, November 11, 2024 8:40 PM
>> To: Lazar, Lijo <Lijo.Lazar@xxxxxxx>; Liu, Monk <Monk.Liu@xxxxxxx>;
>> Koenig, Christian <Christian.Koenig@xxxxxxx>; Alex Deucher
>> <alexdeucher@xxxxxxxxx>; Zhao, Victor <Victor.Zhao@xxxxxxx>
>> Cc: amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Yang, Philip
>> <Philip.Yang@xxxxxxx>; Kuehling, Felix <Felix.Kuehling@xxxxxxx>
>> Subject: Re: [PATCH 2/2] drm/amdkfd: use cache GTT buffer for PQ and
>> wb pool
>>
>> Hi guys,
>>
>> trying to merge multiple responses back into one mail thread.
>>
>> Lijo, you're completely right, but there is one really strong argument you are missing:
>>
>> The patch here doesn't change the write combine behavior, it changes the caching behavior!
>>
>> So when the patch works and fixes the issue, then the issue is 100%
>> certain not a write combine issue but rather an incorrectly applied
>> caching :)
>>
>>>   From what I heard there was a KVM patch to correct the mapping behavior -- previously the mapping is forced to cache mode even guest KMD maps the buffer with "WC".
>>> But after the patch the buffer will be really "WC" if guest maps it with WC"... and this reveals the bug and hit our issue.
>> Yeah, exactly that's the problem. The guest doesn't know if WB, USWC or uncached should be used!
>>
>> The guest sees driver exposed memory as "emulated" MMIO BAR of a PCIe device, that's just how QEMU is designed. Because of this the guest maps the memory USWC or uncached. But in reality the memory can as well be GTT memory which is cached.
>>
>> So forcing the cache mode even if the guest KMD maps the buffer with "WC" is intentional behavior, that's the correct approach.
>>
>> The upstream people realized that as well. That's one major reason why the patch was reverted.
>>
>>> Can you explain why USWC for ring buffer is safe, why it will not hit coherence issue, why you don't make all ring buffers with USWC even for amdgpu side if you really believe USWC is innocent ...
>> We already tried this, but there is simply no benefit for it.
>>
>> Regards,
>> Christian.
>>
>>
>> Am 11.11.24 um 06:58 schrieb Lazar, Lijo:
>>> On 11/11/2024 7:00 AM, Liu, Monk wrote:
>>>> [AMD Official Use Only - AMD Internal Distribution Only]
>>>>
>>>> Hi Lijo
>>>>
>>>> This is the patch we verified before:
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_kernel_queue.c
>>>> b/drivers/gpu/drm/amd/amdkfd/kfd_kernel_queue.c
>>>> index 4843dcb9a5f7..39553c7648eb 100644
>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_kernel_queue.c
>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_kernel_queue.c
>>>> @@ -308,10 +308,12 @@ int kq_submit_packet(struct kernel_queue *kq)
>>>>
>>>>           if (kq->dev->kfd->device_info.doorbell_size == 8) {
>>>>                   *kq->wptr64_kernel = kq->pending_wptr64;
>>>> +               mb();
>>>>                   write_kernel_doorbell64(kq->queue->properties.doorbell_ptr,
>>>>                                           kq->pending_wptr64);
>>>>           } else {
>>>>                   *kq->wptr_kernel = kq->pending_wptr;
>>>> +               mb();
>>>>                   write_kernel_doorbell(kq->queue->properties.doorbell_ptr,
>>>>                                           kq->pending_wptr);
>>>>           }
>>>>
>>>>
>>>> This mb() doesn't resolve the problem during customer's testing, I also thought of MB() first in beginning like you and Christian ...
>>>> The mb() here shall resolve the re-ordering between WPTR and doorbell kicking so GPU won't fetch stalled data from WPTR polling once it receives notification from doorbell kicking.
>>>> (in SR-IOV we set doorbell mode to force GPU still fetch from WPTR
>>>> polling area, doorbell kicking is just to notify GPU)
>>>>
>>>> And by your theory: mb() shall flush the WC storage buffer to memory, thus, this mb() shall also make sure that the ring buffer is not holding stalled data anymore, right ?
>>> This is not my theory. All the quotes in my earlier mails are from
>>> "Intel® 64 and IA-32 Architectures Software Developer’s Manual".
>>>
>>> Yet another one -
>>>
>>> <snip>
>>>
>>> Writes may be delayed and combined in the write combining buffer (WC
>>> buffer) to reduce memory accesses.
>>>
>>> "If the WC buffer is partially filled, the writes may be delayed
>>> until the next occurrence of a serializing event; such as an SFENCE
>>> or MFENCE instruction, CPUID or other serializing instruction, a
>>> read or write to uncached memory, an interrupt occurrence, or an
>>> execution of a LOCK instruction (including one with an XACQUIRE or
>>> XRELEASE prefix)."
>>>
>>> </snip>
>>>
>>>
>>>> But we still hit hang and get stalled data from dump.
>>>>
>>>> Maybe we need to put mb() in another place ? can you proposal if you insist the cache mapping is not acceptable to you and we can ask customer to verify again.
>>>>
>>> There are a couple of things which are still working strangely, they
>>> will need some explanation as well -
>>>
>>>        Even though write pointer allocations are also to WC region,
>>> and are correctly seen by CP every time. Since it needs to get wptr
>>> updates from memory rather than doorbell value, don't know how your
>>> snooping theory works for this. Also, it is weird that WC writes to
>>> those pass MQD buffer writes even with fence.
>>>
>>>        MQD allocation for user queues are still from USWC as per
>>> your latest patch and they still work correctly.
>>>
>>> For debug -
>>>        Is the WPTR reg value correctly reflecting the memory value?
>>> Have you tried initializing MQD to a known pattern (memset32/64)
>>> like a series of 0xcafedead and verified what is seen at the CP side?
>>>
>>> Thanks,
>>> Lijo
>>>
>>>> Thanks
>>>>
>>>> Monk Liu | Cloud GPU & Virtualization | AMD
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Lazar, Lijo <Lijo.Lazar@xxxxxxx>
>>>> Sent: Friday, November 8, 2024 7:26 PM
>>>> To: Liu, Monk <Monk.Liu@xxxxxxx>; Koenig, Christian
>>>> <Christian.Koenig@xxxxxxx>; Alex Deucher <alexdeucher@xxxxxxxxx>;
>>>> Zhao, Victor <Victor.Zhao@xxxxxxx>
>>>> Cc: amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Yang, Philip
>>>> <Philip.Yang@xxxxxxx>; Kuehling, Felix <Felix.Kuehling@xxxxxxx>
>>>> Subject: Re: [PATCH 2/2] drm/amdkfd: use cache GTT buffer for PQ
>>>> and wb pool
>>>>
>>>>
>>>>
>>>> On 11/8/2024 4:29 PM, Liu, Monk wrote:
>>>>> [AMD Official Use Only - AMD Internal Distribution Only]
>>>>>
>>>>> To be clear for the mb() approach: Even if we insert mb() in prior to amdgpu_ring_set_wptr(ring), GPU might still fetch stalled data from PQ due to USWC attributes.
>>>>>
>>>> Inserting an mb() doesn't cause WC buffer flush is a wrong assumption.
>>>>
>>>> All prior loads/stores are supposed to be globally visible. Hence mb() followed by a write pointer update also should guarantee the same (From Arch manual).
>>>>
>>>>           The MFENCE instruction establishes a memory fence for
>>>> both loads and stores. The processor ensures that no load or store
>>>> after MFENCE will become globally visible *until all loads and
>>>> stores before MFENCE are globally visible.*
>>>>
>>>>
>>>> Ignoring the amdgpu driver part of it - if mb() is not working as expected as you claim that means something is wrong with the system.
>>>>
>>>> USWC or WB for ring type may still be a separate discussion.
>>>>
>>>> Thanks,
>>>> Lijo
>>>>
>>>>> The issue here is not the re-ordering but the stalled PQ.
>>>>>
>>>>> Monk Liu | Cloud GPU & Virtualization | AMD
>>>>>
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Liu, Monk
>>>>> Sent: Friday, November 8, 2024 6:22 PM
>>>>> To: Koenig, Christian <Christian.Koenig@xxxxxxx>; Lazar, Lijo
>>>>> <Lijo.Lazar@xxxxxxx>; Alex Deucher <alexdeucher@xxxxxxxxx>; Zhao,
>>>>> Victor <Victor.Zhao@xxxxxxx>
>>>>> Cc: amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Yang, Philip
>>>>> <Philip.Yang@xxxxxxx>; Kuehling, Felix <Felix.Kuehling@xxxxxxx>
>>>>> Subject: RE: [PATCH 2/2] drm/amdkfd: use cache GTT buffer for PQ
>>>>> and wb pool
>>>>>
>>>>> Christian/Lijo
>>>>>
>>>>> We verified all approaches, and we know what works and not works, obviously the mb() doesn't work.
>>>>>
>>>>> The way of mb() between set wptr_polling and kicking off doorbell is theoretically correct, and I'm not object to do so... but this won't resolve the issue we hit.
>>>>> First of all, USWC will have some chance that the data is still in CPU's WC storage place and not flushed to the memory and even with MB() get rid of re-ordering GPU might still have a chance to read stalled data from ring buffer as long as it is mapped USWC.
>>>>>
>>>>> This is why only cache plus snoop memory can get rid of inconsistence issues.
>>>>>
>>>>> Besides, do you know what's the rational to keep all GFX KCQ cache+snoop but only HIQ/KIQ from KFD configured to USWC ?
>>>>>
>>>>> For performance concern that looks to me always the second priority compared to "correct" especially under the case HIQ contributes very little to ROCM performance when switching to cache mapping.
>>>>>
>>>>>
>>>>> Monk Liu | Cloud GPU & Virtualization | AMD
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Koenig, Christian <Christian.Koenig@xxxxxxx>
>>>>> Sent: Thursday, November 7, 2024 7:57 PM
>>>>> To: Lazar, Lijo <Lijo.Lazar@xxxxxxx>; Alex Deucher
>>>>> <alexdeucher@xxxxxxxxx>; Zhao, Victor <Victor.Zhao@xxxxxxx>
>>>>> Cc: amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Liu, Monk <Monk.Liu@xxxxxxx>;
>>>>> Yang, Philip <Philip.Yang@xxxxxxx>; Kuehling, Felix
>>>>> <Felix.Kuehling@xxxxxxx>
>>>>> Subject: Re: [PATCH 2/2] drm/amdkfd: use cache GTT buffer for PQ
>>>>> and wb pool
>>>>>
>>>>> Am 07.11.24 um 06:58 schrieb Lazar, Lijo:
>>>>>> On 11/6/2024 8:42 PM, Alex Deucher wrote:
>>>>>>> On Wed, Nov 6, 2024 at 1:49 AM Victor Zhao <Victor.Zhao@xxxxxxx> wrote:
>>>>>>>> From: Monk Liu <Monk.Liu@xxxxxxx>
>>>>>>>>
>>>>>>>> As cache GTT buffer is snooped, this way the coherence between
>>>>>>>> CPU write and GPU fetch is guaranteed, but original code uses
>>>>>>>> WC
>>>>>>>> + unsnooped for HIQ PQ(ring buffer) which introduces coherency issues:
>>>>>>>> MEC fetches a stall data from PQ and leads to MEC hang.
>>>>>>> Can you elaborate on this?  I can see CPU reads being slower
>>>>>>> because the memory is uncached, but the ring buffer is mostly writes anyway.
>>>>>>> IIRC, the driver uses USWC for most if not all of the other ring
>>>>>>> buffers managed by the kernel.  Why aren't those a problem?
>>>>>> We have this on other rings -
>>>>>>            mb();
>>>>>>            amdgpu_ring_set_wptr(ring);
>>>>>>
>>>>>> I think the solution should be to use barrier before write
>>>>>> pointer updates rather than relying on PCIe snooping.
>>>>> Yeah, completely agree as well. The barrier also takes care of preventing the compiler from re-ordering writes.
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>> Thanks,
>>>>>> Lijo
>>>>>>
>>>>>>> Alex
>>>>>>>
>>>>>>>> Signed-off-by: Monk Liu <Monk.Liu@xxxxxxx>
>>>>>>>> ---
>>>>>>>>     drivers/gpu/drm/amd/amdkfd/kfd_device.c | 2 +-
>>>>>>>>     1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>>>>> b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>>>>> index 1f1d79ac5e6c..fb087a0ff5bc 100644
>>>>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>>>>> @@ -779,7 +779,7 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
>>>>>>>>            if (amdgpu_amdkfd_alloc_gtt_mem(
>>>>>>>>                            kfd->adev, size, &kfd->gtt_mem,
>>>>>>>>                            &kfd->gtt_start_gpu_addr, &kfd->gtt_start_cpu_ptr,
>>>>>>>> -                       false, true)) {
>>>>>>>> +                       false, false)) {
>>>>>>>>                    dev_err(kfd_device, "Could not allocate %d bytes\n", size);
>>>>>>>>                    goto alloc_gtt_mem_failure;
>>>>>>>>            }
>>>>>>>> --
>>>>>>>> 2.34.1
>>>>>>>>
>