[PATCH 14/25] drm/amdkfd: Populate DRM render device minor

christian.koenig@xxxxxxx (Christian König) · Wed, 14 Feb 2018 08:42:13 +0100

Am 14.02.2018 um 00:17 schrieb Felix Kuehling:
> On 2018-02-13 02:18 PM, Felix Kuehling wrote:
>> On 2018-02-13 01:15 PM, Christian KÃ¶nig wrote:
>>> Am 13.02.2018 um 18:18 schrieb Felix Kuehling:
>>>> On 2018-02-13 12:06 PM, Christian KÃ¶nig wrote:
>>>>> [SNIP]
>>>>> Ah, yeah that is also a point I wanted to to talk about with you.
>>>>>
>>>>> The approach of using the same buffer object with multiple amdgpu
>>>>> devices doesn't work in general.
>>>>>
>>>>> We need separate TTM object for each BO in each device or otherwise we
>>>>> break A+A laptops.
>>>> I think it broke for VRAM BOs because we enabled P2P on systems that
>>>> didn't support it properly. But at least system memory BOs can be shared
>>>> quite well between devices and we do it all the time.
>>> Sharing VRAM BOs is one issue, but the problem goes deeper than just
>>> that.
>>>
>>> Starting with Carizzo we can scanout from system memory to avoid the
>>> extra copy on A+A laptops. For this to work we need the BO mapped to
>>> GART (and I mean a real VMID0 mapping, not just in the GTT domain).
>>> And for this to work in turn we need a TTM object per device and not a
>>> global one.
>> I still don't understand. I think what you're talking about applies only
>> to BOs used for scan-out. Every BO is allocated from a specific device
>> and can only be GART-mapped on that device.

Exactly that assumption is incorrect. BOs can be GART mapped into any 
device.

>>   What we do is map the same
>> BO in VMs on other devices. It has no effect on GART mappings.

Correct VM mapping is unaffected here.

>>>> I don't see how you can have separate TTM objects referring to the
>>>> same memory.
>>> Well that is trivial, we do this all the time with prime and I+A laptops.
>> As I understand it, you use DMABuf to export/import buffers on multiple
>> devices. I believe all devices share a single amdgpu_bo, which contains
>> the ttm_buffer_object.

That's incorrect as well. Normally multiple devices have multiple 
ttm_buffer_object, one for each device.

Going a bit higher that actually makes sense because the status of each 
BO is deferent for each device. E.g. one device could have the BO in 
access while it could be idle on another device.

Same is true for placement of the BO. E.g. a VRAM placement of one 
device is actually a GTT placement for another.

>>   The only way you can have a new TTM buffer object
>> per device is by using SG tables and pinning the BOs. But I think we
>> want to avoid pinning BOs.
>>
>> What we do does not involve pinning of BOs, even when they're shared
>> between multiple devices' VMs.
>>
>>>>> That is also the reason we had to disable this feature again in the
>>>>> hybrid branches.
>>>> What you disabled on hybrid branches was P2P, which only affects
>>>> large-BAR systems. Sharing of system memory BOs between devices still
>>>> works fine.
>>> No, it doesn't. It completely breaks any scanout on Carizzo, Stoney
>>> and Raven. Additional to that we found that it breaks some aspects of
>>> the user space interface.
>> Let me check that with my current patch series. The patches I submitted
>> here shouldn't include anything that breaks the use cases you describe.
>> But I'm quite sure it will support sharing BOs between multiple devices'
>> VMs.
> Confirmed with the current patch series. I can allocate buffers on GPU 1
> and map them into a VM on GPU 2.

As I said VM mapping is not the problem here.

> BTW, this is the normal mode of operation for system memory BOs on
> multi-GPU systems. Logically system memory BOs belong to the CPU node in
> the KFD topology. So the Thunk API isn't told which GPU to allocate the
> system memory BO from. It just picks the first one. Then we can map
> those BOs on any GPUs we want. If we want to use them on GPU2, that's
> where they get mapped.

Well keeping NUMA in mind that actually sounds like a design problem to me.

On NUMA system some parts of the system memory can be "closer" to a GPU 
than other parts.

> So I just put two GPUs in a system and ran a test on GPU2. The system
> memory buffers are allocated from the GPU1 device, but mapped into the
> GPU2 VM. The tests work normally.
>
> If this is enabled by any changes that break existing buffer sharing for
> A+A or A+I systems, please point it out to me. I'm not aware that this
> patch series does anything to that effect.

As I said it completely breaks scanout with A+I systems.

Over all it looks like the change causes more problems than it solves. 
So I'm going to block upstreaming it until we have found a way to make 
it work for everybody.

Regards,
Christian.

>
> Regards,
>  Â  Felix
>
>> Regards,
>>  Â  Felix
>>
>>> So end result is that we probably need to revert it and find a
>>> different solution. I'm already working on this for a couple of weeks
>>> now and should have something ready after I'm done with the PASID
>>> handling.
>>>
>>> Regards,
>>> Christian.