On 2019-06-13 8:59, Yang, Philip wrote: > On 2019-06-13 4:54 a.m., Koenig, Christian wrote: >> Am 12.06.19 um 23:13 schrieb Yang, Philip: >>> On 2019-06-12 3:28 p.m., Christian König wrote: >>>> Am 12.06.19 um 17:13 schrieb Yang, Philip: >>>>> TTM create two zones, kernel zone and dma32 zone for system memory. If >>>>> system memory address allocated is below 4GB, this account to dma32 zone >>>>> and will exhaust dma32 zone and trigger unnesssary TTM eviction. >>>>> >>>>> Patch "drm/ttm: Account for kernel allocations in kernel zone only" only >>>>> handle the allocation for acc_size, the system memory page allocation is >>>>> through ttm_mem_global_alloc_page which still account to dma32 zone if >>>>> page is below 4GB. >>>> NAK, as the name says the mem_glob is global for all devices in the system. >>>> >>>> So this will break if you mix DMA32 and non DMA32 in the same system >>>> which is exactly the configuration my laptop here has :( >>>> >>> I didn't find path use dma32 zone, but I may missed something. >> Well the point is there is non in our driver, but many drivers in the >> system still need DMA32 memory. >> >>> There is >>> an issue found by KFDTest.BigBufStressTest, it allocates buffers up to >>> 3/8 of total 256GB system memory, each buffer size is 128MB, then use >>> queue to write to the buffers. If ttm_mem_global_alloc_page get page pfn >>> is below 4GB, it account to dma32 zone and will exhaust 2GB limit, then >>> ttm_check_swapping will schedule ttm_shrink_work to start eviction. It >>> takes minutes to finish restore (retry many times if busy), the test >>> failed because queue timeout. This eviction is unnecessary because we >>> still have enough free system memory. >> No that is definitely necessary. For example on my Laptop it's the sound >> system which needs DMA32. >> >> Without this when an application uses a lot of memory we run into the >> OOM killer as soon as some audio starts playing. >> > I did not realize TTM is used by other drivers. I agree we cannot remove > dma32 zone, this will break other device drivers which depends on dma32 > zone. If I understand Christian correctly, the point is not that other drivers use TTM, but other drivers need dma32 memory (memory with 32-bit physical addresses). If TTM used up all that memory, it would break those other drivers. As a good steward of memory, TTM limits its usage of dma32 memory in order to allow other drivers to have access to it. Regards, Felix > >>> It's random case, happens about 1/5. I can change test to increase the >>> timeout value to workaround this, but this seems TTM bug. This will slow >>> application performance a lot if this random issue happens. >> One thing we could try is to dig into why the kernel gives us DMA32 >> pages when there are still other pages available. Please take a look at >> /proc/buddyinfo on that box for this. >> > Thanks for the advise, from buddyinfo, the machine has 4 nodes, each > node has 64GB memory, and dma32 zone is on node 0. BigBufStressTest > allocate 96GB. If I force the test on node 1, "numactl --cpunodebind=1 > ./kfdtest", no eviction at all. If I force the test on node 0, it always > has eviction and restore because it used up all memory including dma32 > zone from node 0. I will change test app to avoid running on node 0 to > workaround this. > > Thanks, > Philip > >> The next thing coming to mind is that we can most likely completely skip >> this if IOMMU is active. >> >> The last thing of hand coming to my mind is to improve TTM to actually >> only evict BOs which use DMA32 memory, cause currently we just evict >> stuff randomly and not check if its DMA32 or other memory. >> >> Regards, >> Christian. >> >>> Thanks, >>> Philip >>> >>> >>>> Christian. >>>> >>>>> Change-Id: I289b85d891b8f64a1422c42b1eab398098ab7ef7 >>>>> Signed-off-by: Philip Yang <Philip.Yang@xxxxxxx> >>>>> --- >>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 4 ++++ >>>>> 1 file changed, 4 insertions(+) >>>>> >>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c >>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c >>>>> index 2778ff63d97d..79bb9dfe617b 100644 >>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c >>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c >>>>> @@ -1686,6 +1686,10 @@ int amdgpu_ttm_init(struct amdgpu_device *adev) >>>>> } >>>>> adev->mman.initialized = true; >>>>> + /* Only kernel zone (no dma32 zone) if device does not require >>>>> dma32 */ >>>>> + if (!adev->need_dma32) >>>>> + adev->mman.bdev.glob->mem_glob->num_zones = 1; >>>>> + >>>>> /* We opt to avoid OOM on system pages allocations */ >>>>> adev->mman.bdev.no_retry = true; > _______________________________________________ > amd-gfx mailing list > amd-gfx@xxxxxxxxxxxxxxxxxxxxx > https://lists.freedesktop.org/mailman/listinfo/amd-gfx _______________________________________________ amd-gfx mailing list amd-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/amd-gfx