On 2018-04-20 09:40 PM, Felix Kuehling wrote:
> On 2018-04-20 10:47 AM, Michel Dänzer wrote:
>> On 2018-04-11 11:37 AM, Christian König wrote:
>>> Am 11.04.2018 um 06:00 schrieb Gabriel C:
>>>> 2018-04-09 11:42 GMT+02:00 Christian König <ckoenig.leichtzumerken@xxxxxxxxx>:
>>>>> Am 07.04.2018 um 00:00 schrieb Jean-Marc Valin:
>>>>>> Hi Christian,
>>>>>>
>>>>>> Thanks for the info. FYI, I've also opened a Firefox bug for that at:
>>>>>> https://bugzilla.mozilla.org/show_bug.cgi?id=1448778
>>>>>> Feel free to comment since you have a better understanding of what's going on.
>>>>>>
>>>>>> One last question: right now I'm running 4.15.0 with the "offending" patch reverted. Is that safe to run, or are there possible bad interactions with other changes?
>>>>> That should work without problems.
>>>>>
>>>>> But I just had another idea as well: if you want, you could still test the new code path which will be used in 4.17.
>>>>>
>>>> While Firefox may do some strange things, it's not only about Firefox.
>>>>
>>>> With your patches my EPYC box is unusable with 4.15++ kernels. The whole desktop is acting weird. This one is using a Cape Verde PRO [Radeon HD 7750/8740 / R7 250E] GPU.
>>>>
>>>> The box is 2 * EPYC 7281 with 128 GB ECC RAM.
>>>>
>>>> Also, a 14C Xeon box with an HD 7700 is broken the same way.
>>> The hardware is irrelevant for this. We need to know what software stack you use on top of it.
>>>
>>> E.g. desktop environment, Mesa and DDX versions, etc.
>>>
>>>> Everything breaks in X: scrolling, moving windows, flickering, etc.
>>>>
>>>> Reverting f4c809914a7c3e4a59cf543da6c2a15d0f75ee38 and 648bc3574716400acc06f99915815f80d9563783 from a 4.15 kernel makes things work again.
>>>>
>>>>> Backporting all the detection logic is too invasive, but you could just go into drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c and forcefully use the other code path.
>>>>>
>>>>> Just look out for "#ifdef CONFIG_SWIOTLB" checks and disable those.
>>>>>
>>>> Well, you really can't be serious about these suggestions, are you?
>>>>
>>>> Telling people to #if 0 random code is not a solution.
>>> That is for testing and not a permanent solution.
>>>
>>>> You broke existing, working userland with your patches; please at least fix that for 4.16.
>>>>
>>>> I can help test code for 4.17++ if you wish, but that is a *different* story.
>>> Please test Alex's amd-staging-drm-next branch from git://people.freedesktop.org/~agd5f/linux.
>> I think we're still missing something here.
>>
>> I'm currently running 4.16.2 + the DRM subsystem changes which are going into 4.17 (so I have the changes Christian is referring to) with a Kaveri APU, and I'm seeing similar symptoms to those described by Jean-Marc. Some observations:
>>
>> Firefox, Thunderbird, or worse, gnome-shell can freeze for on the order of a minute, during which the kernel spends most of one core's cycles inside alloc_pages (__alloc_pages_nodemask, to be more precise), called from ttm_alloc_new_pages.
> Philip debugged a similar problem with a KFD memory stress test about two weeks ago, where the kernel was seemingly stuck in an infinite loop trying to allocate huge pages. I'm pasting his analysis for the record:
>
>> [...] it uses huge_flags GFP_TRANSHUGE to call alloc_pages(). This seems to hit a corner case inside __alloc_pages_slowpath(): it never exits, but takes the retry path every time. It can reclaim pages and did_some_progress is set (as a result, no_progress_loops is reset to 0 on every loop and never reaches MAX_RECLAIM_RETRIES), but it cannot finish the huge page allocations under this specific memory pressure.
> As a workaround to unblock our release branch testing, we removed transparent huge page allocation from ttm_get_pages. We're seeing this as far back as 4.13 on our release branch.

Thanks for sharing this. In the future, please raise issues like this on the public mailing lists from the beginning.

> If we're really talking about the same problem, I don't think it's caused by recent page allocator changes, but rather exposed by recent TTM changes.

It sounds related, but probably not exactly the same problem. I already had the TTM code using GFP_TRANSHUGE before I ran into the issue. Also, __alloc_pages_slowpath eventually succeeds for me; it can just take up to about a minute.

I'm currently testing using (GFP_TRANSHUGE_LIGHT | __GFP_NORETRY) instead of GFP_TRANSHUGE in TTM.
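For reference, the change being tested is along these lines. This is only an illustrative sketch of the huge page allocation done by ttm_get_pages() in drivers/gpu/drm/ttm/ttm_page_alloc.c; the helper name is made up, the real function carries pool and page-array context, and the exact hunk differs between kernel versions. (GFP_TRANSHUGE is GFP_TRANSHUGE_LIGHT plus __GFP_DIRECT_RECLAIM, so the change effectively drops direct reclaim and disables retries.)

	/* Sketch only -- hypothetical helper, assumes <linux/gfp.h> and
	 * <linux/huge_mm.h>; not the actual TTM code. */
	static struct page *ttm_alloc_huge_page_sketch(gfp_t gfp_flags)
	{
		gfp_t huge_flags = gfp_flags;

		/*
		 * GFP_TRANSHUGE includes __GFP_DIRECT_RECLAIM, so under memory
		 * pressure __alloc_pages_slowpath() keeps reclaiming/compacting
		 * and retrying as long as it makes some progress -- which is
		 * where the long stalls come from:
		 *
		 *	huge_flags |= GFP_TRANSHUGE;
		 */

		/*
		 * GFP_TRANSHUGE_LIGHT avoids direct reclaim, and __GFP_NORETRY
		 * makes the allocator bail out rather than retry, so the huge
		 * page either comes cheaply or the caller falls back to
		 * order-0 pages.
		 */
		huge_flags |= GFP_TRANSHUGE_LIGHT | __GFP_NORETRY;

		return alloc_pages(huge_flags, HPAGE_PMD_ORDER);
	}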
--
Earthling Michel Dänzer            |             http://www.amd.com
Libre software enthusiast          |           Mesa and X developer
_______________________________________________
dri-devel mailing list
dri-devel@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/dri-devel