On Thu, Nov 19, 2020 at 04:27:00PM +0100, Christian König wrote:
> On 18.11.20 at 22:15, Daniel Vetter wrote:
> > On Wed, Nov 18, 2020 at 02:35:24PM +0100, Christian König wrote:
> > > On 17.11.20 at 18:19, Daniel Vetter wrote:
> > > > On Tue, Nov 17, 2020 at 03:06:15PM +0100, Christian König wrote:
> > > > > Increase the amount of system memory drivers can use to about 90% of the total available.
> > > > >
> > > > > Signed-off-by: Christian König <christian.koenig@xxxxxxx>
> > > > > ---
> > > > >  drivers/gpu/drm/ttm/ttm_bo.c | 2 +-
> > > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
> > > > > index a958135cb3fe..0a93df93dba4 100644
> > > > > --- a/drivers/gpu/drm/ttm/ttm_bo.c
> > > > > +++ b/drivers/gpu/drm/ttm/ttm_bo.c
> > > > > @@ -1267,7 +1267,7 @@ static int ttm_bo_global_init(void)
> > > > >  	 * the available system memory.
> > > > >  	 */
> > > > >  	num_pages = (u64)si.totalram * si.mem_unit;
> > > > > -	num_pages = (num_pages * 50 / 100) >> PAGE_SHIFT;
> > > > > +	num_pages = (num_pages * 90 / 100) >> PAGE_SHIFT;
> > > >
> > > > I don't think this is the design we want. As long as it was set at "half of system memory" it was clear that a) it's a hack, b) precision didn't matter.
> > >
> > > Yeah, I'm also wondering what to do here exactly. In general I would completely nuke that limit if possible.
> > >
> > > > But if you go to the limit and still want to keep the "we make sure there's no OOM", then precision starts to matter:
> > > > - memory hotplug and hotunplug is a thing
> > > > - userspace can mlock, and it's configurable
> > > > - drivers can pin_user_pages for IO and random other stuff. Some of it is accounted as some subsystem specific mlock (like rdma does iirc), some is just yolo or short term enough (like)
> > > > - none of what we do here takes into consideration any interactions with core mm tracking (like cgroups or numa or anything like that)
> > >
> > > OOM is perfectly fine with me, we should just not run into an OOM killer situation because we call shmem_read_mapping_page_gfp() in the shrinker callback.
> > >
> > > Any idea how we could guarantee that?
> >
> > Uh ...
> >
> > I just realized I missed something between how ttm works and how i915 works. With i915 directly using shmem, all we do in the shrinker is unpin the shmem pages. With ttm we have the page pool in the middle. And it's needed for dma coherent and other things. This is rather fundamental.
>
> Yeah, the WC/UC handling is otherwise way too slow. Only for normal WB allocations could we do this.
>
> > btw I don't think you'll get OOM, you get a lockdep splat because you're recursing on the fs_reclaim fake lock. We should probably put a might_alloc() into shmem_read_mapping_page_gfp().
>
> Yeah, one reason why I haven't enabled that yet.
>
> > > > If we want to drop the "half of system ram" limit (and yes that makes sense) I think the right design is:
> > > >
> > > > - we give up on the "no OOM" guarantee.
> > > >
> > > > - This means if you want real isolation of tasks, we need cgroups, and we need to integrate ttm cgroups with system memory cgroups somehow. Unlike randomly picked hardcoded limits this should work a lot more reliably and be a lot more useful in practical use in the field.
> > > >
> > > > - This also means that drivers start to fail in interesting ways.
> > > > I think locking headaches are covered with all the lockdep annotations I've pushed, plus some of the things I still have in-flight (I have a might_alloc() annotation somewhere). That leaves validation of error paths for when allocations fail. IME a very effective way we used in i915 is (ab)using EINTR restarting, which per the drmIoctl uapi spec is required. We could put a debug mode into ttm_tt which randomly fails with -EINTR to make sure it's all working correctly (plus anything else that allocates memory), and unlike real out-of-memory injection piglit and any other cts will complete without failure. Which gives us an excellent metric for "does it work". Actual OOM, even an injected one, tends to make stuff blow up in a way that's very hard to track and make sure you've got good coverage, since all your usual tests die pretty quickly.
> > > >
> > > > - ttm_tt needs to play fair with every other system memory user. We need to register a shrinker for each ttm_tt (so usually one per device I guess), which walks the lru (in shrink_count) and uses dma_resv_trylock for actual shrinking. We probably want to move it to SYSTEM first for that shrinker to pick up, so that there's some global fairness across all ttm_tt.
> > >
> > > I already have patches for this.
> > >
> > > What's still missing is teaching the OOM killer which task is using the memory, since memory referenced through file descriptors is currently not accounted towards any process resources.
> >
> > Yeah I don't think we've fixed that problem for i915 either. It loves to kill random other stuff. In igt we solve this by marking any igt testcase as "pls kill this first". Well, "solve".
>
> Well that is a "creative" solution :)
>
> I will rather be looking into whether we can somehow track which files_struct a file belongs to and whether we can somehow use this in the OOM killer.
>
> My last try at giving each file something wasn't received well.
>
> The real bummer is that you can really nicely create a denial of service using memfd_create() because of this :)

Yeah, proper accounting is the right fix here, I agree.

> > > > - for GFP_DMA32 that means zone aware shrinkers. We've never used those, because thus far i915 didn't have any big need for low memory, so we haven't used this in practice. But it's supposed to be a thing.
> > >
> > > I think we can mostly forget about GFP_DMA32, this should only be used for AGP and other ancient hardware.
> >
> > Ok, that's good. So having an arbitrary "only half of GFP_DMA32" for these should also be acceptable, since back then gpus didn't really have gigabytes of vram yet. The biggest r500 seems to have topped out at 512MB.
> >
> > How much do we require the dma page pool? Afaiui it's only needed when there are bounce buffers or something nasty like that present. Are there more use cases?
>
> Not that I know of.
>
> > For fixing the TT -> SYSTEM problem of calling shmem_read_mapping_page_gfp from the shrinker I think there are 3 solutions:
> >
> > 1) Call shmem_read_mapping_page_gfp with GFP_IO. This way we should not be calling back into our own shrinker, which is at the GFP_FS level. This isn't great, but since page reclaim is a gigantic loop which repeats as long as we're making meaningful forward progress, we should be able to trickle TT pages to SYSTEM pages even under severe memory pressure.
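As a rough illustration of 1) (with the GFP_NOFS correction noted below applied), a minimal sketch of what the TT -> SYSTEM copy could look like: shmem is asked for the backing page with a gfp mask that drops __GFP_FS, so shmem_read_mapping_page_gfp() cannot recurse back into our own fs_reclaim-level shrinker. The ttm_tt_swapout_page() helper and its arguments are made up for this sketch and are not existing TTM code; only shmem_read_mapping_page_gfp(), copy_highpage() and friends are real kernel API.

#include <linux/err.h>
#include <linux/gfp.h>
#include <linux/highmem.h>
#include <linux/mm.h>
#include <linux/shmem_fs.h>
#include <linux/swap.h>

/* Hypothetical helper: copy one TT page into its shmem backing store. */
static int ttm_tt_swapout_page(struct address_space *swap_space,
			       pgoff_t index, struct page *src)
{
	/*
	 * GFP_KERNEL minus __GFP_FS is GFP_NOFS: reclaim and IO are still
	 * allowed, but we cannot re-enter filesystem/shrinker paths.
	 */
	gfp_t gfp = GFP_KERNEL & ~__GFP_FS;
	struct page *dst;

	dst = shmem_read_mapping_page_gfp(swap_space, index, gfp);
	if (IS_ERR(dst))
		return PTR_ERR(dst);

	copy_highpage(dst, src);
	set_page_dirty(dst);
	mark_page_accessed(dst);
	put_page(dst);
	return 0;
}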
>
> This is most likely the way to go for us.

GFP_NOFS is the flag, I mixed them up.

> > 2) For the normal page pool (i.e. neither GFP_DMA32 nor dma_alloc_coherent or write-combine) we stop copying the pages for TT <-> SYSTEM, but just directly pin the pages we get from shmem, like all the shmem based soc drivers do. No copying, no need for allocations, no problem in shrinkers.
>
> Not really doable, WC is just too widely used and shmem doesn't support larger orders.

Hm, I thought you get large pages automatically if you reassemble them, it's just the interface that sucks. Btw did you look at 3) below? It might break dma-api assumptions a bit too badly since we're doing nice page flushing behind everyone's back.
-Daniel

> Regards,
> Christian.
>
> > 3) For the remaining page pools we'd need ttmfs so that we can control the address_space_operations and swap out directly from our pages, bypassing shmem and the copying. Iirc Chris Wilson had a prototype once somewhere under gemfs for this. Only needed if 1) isn't fast enough for these because there are too many such cases where we care.
> >
> > At least on intel hw the direction is very much towards fully coherent, so 1&2 should be good enough.
> > -Daniel
> >
> > > Christian.
> > >
> > > > It's a bit more code than the oneliner above, but I also think it's a lot more solid. Plus it would resolve the last big issue where i915 gem is fairly fundamentally different compared to ttm. For that question I think once Maarten's locking rework for i915 has landed and all the other ttm rework from you and Dave is in, we've resolved them all.
> > > >
> > > > >  	/* But for DMA32 we limit ourself to only use 2GiB maximum. */
> > > > >  	num_dma32_pages = (u64)(si.totalram - si.totalhigh) * si.mem_unit;
> > > > > --
> > > > > 2.25.1

--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
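For reference, a very rough sketch of the per-device shrinker direction discussed earlier in the thread: count reports how many pages could be given back, scan walks an LRU and uses dma_resv_trylock() so reclaim never blocks on a contended reservation. Everything named ttm_shrink_* here, including the ttm_shrinkable wrapper and the ttm_tt_swapout_entry() helper, is a placeholder for this sketch and not existing TTM API; the shrinker, list and dma_resv calls are real kernel interfaces.

#include <linux/atomic.h>
#include <linux/dma-resv.h>
#include <linux/list.h>
#include <linux/shrinker.h>
#include <linux/spinlock.h>

/* Placeholder bookkeeping for objects whose TT pages could be given back. */
struct ttm_shrinkable {
	struct list_head lru;
	struct dma_resv *resv;
	unsigned long nr_pages;
};

static LIST_HEAD(ttm_shrink_lru);
static DEFINE_SPINLOCK(ttm_shrink_lock);
static atomic_long_t ttm_shrink_pages = ATOMIC_LONG_INIT(0);

/* Placeholder: a real version would move the TT pages to SYSTEM/shmem here. */
static unsigned long ttm_tt_swapout_entry(struct ttm_shrinkable *entry)
{
	atomic_long_sub(entry->nr_pages, &ttm_shrink_pages);
	return entry->nr_pages;
}

static unsigned long ttm_tt_shrinker_count(struct shrinker *shrink,
					   struct shrink_control *sc)
{
	/* 0 tells core reclaim there is nothing for us to do. */
	return atomic_long_read(&ttm_shrink_pages);
}

static unsigned long ttm_tt_shrinker_scan(struct shrinker *shrink,
					  struct shrink_control *sc)
{
	unsigned long freed = 0, scanned = 0;

	spin_lock(&ttm_shrink_lock);
	while (freed < sc->nr_to_scan && scanned < sc->nr_to_scan &&
	       !list_empty(&ttm_shrink_lru)) {
		struct ttm_shrinkable *entry =
			list_first_entry(&ttm_shrink_lru,
					 struct ttm_shrinkable, lru);

		scanned++;
		/* Never block in reclaim, just skip contended objects. */
		if (!dma_resv_trylock(entry->resv)) {
			list_move_tail(&entry->lru, &ttm_shrink_lru);
			continue;
		}

		list_del_init(&entry->lru);
		spin_unlock(&ttm_shrink_lock);

		freed += ttm_tt_swapout_entry(entry);

		dma_resv_unlock(entry->resv);
		spin_lock(&ttm_shrink_lock);
	}
	spin_unlock(&ttm_shrink_lock);

	return freed ? freed : SHRINK_STOP;
}

static struct shrinker ttm_tt_shrinker = {
	.count_objects	= ttm_tt_shrinker_count,
	.scan_objects	= ttm_tt_shrinker_scan,
	.seeks		= DEFAULT_SEEKS,
};

/* Registered once per device (or globally) with register_shrinker(&ttm_tt_shrinker). */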