Re: [PATCH v14 3/8] drm/ttm/pool: Provide a helper to shrink pages

Thomas Hellström <thomas.hellstrom@xxxxxxxxxxxxxxx> · Wed, 04 Dec 2024 10:56:52 +0100

On Wed, 2024-12-04 at 10:16 +0100, Christian König wrote:
> Am 03.12.24 um 18:44 schrieb Thomas Hellström:
> > On Tue, 2024-12-03 at 17:46 +0100, Christian König wrote:
> > > Am 03.12.24 um 17:43 schrieb Thomas Hellström:
> > > > On Tue, 2024-12-03 at 17:39 +0100, Christian König wrote:
> > > > > Am 03.12.24 um 17:31 schrieb Thomas Hellström:
> > > > > > On Tue, 2024-12-03 at 17:20 +0100, Christian König wrote:
> > > > > > > [SNIP]
> > > > > > > > > > > > @@ -453,9 +601,36 @@ int ttm_pool_alloc(struct
> > > > > > > > > > > > ttm_pool
> > > > > > > > > > > > *pool,
> > > > > > > > > > > > struct ttm_tt *tt,
> > > > > > > > > > > >        	else
> > > > > > > > > > > >        		gfp_flags |= GFP_HIGHUSER;
> > > > > > > > > > > >        
> > > > > > > > > > > > -	for (order = min_t(unsigned int,
> > > > > > > > > > > > MAX_PAGE_ORDER,
> > > > > > > > > > > > __fls(num_pages));
> > > > > > > > > > > > -	     num_pages;
> > > > > > > > > > > > -	     order = min_t(unsigned int,
> > > > > > > > > > > > order,
> > > > > > > > > > > > __fls(num_pages)))
> > > > > > > > > > > > {
> > > > > > > > > > > > +	order = min_t(unsigned int,
> > > > > > > > > > > > MAX_PAGE_ORDER,
> > > > > > > > > > > > __fls(num_pages));
> > > > > > > > > > > > +
> > > > > > > > > > > > +	if (tt->page_flags &
> > > > > > > > > > > > TTM_TT_FLAG_PRIV_BACKED_UP) {
> > > > > > > > > > > > +		if (!tt->restore) {
> > > > > > > > > > > > +			gfp_t gfp = GFP_KERNEL
> > > > > > > > > > > > |
> > > > > > > > > > > > __GFP_NOWARN;
> > > > > > > > > > > > +
> > > > > > > > > > > > +			if (ctx-
> > > > > > > > > > > > > gfp_retry_mayfail)
> > > > > > > > > > > > +				gfp |=
> > > > > > > > > > > > __GFP_RETRY_MAYFAIL;
> > > > > > > > > > > > +
> > > > > > > > > > > > +			tt->restore =
> > > > > > > > > > > > +				kvzalloc(struc
> > > > > > > > > > > > t_si
> > > > > > > > > > > > ze(t
> > > > > > > > > > > > t-
> > > > > > > > > > > > > restore,
> > > > > > > > > > > > old_pages,
> > > > > > > > > > > > +					
> > > > > > > > > > > > 	
> > > > > > > > > > > > (size_t)1
> > > > > > > > > > > > <<
> > > > > > > > > > > > order), gfp);
> > > > > > > > > > > > +			if (!tt->restore)
> > > > > > > > > > > > +				return -
> > > > > > > > > > > > ENOMEM;
> > > > > > > > > > > > +		} else if
> > > > > > > > > > > > (ttm_pool_restore_valid(tt-
> > > > > > > > > > > > > restore)) {
> > > > > > > > > > > > +			struct
> > > > > > > > > > > > ttm_pool_tt_restore
> > > > > > > > > > > > *restore =
> > > > > > > > > > > > tt-
> > > > > > > > > > > > > restore;
> > > > > > > > > > > > +
> > > > > > > > > > > > +			num_pages -= restore-
> > > > > > > > > > > > > alloced_pages;
> > > > > > > > > > > > +			order = min_t(unsigned
> > > > > > > > > > > > int,
> > > > > > > > > > > > order,
> > > > > > > > > > > > __fls(num_pages));
> > > > > > > > > > > > +			pages += restore-
> > > > > > > > > > > > > alloced_pages;
> > > > > > > > > > > > +			r =
> > > > > > > > > > > > ttm_pool_restore_tt(restore,
> > > > > > > > > > > > tt-
> > > > > > > > > > > > > backup, ctx);
> > > > > > > > > > > > +			if (r)
> > > > > > > > > > > > +				return r;
> > > > > > > > > > > > +			caching = restore-
> > > > > > > > > > > > > caching_divide;
> > > > > > > > > > > > +		}
> > > > > > > > > > > > +
> > > > > > > > > > > > +		tt->restore->pool = pool;
> > > > > > > > > > > > +	}
> > > > > > > > > > > Hui? Why is that part of the allocation function
> > > > > > > > > > > now?
> > > > > > > > > > > 
> > > > > > > > > > > At bare minimum I would expect that this is a new
> > > > > > > > > > > function.
> > > > > > > > > > It's because we now have partially backed up tts,
> > > > > > > > > > so
> > > > > > > > > > the
> > > > > > > > > > restore is
> > > > > > > > > > interleaved on a per-page basis, replacing the
> > > > > > > > > > backup
> > > > > > > > > > handles
> > > > > > > > > > with
> > > > > > > > > > page-pointers. I'll see if I can separate out at
> > > > > > > > > > least
> > > > > > > > > > the
> > > > > > > > > > initialization here.
> > > > > > > > > Yeah, that kind of makes sense.
> > > > > > > > > 
> > > > > > > > > My expectation was just that we now have explicit
> > > > > > > > > ttm_pool_swapout()
> > > > > > > > > and
> > > > > > > > > ttm_pool_swapin() functions.
> > > > > > > > I fully understand, although in the allocation step,
> > > > > > > > that
> > > > > > > > would
> > > > > > > > also
> > > > > > > > increase the memory pressure since we might momentarily
> > > > > > > > have
> > > > > > > > twice
> > > > > > > > the
> > > > > > > > bo-size allocated, if the shmem object was never
> > > > > > > > swapped
> > > > > > > > out,
> > > > > > > > and
> > > > > > > > we
> > > > > > > > don't want to unnecessarily risc OOM at recover time,
> > > > > > > > although
> > > > > > > > that
> > > > > > > > should be a recoverable situation now. If the OOM
> > > > > > > > receiver
> > > > > > > > can
> > > > > > > > free
> > > > > > > > up
> > > > > > > > system memory resources they can could potentially
> > > > > > > > restart
> > > > > > > > the
> > > > > > > > recover.
> > > > > > > What I meant was more that we have ttm_pool_swapout()
> > > > > > > which
> > > > > > > does
> > > > > > > a
> > > > > > > mix
> > > > > > > of moving each page to a swap backend and freeing one by
> > > > > > > one.
> > > > > > > 
> > > > > > > And ttm_pool_swapin() which allocates a bit of memory
> > > > > > > (usually
> > > > > > > one
> > > > > > > huge
> > > > > > > page) and then copies the content back in from the swap
> > > > > > > backend.
> > > > > > > 
> > > > > > > Alternatively we could rename ttm_pool_alloc() into
> > > > > > > something
> > > > > > > like
> > > > > > > ttm_pool_populate() and ttm_pool_free() into
> > > > > > > ttm_pool_unpopulate(),
> > > > > > > but
> > > > > > > those names are not very descriptive either.
> > > > > > > 
> > > > > > > It's just that we now do a bit more than just alloc and
> > > > > > > free
> > > > > > > in
> > > > > > > those
> > > > > > > functions, so the naming doesn't really match that well
> > > > > > > any
> > > > > > > more.
> > > > > > So what about ttm_pool_alloc() and
> > > > > > ttm_pool_recover/swapin(),
> > > > > > both
> > > > > > pointing to the same code, but _alloc() asserts that the tt
> > > > > > isn't
> > > > > > backed up?
> > > > > > 
> > > > > > That would give a clean interface at least.
> > > > > More or less ok. I would just put figuring out the gfp flags
> > > > > and
> > > > > the
> > > > > stuff inside the for (order... loop into separate functions.
> > > > > And
> > > > > then
> > > > > remove the if (tt->page_flags & TTM_TT_FLAG_PRIV_BACKED_UP)
> > > > > from
> > > > > the
> > > > > pool.
> > > > > 
> > > > > In other words you trigger the back restore by calling a
> > > > > different
> > > > > function than the allocation one.
> > > > I'll take a look at this as well.
> > > Ah, and BTW: It's perfectly possible that ttm_tt_free() is called
> > > because a halve swapped TT is about to be destroyed!
> > > 
> > > If I'm not completely mistaken that is not handled gracefully
> > > when we
> > > try to always backup from in the ttm_tt_free() function.
> > > 
> > > So we clearly need the separation of move this TT to a backup
> > > (and
> > > eventually only partially) and freeing it.
> > Hm. I'm not sure I follow completely.
> > 
> > The ttm_pool interface is currently:
> > 
> > ttm_pool_alloc() -> allocs and may recover from backup. May leave
> > partially backed up. Called from ttm_tt_populate() or its driver
> > callbacks.
> 
> Yeah that this is done by a single function looks really strange to
> me.
> 
> > ttm_pool_backup_tt() -> Attempts to back up (the not already backed
> > up
> > part of a tt. Called from ttm_tt_backup(), which is just a tt layer
> > wrapper. If called with purge==true, then frees memory bypassing
> > the
> > pool to return it to the system directly.
> > 
> > ttm_pool_free() -> Frees a (potentially backed up or partially
> > backed
> > up) tt. Called from ttm_tt_unpopulate() or its driver callbacks.
> 
> Ah! I missed that you have separated that functionality from the free
> path.
> 
> I've only saw the allocation path and though I need to clear that up
> first.
> 
> > So the backup functionality is implemented with a minimal change to
> > upper layers, and I don't think there is a correctness problem on
> > free().
> > 
> > So could you clarify a bit if it is this interface you think needs
> > changing or that the implementation is better at separating out the
> > backup functionality from the pool functionality?
> 
> I think we should just make the ttm pool object take charge of 
> allocation, backup, restore and free operation on the TT objects.
> 
> And all of those are separate operations, they just internally share 
> steps to archive what they should do.

So are we looking at an interface change like:

ttm_pool_alloc() // no recover. Errors if backed-up-data present.
ttm_pool_alloc_and_recover() // because you can't alloc first and then
recover in a memory-efficient manner, since you need to interleave.
ttm_pool_backup() // as currently
ttm_pool_drop_backed_up() //drops the backed-up data if any.
ttm_pool_free() // frees all data. errors if backed-up-data present.

Is this what you mean?

> 
> BTW I really dislike that tt->restore is allocated dynamically. That
> is 
> just another allocation which can cause problems.

> 
> We should probably have all the state necessary for the operation in
> the 
> TT object.

Initially it was done this way. But that meant a pre-allocated struct
page-pointer array the of 1 << MAX_PAGE_ORDER size (2MiB) for each
ttm_tt. That lead to a patch to reduce the MAX_PAGE_ORDER to PMD size
order, but  as you might remember, that needed to be ripped out because
the PMD size macros aren't constant across all architectures. IIRC it
was ARM causing compilation failures, and Linus wasn't happy.

So, enter the dynamic allocation which is temporary, and 1/512 of the
size of the memory we need to allocate for the buffer object. IIRC that
was discussed with Matt when he reiewed and we concluded that it should
be ok. I think this approach leads to less memory pressure than if we'd
keep that array around all the time for *all* the allocated bos, and
the allocation is never during reclaim.

Thanks,
Thomas

> 
> Regards,
> Christian.
> 
> > 
> > Thanks,
> > Thomas
> > 
> > 
> > 
> > 
> > > Christian.
> > > 
> > > > /Thomas
> > > > 
> > > > 
> > > > > > For a renaming change that touch all TTM drivers, I'd
> > > > > > rather
> > > > > > put
> > > > > > that
> > > > > > as a last patch since getting acks for that from all TTM
> > > > > > driver
> > > > > > maintainers seems like a hopeless undertaking.
> > > > > Yeah the acks are not the problem, merging it through the xe
> > > > > tree
> > > > > would be.
> > > > > 
> > > > > Christian.
> > > > > 
> > > > > 
> > > > > > /Thomas
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > > Christian.
> > > > > > > 
> > > > > > > > /Thomas
>