On Wed, 2024-12-04 at 11:56 +0100, Christian König wrote: > Am 04.12.24 um 10:56 schrieb Thomas Hellström: > > On Wed, 2024-12-04 at 10:16 +0100, Christian König wrote: > > > Am 03.12.24 um 18:44 schrieb Thomas Hellström: > > > > On Tue, 2024-12-03 at 17:46 +0100, Christian König wrote: > > > > > Am 03.12.24 um 17:43 schrieb Thomas Hellström: > > > > > > On Tue, 2024-12-03 at 17:39 +0100, Christian König wrote: > > > > > > > Am 03.12.24 um 17:31 schrieb Thomas Hellström: > > > > > > > > On Tue, 2024-12-03 at 17:20 +0100, Christian König > > > > > > > > wrote: > > > > > > > > > [SNIP] > > > > > > > > > > > > > > @@ -453,9 +601,36 @@ int > > > > > > > > > > > > > > ttm_pool_alloc(struct > > > > > > > > > > > > > > ttm_pool > > > > > > > > > > > > > > *pool, > > > > > > > > > > > > > > struct ttm_tt *tt, > > > > > > > > > > > > > > else > > > > > > > > > > > > > > gfp_flags |= > > > > > > > > > > > > > > GFP_HIGHUSER; > > > > > > > > > > > > > > > > > > > > > > > > > > > > - for (order = min_t(unsigned int, > > > > > > > > > > > > > > MAX_PAGE_ORDER, > > > > > > > > > > > > > > __fls(num_pages)); > > > > > > > > > > > > > > - num_pages; > > > > > > > > > > > > > > - order = min_t(unsigned int, > > > > > > > > > > > > > > order, > > > > > > > > > > > > > > __fls(num_pages))) > > > > > > > > > > > > > > { > > > > > > > > > > > > > > + order = min_t(unsigned int, > > > > > > > > > > > > > > MAX_PAGE_ORDER, > > > > > > > > > > > > > > __fls(num_pages)); > > > > > > > > > > > > > > + > > > > > > > > > > > > > > + if (tt->page_flags & > > > > > > > > > > > > > > TTM_TT_FLAG_PRIV_BACKED_UP) { > > > > > > > > > > > > > > + if (!tt->restore) { > > > > > > > > > > > > > > + gfp_t gfp = > > > > > > > > > > > > > > GFP_KERNEL > > > > > > > > > > > > > > > > > > > > > > > > > > > > > __GFP_NOWARN; > > > > > > > > > > > > > > + > > > > > > > > > > > > > > + if (ctx- > > > > > > > > > > > > > > > gfp_retry_mayfail) > > > > > > > > > > > > > > + gfp |= > > > > > > > > > > > > > > __GFP_RETRY_MAYFAIL; > > > > > > > > > > > > > > + > > > > > > > > > > > > > > + tt->restore = > > > > > > > > > > > > > > + kvzalloc(s > > > > > > > > > > > > > > truc > > > > > > > > > > > > > > t_si > > > > > > > > > > > > > > ze(t > > > > > > > > > > > > > > t- > > > > > > > > > > > > > > > restore, > > > > > > > > > > > > > > old_pages, > > > > > > > > > > > > > > + > > > > > > > > > > > > > > > > > > > > > > > > > > > > (size_t)1 > > > > > > > > > > > > > > << > > > > > > > > > > > > > > order), gfp); > > > > > > > > > > > > > > + if (!tt->restore) > > > > > > > > > > > > > > + return - > > > > > > > > > > > > > > ENOMEM; > > > > > > > > > > > > > > + } else if > > > > > > > > > > > > > > (ttm_pool_restore_valid(tt- > > > > > > > > > > > > > > > restore)) { > > > > > > > > > > > > > > + struct > > > > > > > > > > > > > > ttm_pool_tt_restore > > > > > > > > > > > > > > *restore = > > > > > > > > > > > > > > tt- > > > > > > > > > > > > > > > restore; > > > > > > > > > > > > > > + > > > > > > > > > > > > > > + num_pages -= > > > > > > > > > > > > > > restore- > > > > > > > > > > > > > > > alloced_pages; > > > > > > > > > > > > > > + order = > > > > > > > > > > > > > > min_t(unsigned > > > > > > > > > > > > > > int, > > > > > > > > > > > > > > order, > > > > > > > > > > > > > > __fls(num_pages)); > > > > > > > > > > > > > > + pages += restore- > > > > > > > > > > > > > > > alloced_pages; > > > > > > > > > > > > > > + r = > > > > > > > > > > > > > > ttm_pool_restore_tt(restore, > > > > > > > > > > > > > > tt- > > > > > > > > > > > > > > > backup, ctx); > > > > > > > > > > > > > > + if (r) > > > > > > > > > > > > > > + return r; > > > > > > > > > > > > > > + caching = restore- > > > > > > > > > > > > > > > caching_divide; > > > > > > > > > > > > > > + } > > > > > > > > > > > > > > + > > > > > > > > > > > > > > + tt->restore->pool = pool; > > > > > > > > > > > > > > + } > > > > > > > > > > > > > Hui? Why is that part of the allocation > > > > > > > > > > > > > function > > > > > > > > > > > > > now? > > > > > > > > > > > > > > > > > > > > > > > > > > At bare minimum I would expect that this is a > > > > > > > > > > > > > new > > > > > > > > > > > > > function. > > > > > > > > > > > > It's because we now have partially backed up > > > > > > > > > > > > tts, > > > > > > > > > > > > so > > > > > > > > > > > > the > > > > > > > > > > > > restore is > > > > > > > > > > > > interleaved on a per-page basis, replacing the > > > > > > > > > > > > backup > > > > > > > > > > > > handles > > > > > > > > > > > > with > > > > > > > > > > > > page-pointers. I'll see if I can separate out > > > > > > > > > > > > at > > > > > > > > > > > > least > > > > > > > > > > > > the > > > > > > > > > > > > initialization here. > > > > > > > > > > > Yeah, that kind of makes sense. > > > > > > > > > > > > > > > > > > > > > > My expectation was just that we now have explicit > > > > > > > > > > > ttm_pool_swapout() > > > > > > > > > > > and > > > > > > > > > > > ttm_pool_swapin() functions. > > > > > > > > > > I fully understand, although in the allocation > > > > > > > > > > step, > > > > > > > > > > that > > > > > > > > > > would > > > > > > > > > > also > > > > > > > > > > increase the memory pressure since we might > > > > > > > > > > momentarily > > > > > > > > > > have > > > > > > > > > > twice > > > > > > > > > > the > > > > > > > > > > bo-size allocated, if the shmem object was never > > > > > > > > > > swapped > > > > > > > > > > out, > > > > > > > > > > and > > > > > > > > > > we > > > > > > > > > > don't want to unnecessarily risc OOM at recover > > > > > > > > > > time, > > > > > > > > > > although > > > > > > > > > > that > > > > > > > > > > should be a recoverable situation now. If the OOM > > > > > > > > > > receiver > > > > > > > > > > can > > > > > > > > > > free > > > > > > > > > > up > > > > > > > > > > system memory resources they can could potentially > > > > > > > > > > restart > > > > > > > > > > the > > > > > > > > > > recover. > > > > > > > > > What I meant was more that we have ttm_pool_swapout() > > > > > > > > > which > > > > > > > > > does > > > > > > > > > a > > > > > > > > > mix > > > > > > > > > of moving each page to a swap backend and freeing one > > > > > > > > > by > > > > > > > > > one. > > > > > > > > > > > > > > > > > > And ttm_pool_swapin() which allocates a bit of memory > > > > > > > > > (usually > > > > > > > > > one > > > > > > > > > huge > > > > > > > > > page) and then copies the content back in from the > > > > > > > > > swap > > > > > > > > > backend. > > > > > > > > > > > > > > > > > > Alternatively we could rename ttm_pool_alloc() into > > > > > > > > > something > > > > > > > > > like > > > > > > > > > ttm_pool_populate() and ttm_pool_free() into > > > > > > > > > ttm_pool_unpopulate(), > > > > > > > > > but > > > > > > > > > those names are not very descriptive either. > > > > > > > > > > > > > > > > > > It's just that we now do a bit more than just alloc > > > > > > > > > and > > > > > > > > > free > > > > > > > > > in > > > > > > > > > those > > > > > > > > > functions, so the naming doesn't really match that > > > > > > > > > well > > > > > > > > > any > > > > > > > > > more. > > > > > > > > So what about ttm_pool_alloc() and > > > > > > > > ttm_pool_recover/swapin(), > > > > > > > > both > > > > > > > > pointing to the same code, but _alloc() asserts that > > > > > > > > the tt > > > > > > > > isn't > > > > > > > > backed up? > > > > > > > > > > > > > > > > That would give a clean interface at least. > > > > > > > More or less ok. I would just put figuring out the gfp > > > > > > > flags > > > > > > > and > > > > > > > the > > > > > > > stuff inside the for (order... loop into separate > > > > > > > functions. > > > > > > > And > > > > > > > then > > > > > > > remove the if (tt->page_flags & > > > > > > > TTM_TT_FLAG_PRIV_BACKED_UP) > > > > > > > from > > > > > > > the > > > > > > > pool. > > > > > > > > > > > > > > In other words you trigger the back restore by calling a > > > > > > > different > > > > > > > function than the allocation one. > > > > > > I'll take a look at this as well. > > > > > Ah, and BTW: It's perfectly possible that ttm_tt_free() is > > > > > called > > > > > because a halve swapped TT is about to be destroyed! > > > > > > > > > > If I'm not completely mistaken that is not handled gracefully > > > > > when we > > > > > try to always backup from in the ttm_tt_free() function. > > > > > > > > > > So we clearly need the separation of move this TT to a backup > > > > > (and > > > > > eventually only partially) and freeing it. > > > > Hm. I'm not sure I follow completely. > > > > > > > > The ttm_pool interface is currently: > > > > > > > > ttm_pool_alloc() -> allocs and may recover from backup. May > > > > leave > > > > partially backed up. Called from ttm_tt_populate() or its > > > > driver > > > > callbacks. > > > Yeah that this is done by a single function looks really strange > > > to > > > me. > > > > > > > ttm_pool_backup_tt() -> Attempts to back up (the not already > > > > backed > > > > up > > > > part of a tt. Called from ttm_tt_backup(), which is just a tt > > > > layer > > > > wrapper. If called with purge==true, then frees memory > > > > bypassing > > > > the > > > > pool to return it to the system directly. > > > > > > > > ttm_pool_free() -> Frees a (potentially backed up or partially > > > > backed > > > > up) tt. Called from ttm_tt_unpopulate() or its driver > > > > callbacks. > > > Ah! I missed that you have separated that functionality from the > > > free > > > path. > > > > > > I've only saw the allocation path and though I need to clear that > > > up > > > first. > > > > > > > So the backup functionality is implemented with a minimal > > > > change to > > > > upper layers, and I don't think there is a correctness problem > > > > on > > > > free(). > > > > > > > > So could you clarify a bit if it is this interface you think > > > > needs > > > > changing or that the implementation is better at separating out > > > > the > > > > backup functionality from the pool functionality? > > > I think we should just make the ttm pool object take charge of > > > allocation, backup, restore and free operation on the TT objects. > > > > > > And all of those are separate operations, they just internally > > > share > > > steps to archive what they should do. > > So are we looking at an interface change like: > > > > ttm_pool_alloc() // no recover. Errors if backed-up-data present. > > ttm_pool_alloc_and_recover() // because you can't alloc first and > > then > > recover in a memory-efficient manner, since you need to interleave. > > ttm_pool_backup() // as currently > > ttm_pool_drop_backed_up() //drops the backed-up data if any. > > ttm_pool_free() // frees all data. errors if backed-up-data > > present. > > > > Is this what you mean? > > Yes, exactly that. OK, then sure I'll update. > > > > > > BTW I really dislike that tt->restore is allocated dynamically. > > > That > > > is > > > just another allocation which can cause problems. > > > We should probably have all the state necessary for the operation > > > in > > > the > > > TT object. > > Initially it was done this way. But that meant a pre-allocated > > struct > > page-pointer array the of 1 << MAX_PAGE_ORDER size (2MiB) for each > > ttm_tt. That lead to a patch to reduce the MAX_PAGE_ORDER to PMD > > size > > order, but as you might remember, that needed to be ripped out > > because > > the PMD size macros aren't constant across all architectures. IIRC > > it > > was ARM causing compilation failures, and Linus wasn't happy. > > Yeah, I do remember that. But I don't fully get why you need this > page-pointer array in the first place? So the TTM page-pointer array holds the backup handles when backed up. During recovery, We allocate a (potentially huge) page and populate the TTM page-pointer array with pointers into that. Meanwhile we need to keep the backup handles for the recover phase in the restore structure, and in the middle of the recover phase you might hit an -EINTR. Thanks, Thomas > > > > > So, enter the dynamic allocation which is temporary, and 1/512 of > > the > > size of the memory we need to allocate for the buffer object. IIRC > > that > > was discussed with Matt when he reiewed and we concluded that it > > should > > be ok. I think this approach leads to less memory pressure than if > > we'd > > keep that array around all the time for *all* the allocated bos, > > and > > the allocation is never during reclaim. > > Hui? How do you avoid having to allocate that during reclaim? > > I absolutely don't see that on the code currently. During reclaim we back up only. When this allocation happens we're about to recover, which means we are not in reclaim. /Thomas > > Regards, > Christian. > > > > > Thanks, > > Thomas > > > > > > > > > Regards, > > > Christian. > > > > > > > Thanks, > > > > Thomas > > > > > > > > > > > > > > > > > > > > > Christian. > > > > > > > > > > > /Thomas > > > > > > > > > > > > > > > > > > > > For a renaming change that touch all TTM drivers, I'd > > > > > > > > rather > > > > > > > > put > > > > > > > > that > > > > > > > > as a last patch since getting acks for that from all > > > > > > > > TTM > > > > > > > > driver > > > > > > > > maintainers seems like a hopeless undertaking. > > > > > > > Yeah the acks are not the problem, merging it through the > > > > > > > xe > > > > > > > tree > > > > > > > would be. > > > > > > > > > > > > > > Christian. > > > > > > > > > > > > > > > > > > > > > > /Thomas > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Christian. > > > > > > > > > > > > > > > > > > > /Thomas >