On Tue, 2024-12-03 at 15:51 +0100, Christian König wrote:
> On 03.12.24 at 14:42, Thomas Hellström wrote:
> > On Tue, 2024-12-03 at 14:12 +0100, Christian König wrote:
> > > On 15.11.24 at 16:01, Thomas Hellström wrote:
> > > > Provide a helper to shrink ttm_tt page-vectors on a per-page
> > > > basis. A ttm_backup backend could then in theory get away with
> > > > allocating a single temporary page for each struct ttm_tt.
> > > >
> > > > This is accomplished by splitting larger pages before trying to
> > > > back them up.
> > > >
> > > > In the future we could allow ttm_backup to handle backing up
> > > > large pages as well, but currently there's no benefit in
> > > > doing that, since the shmem backup backend would have to
> > > > split those anyway to avoid allocating too much temporary
> > > > memory, and if the backend instead inserts pages into the
> > > > swap-cache, those are split on reclaim by the core.
> > > >
> > > > Due to potential backup and recovery errors, allow partially
> > > > swapped-out struct ttm_tt's, although mark them as swapped out,
> > > > stopping them from being swapped out a second time. More
> > > > details in the ttm_pool.c DOC section.
> > > >
> > > > v2:
> > > > - A couple of cleanups and error fixes in ttm_pool_back_up_tt.
> > > > - s/back_up/backup/
> > > > - Add a writeback parameter to the exported interface.
> > > > v8:
> > > > - Use a struct for flags for readability (Matt Brost)
> > > > - Address misc other review comments (Matt Brost)
> > > > v9:
> > > > - Update the kerneldoc for the ttm_tt::backup field.
> > > > v10:
> > > > - Rebase.
> > > > v13:
> > > > - Rebase on ttm_backup interface change. Update kerneldoc.
> > > > - Rebase and adjust ttm_tt_is_swapped().
> > > >
> > > > Cc: Christian König <christian.koenig@xxxxxxx>
> > > > Cc: Somalapuram Amaranath <Amaranath.Somalapuram@xxxxxxx>
> > > > Cc: Matthew Brost <matthew.brost@xxxxxxxxx>
> > > > Cc: <dri-devel@xxxxxxxxxxxxxxxxxxxxx>
> > > > Signed-off-by: Thomas Hellström <thomas.hellstrom@xxxxxxxxxxxxxxx>
> > > > Reviewed-by: Matthew Brost <matthew.brost@xxxxxxxxx>
> > > > ---
> > > >  drivers/gpu/drm/ttm/ttm_pool.c | 396 +++++++++++++++++++++++++++++++--
> > > >  drivers/gpu/drm/ttm/ttm_tt.c   |  37 +++
> > > >  include/drm/ttm/ttm_pool.h     |   6 +
> > > >  include/drm/ttm/ttm_tt.h       |  32 ++-
> > > >  4 files changed, 457 insertions(+), 14 deletions(-)
> > > >
> > > > diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
> > > > index 8504dbe19c1a..f58864439edb 100644
> > > > --- a/drivers/gpu/drm/ttm/ttm_pool.c
> > > > +++ b/drivers/gpu/drm/ttm/ttm_pool.c
> > > > @@ -41,6 +41,7 @@
> > > >  #include <asm/set_memory.h>
> > > >  #endif
> > > >
> > > > +#include <drm/ttm/ttm_backup.h>
> > > >  #include <drm/ttm/ttm_pool.h>
> > > >  #include <drm/ttm/ttm_tt.h>
> > > >  #include <drm/ttm/ttm_bo.h>
> > > > @@ -58,6 +59,32 @@ struct ttm_pool_dma {
> > > >  	unsigned long vaddr;
> > > >  };
> > > >
> > > > +/**
> > > > + * struct ttm_pool_tt_restore - State representing restore from backup
> > > > + * @alloced_pages: Total number of already allocated pages for the ttm_tt.
> > > > + * @restored_pages: Number of (sub) pages restored from swap for this
> > > > + * chunk of 1 << @order pages.
> > > > + * @first_page: The ttm page ptr corresponding to @old_pages[0].
> > > > + * @caching_divide: Page pointer where subsequent pages are cached.
> > > > + * @old_pages: Backup copy of page pointers that were replaced by the new
> > > > + * page allocation.
> > > > + * @pool: The pool used for page allocation while restoring.
> > > > + * @order: The order of the last page allocated while restoring.
> > > > + *
> > > > + * Recovery from backup might fail when we've recovered less than the
> > > > + * full ttm_tt. In order not to lose any data (yet), keep information
> > > > + * around that allows us to restart a failed ttm backup recovery.
> > > > + */
> > > > +struct ttm_pool_tt_restore {
> > > > +	pgoff_t alloced_pages;
> > > > +	pgoff_t restored_pages;
> > > > +	struct page **first_page;
> > > > +	struct page **caching_divide;
> > > > +	struct ttm_pool *pool;
> > > > +	unsigned int order;
> > > > +	struct page *old_pages[];
> > > > +};
> > > > +
> > > >  static unsigned long page_pool_size;
> > > >
> > > >  MODULE_PARM_DESC(page_pool_size, "Number of pages in the WC/UC/DMA pool");
> > > > @@ -354,11 +381,105 @@ static unsigned int ttm_pool_page_order(struct ttm_pool *pool, struct page *p)
> > > >  	return p->private;
> > > >  }
> > > >
> > > > +/*
> > > > + * To be able to insert single pages into backup directly,
> > > > + * we need to split multi-order page allocations and make them look
> > > > + * like single-page allocations.
> > > > + */
> > > > +static void ttm_pool_split_for_swap(struct ttm_pool *pool, struct page *p)
> > > > +{
> > > > +	unsigned int order = ttm_pool_page_order(pool, p);
> > > > +	pgoff_t nr;
> > > > +
> > > > +	if (!order)
> > > > +		return;
> > > > +
> > > > +	split_page(p, order);
> > >
> > > What exactly should split_page() do here and why is that necessary?
> > >
> > > IIRC that function just updates the reference count and things like
> > > page owner tracking and memcg accounting, which should both be
> > > completely irrelevant here.
> > >
> > > Or do you just do that so that you can free each page individually?
> >
> > Yes, exactly. Like, for a 2MiB page we'd otherwise have to allocate
> > 2MiB of shmem backing storage, potentially from kernel reserves,
> > before we could actually free anything. Since (currently) the shmem
> > objects we use are 4K-page only, this should make the process
> > "allocate shmem and back up" much less likely to deplete the kernel
> > memory reserves.
>
> Ah, yes that makes total sense now.
>
> > Taking a step back and looking at other potential solutions, like
> > direct insertion into the swap cache: even when inserting a 2MiB page
> > into the swap cache, vmscan would split it before writeback, and
> > still it didn't appear very stable. So inserting one 4K page at a
> > time seemed necessary. If I were to take a guess, that's why shmem,
> > when configured for 2MiB pages like with i915, also splits the pages
> > before moving to swap-cache / writeback.
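To make the "one 4K page at a time" point concrete, the backup path
boils down to roughly the following. This is a simplified sketch rather
than the patch code: error unwinding is omitted, and
backup_page_to_shmem() is a hypothetical stand-in for the real
ttm_backup_backup_page() call, whose signature differs.

/*
 * Sketch: back up a 1 << order allocation one 4K subpage at a time.
 * After split_page(), each subpage carries its own refcount and can
 * be freed individually, so peak extra memory is a single temporary
 * shmem page instead of the full 2MiB.
 */
static int backup_chunk_sketch(struct page *p, unsigned int order)
{
	pgoff_t i, nr = 1UL << order;

	split_page(p, order);	/* e.g. 2MiB -> 512 separate 4K pages */

	for (i = 0; i < nr; ++i, ++p) {
		int ret = backup_page_to_shmem(p);	/* hypothetical */

		if (ret)
			return ret;	/* partial backup is tolerated */

		/*
		 * The real code first replaces the page pointer in the
		 * ttm_tt page vector with a backup handle, then frees
		 * the source page ASAP.
		 */
		__free_pages(p, 0);
	}

	return 0;
}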
> > > > +	nr = 1UL << order;
> > > > +	while (nr--)
> > > > +		(p++)->private = 0;
> > > > +}
> > > > +
> > > > +/**
> > > > + * DOC: Partial backup and restoration of a struct ttm_tt.
> > > > + *
> > > > + * Swapout using ttm_backup_backup_page() and swapin using
> > > > + * ttm_backup_copy_page() may fail.
> > > > + * The former most likely due to lack of swap-space or memory, the latter
> > > > + * due to lack of memory or because of signal interruption during waits.
> > > > + *
> > > > + * Backup failure is easily handled by using a ttm_tt pages vector that
> > > > + * holds both swap entries and page pointers. This has to be taken into
> > > > + * account when restoring such a ttm_tt from backup, and when freeing it
> > > > + * while backed up.
> > > > + * When restoring, for simplicity, new pages are actually allocated from
> > > > + * the pool and the contents of any old pages are copied in and then the
> > > > + * old pages are released.
> > > > + *
> > > > + * For restoration failures, the struct ttm_pool_tt_restore holds
> > > > + * sufficient state to be able to resume an interrupted restore, and that
> > > > + * structure is freed once the restoration is complete. If the struct
> > > > + * ttm_tt is destroyed while there is a valid struct ttm_pool_tt_restore
> > > > + * attached, that is also properly taken care of.
> > > > + */
> > > > +
> > > > +static bool ttm_pool_restore_valid(const struct ttm_pool_tt_restore *restore)
> > > > +{
> > > > +	return restore && restore->restored_pages < (1 << restore->order);
> > > > +}
> > > > +
> > > > +static int ttm_pool_restore_tt(struct ttm_pool_tt_restore *restore,
> > > > +			       struct ttm_backup *backup,
> > > > +			       struct ttm_operation_ctx *ctx)
> > > > +{
> > > > +	unsigned int i, nr = 1 << restore->order;
> > > > +	int ret = 0;
> > > > +
> > > > +	if (!ttm_pool_restore_valid(restore))
> > > > +		return 0;
> > > > +
> > > > +	for (i = restore->restored_pages; i < nr; ++i) {
> > > > +		struct page *p = restore->old_pages[i];
> > > > +
> > > > +		if (ttm_backup_page_ptr_is_handle(p)) {
> > > > +			unsigned long handle =
> > > > +				ttm_backup_page_ptr_to_handle(p);
> > > > +
> > > > +			if (handle == 0)
> > > > +				continue;
> > > > +
> > > > +			ret = ttm_backup_copy_page
> > > > +				(backup, restore->first_page[i],
> > > > +				 handle, ctx->interruptible);
> > >
> > > That coding style looks really odd; I didn't even notice that it is
> > > a function call initially.
> > >
> > > Maybe put everything under the if into a separate function.
> >
> > At a minimum, I'll fix up the formatting here.
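For reference, the refactor Christian suggests could look like the
following; a sketch only, with a placeholder helper name, keeping the
handle == 0 check and the restored_pages accounting in the calling loop
so behaviour stays identical:

/* Restore one subpage from its backup handle, then drop the backup. */
static int ttm_pool_restore_from_handle(struct ttm_pool_tt_restore *restore,
					struct ttm_backup *backup,
					unsigned long handle, pgoff_t i,
					bool interruptible)
{
	int ret;

	/* Copy the backed-up contents into the newly allocated page. */
	ret = ttm_backup_copy_page(backup, restore->first_page[i], handle,
				   interruptible);
	if (ret)
		return ret;

	/* Release the backup storage now that the copy succeeded. */
	ttm_backup_drop(backup, handle);
	return 0;
}

The loop body then shrinks to a short call plus error check, and the
oddly wrapped call goes away.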
> > > > +			if (ret)
> > > > +				break;
> > > > +
> > > > +			ttm_backup_drop(backup, handle);
> > > > +		} else if (p) {
> > > > +			/*
> > > > +			 * We could probably avoid splitting the old page
> > > > +			 * using clever logic, but ATM we don't care, as
> > > > +			 * we prioritize releasing memory ASAP. Note that
> > > > +			 * here, the old retained page is always write-back
> > > > +			 * cached.
> > > > +			 */
> > > > +			ttm_pool_split_for_swap(restore->pool, p);
> > > > +			copy_highpage(restore->first_page[i], p);
> > > > +			__free_pages(p, 0);
> > > > +		}
> > > > +
> > > > +		restore->restored_pages++;
> > > > +		restore->old_pages[i] = NULL;
> > > > +		cond_resched();
> > >
> > > There is a push to remove cond_resched(), see here:
> > > https://patchwork.kernel.org/project/linux-mm/patch/20231107230822.371443-30-ankur.a.arora@xxxxxxxxxx/
> > >
> > > Not sure where that removal discussion went, but IIRC we should not
> > > add any new users of it.
> >
> > I'll read up on that and remove if needed. I'm curious how / if
> > voluntary preemption is going to be handled.
>
> I didn't fully understand it either, but the push kind of seems to be
> that drivers, or in this case subsystems, are not supposed to mess with
> cond_resched() any more and should just rely on preemptive kernels.

So I took a deeper look into this. From what I can tell, cond_resched()
is to be replaced by some other implicit preemption mechanism, and it
seems that series is still being worked on; meanwhile there's nothing
ensuring that latency-causing long loops will be preempted. So IMHO it
would be easy to just remove the cond_resched() when that series lands,
and to keep it in the meantime if deemed necessary. But OTOH, the
cond_resched() in this code was added without benchmark justification,
so I have removed it. If needed, it could be re-added pending the merge
of the new preemption code.

Thanks,
Thomas
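PS: For anyone puzzled by the DOC section's mention of a pages vector
holding both swap entries and page pointers: this relies on pointer
tagging. Since struct page pointers are at least word-aligned, the low
bit is free to mark backup handles. The helpers below are an
illustrative sketch; the exact encoding used by ttm_backup may differ.

/*
 * Sketch: tag backup handles in a struct page * vector via the low
 * bit, which is always clear for real (aligned) page pointers.
 */
static inline bool page_ptr_is_handle_sketch(const struct page *page)
{
	return ((unsigned long)page & 1UL) == 1UL;
}

static inline struct page *handle_to_page_ptr_sketch(unsigned long handle)
{
	return (struct page *)((handle << 1) | 1UL);
}

static inline unsigned long page_ptr_to_handle_sketch(const struct page *page)
{
	return (unsigned long)page >> 1;
}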