On Fri, Aug 09, 2024 at 03:53:20PM +0200, Thomas Hellström wrote: > On Wed, 2024-08-07 at 23:43 +0000, Matthew Brost wrote: > > On Wed, Jul 03, 2024 at 05:38:11PM +0200, Thomas Hellström wrote: > > > Use fault-injection to test partial TTM swapout and interrupted > > > swapin. > > > Return -EINTR for swapin to test the callers ability to handle and > > > restart the swapin, and on swapout perform a partial swapout to > > > test that > > > the swapin and release_shrunken functionality. > > > > > > Cc: Christian König <christian.koenig@xxxxxxx> > > > Cc: Somalapuram Amaranath <Amaranath.Somalapuram@xxxxxxx> > > > Cc: Matthew Brost <matthew.brost@xxxxxxxxx> > > > Cc: <dri-devel@xxxxxxxxxxxxxxxxxxxxx> > > > Signed-off-by: Thomas Hellström <thomas.hellstrom@xxxxxxxxxxxxxxx> > > > --- > > > drivers/gpu/drm/Kconfig | 10 ++++++++++ > > > drivers/gpu/drm/ttm/ttm_pool.c | 17 ++++++++++++++++- > > > 2 files changed, 26 insertions(+), 1 deletion(-) > > > > > > diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig > > > index fd0749c0c630..9f27271bfab8 100644 > > > --- a/drivers/gpu/drm/Kconfig > > > +++ b/drivers/gpu/drm/Kconfig > > > @@ -272,6 +272,16 @@ config DRM_GPUVM > > > GPU-VM representation providing helpers to manage a GPUs > > > virtual > > > address space > > > > > > +config DRM_TTM_BACKUP_FAULT_INJECT > > > + bool "Enable fault injection during TTM backup" > > > + depends on DRM_TTM > > > + default n > > > + help > > > + Inject recoverable failures during TTM backup and > > > recovery of > > > + backed-up objects. For DRM driver developers only. > > > + > > > + If in doubt, choose N. > > > + > > > config DRM_BUDDY > > > tristate > > > depends on DRM > > > diff --git a/drivers/gpu/drm/ttm/ttm_pool.c > > > b/drivers/gpu/drm/ttm/ttm_pool.c > > > index 38e50cf81b0a..d32a1f2e5e50 100644 > > > --- a/drivers/gpu/drm/ttm/ttm_pool.c > > > +++ b/drivers/gpu/drm/ttm/ttm_pool.c > > > @@ -431,6 +431,7 @@ static int ttm_pool_restore_tt(struct > > > ttm_pool_tt_restore *restore, > > > struct ttm_backup *backup, > > > struct ttm_operation_ctx *ctx) > > > { > > > + static unsigned long __maybe_unused swappedin; > > > unsigned int i, nr = 1 << restore->order; > > > int ret = 0; > > > > > > @@ -446,6 +447,13 @@ static int ttm_pool_restore_tt(struct > > > ttm_pool_tt_restore *restore, > > > if (handle == 0) > > > continue; > > > > > > + if > > > (IS_ENABLED(CONFIG_DRM_TTM_BACKUP_FAULT_INJECT) && > > > + ctx->interruptible && > > > + ++swappedin % 100 == 0) { > > > + ret = -EINTR; > > > + break; > > > + } > > > > So here this -EINTR would be kicked to the user IOCTL which triggered > > the BO validate and retry? The restore then should be able to > > successfully pick up where it left off? > > Yes, that's the point. For the direct swap-cache backend I initially > used (before concluding that the shmem one actually seemed to work > fine), we had an interruptible wait here. > > Supporting interrupts is generally a good thing but for the pool code, > this makes the code considerably more complicated. However, this is a > good way to ensure drivers actually support -EINTR for the call chain. > If not, adding interrupt capability "later" will most likely be a PITA. > > > > > > + > > > ret = backup->ops->copy_backed_up_page > > > (backup, restore->first_page[i], > > > handle, ctx->interruptible); > > > @@ -892,7 +900,14 @@ long ttm_pool_backup_tt(struct ttm_pool *pool, > > > struct ttm_tt *ttm, bool purge, > > > > > > alloc_gfp = GFP_KERNEL | __GFP_HIGH | __GFP_NOWARN | > > > __GFP_RETRY_MAYFAIL; > > > > > > - for (i = 0; i < ttm->num_pages; ++i) { > > > + num_pages = ttm->num_pages; > > > + > > > + /* Pretend doing fault injection by shrinking only half of > > > the pages. */ > > > + > > > + if (IS_ENABLED(CONFIG_DRM_TTM_BACKUP_FAULT_INJECT)) > > > + num_pages = DIV_ROUND_UP(num_pages, 2); > > > > So what happens here? Half the pages swapped out, then upon restore > > half > > swapped back in? The shrinker continues to walk until enough pages > > swapped out? > > Yes, exactly. Ideally we'd want some intermediate state here so that a > partially swapped out bo is still eligible for further shrinking. > Cool, glad my understanding is correct. Having error injection is always a good idea to ensure error paths / corner cases work rather than finding they blow up much later when these cases somehow get triggered. With that: Reviewed-by: Matthew Brost <matthew.brost@xxxxxxxxx> > /Thomas > > > > > > Matt > > > > > + > > > + for (i = 0; i < num_pages; ++i) { > > > page = ttm->pages[i]; > > > if (unlikely(!page)) > > > continue; > > > -- > > > 2.44.0 > > > >