On Mon, Sep 02, 2024 at 01:01:45PM +0200, Christian König wrote:
> On 30.08.24 at 00:12, Matthew Brost wrote:
> > On Thu, Aug 29, 2024 at 01:02:54PM +0200, Daniel Vetter wrote:
> > > On Thu, Aug 29, 2024 at 11:53:58AM +0200, Thomas Hellström wrote:
> > > > But as Sima pointed out in private communication, exhaustive eviction
> > > > is not really needed for faulting to make (crawling) progress.
> > > > Watermarks and VRAM trylock shrinking should suffice, since we're
> > > > strictly only required to service a single gpu page granule at a time.
> > > >
> > > > However, ordinary bo-based jobs would still like to be able to
> > > > completely evict SVM vram. Whether that is important enough to strive
> > > > for is ofc up for discussion.
> > > My take is that you don't win anything for exhaustive eviction by having
> > > the dma_resv somewhere in there for svm allocations. Roughly for split lru
> > > world, where svm ignores bo/dma_resv:
> > >
> > > When evicting vram from the ttm side we'll fairly switch between selecting
> > > bo and throwing out svm pages. With drm_exec/ww_acquire_ctx selecting bo
> > > will eventually succeed in vacuuming up everything (with a few retries
> > > perhaps, if we're not yet at the head of the ww ticket queue).
> > >
> > > svm pages we need to try to evict anyway - there's no guarantee, because
> > > the core mm might be holding temporary page references (which block
> > Yea, but I think you could kill the app then - not suggesting we
> > should but could. To me this is akin to a CPU fault and not being able
> > to migrate the device pages - the migration layer doc says when this
> > happens kick this to user space and segfault the app.
>
> That's most likely a bad idea. That the core holds a temporary page
> reference can happen any time without any bad doing from the application.
> E.g. for direct I/O, swapping etc...
>
> So you can't punish the application with a segfault if you happen to not be
> able to migrate a page because it has a reference.

See my other reply, it even happens as a direct consequence of a 2nd thread
trying to migrate the exact same page from vram to sram. And that really
is a core use case.

So yeah, we really can't SIGBUS on this case.
-Sima
-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch