Re comments about infinite retry: gotcha, makes sense to me. On Tue, 2022-09-27 at 09:45 +1000, Alistair Popple wrote: > John Hubbard <jhubbard@xxxxxxxxxx> writes: > > > On 9/26/22 14:35, Lyude Paul wrote: > > > > + for (i = 0; i < npages; i++) { > > > > + if (src_pfns[i] & MIGRATE_PFN_MIGRATE) { > > > > + struct page *dpage; > > > > + > > > > + /* > > > > + * _GFP_NOFAIL because the GPU is going away and there > > > > + * is nothing sensible we can do if we can't copy the > > > > + * data back. > > > > + */ > > > > > > You'll have to excuse me for a moment since this area of nouveau isn't one of > > > my strongpoints, but are we sure about this? IIRC __GFP_NOFAIL means infinite > > > retry, in the case of a GPU hotplug event I would assume we would rather just > > > stop trying to migrate things to the GPU and just drop the data instead of > > > hanging on infinite retries. > > > > > No problem, thanks for taking a look! > > > Hi Lyude! > > > > Actually, I really think it's better in this case to keep trying > > (presumably not necessarily infinitely, but only until memory becomes > > available), rather than failing out and corrupting data. > > > > That's because I'm not sure it's completely clear that this memory is > > discardable. And at some point, we're going to make this all work with > > file-backed memory, which will *definitely* not be discardable--I > > realize that we're not there yet, of course. > > > > But here, it's reasonable to commit to just retrying indefinitely, > > really. Memory should eventually show up. And if it doesn't, then > > restarting the machine is better than corrupting data, generally. > > The memory is definitely not discardable here if the migration failed > because that implies it is still mapped into some userspace process. > > We could avoid restarting the machine by doing something similar to what > happens during memory failure and killing every process that maps the > page(s). But overall I think it's better to retry until memory is > available, because that allows things like reclaim to work and in the > worst case allows the OOM killer to select an appropriate task to kill. > It also won't cause data corruption if/when we have file-backed memory. > > > thanks, > -- Cheers, Lyude Paul (she/her) Software Engineer at Red Hat