On Fri, Apr 19, 2024 at 6:27 AM Lyude Paul <lyude@xxxxxxxxxx> wrote:
>
> So - first some context here for Ben and anyone else who hasn't been
> following. A little while ago I got a Slimbook Executive 16 with an
> Nvidia RTX 4060 in it, and I've unfortunately been running into a kind
> of annoying issue. Currently this laptop only has 16 GiB of RAM, and as
> it turns out, that can easily lead to pretty heavy memory fragmentation
> once the system starts swapping pages out.
>
> Normally this wouldn't matter, but I discovered that when we're
> runtime-suspending the GPU in nouveau, we allocate some of the memory
> we use for migration with dma_alloc_coherent(). Once memory
> fragmentation goes up, this starts to fail on my system like so:
>
> kworker/18:0: page allocation failure: order:7, mode:0xcc0(GFP_KERNEL),
> nodemask=(null),cpuset=/,mems_allowed=0
> CPU: 18 PID: 287012 Comm: kworker/18:0 Not tainted 6.8.4-200.ChopperV1.fc39.x86_64 #1
> Hardware name: SLIMBOOK Executive/Executive, BIOS N.1.10GRU06 02/02/2024
> Workqueue: pm pm_runtime_work
> Call Trace:
>  <TASK>
>  dump_stack_lvl+0x47/0x60
>  warn_alloc+0x165/0x1e0
>  ? __alloc_pages_direct_compact+0x1ad/0x2b0
>  __alloc_pages_slowpath.constprop.0+0xd7d/0xde0
>  __alloc_pages+0x32d/0x350
>  __dma_direct_alloc_pages.isra.0+0x16a/0x2b0
>  dma_direct_alloc+0x70/0x280
>  nvkm_gsp_radix3_sg+0x5e/0x130 [nouveau]
>  r535_gsp_fini+0x1d4/0x350 [nouveau]
>  nvkm_subdev_fini+0x67/0x150 [nouveau]
>  nvkm_device_fini+0x95/0x1e0 [nouveau]
>  nvkm_udevice_fini+0x53/0x70 [nouveau]
>  nvkm_object_fini+0xb9/0x240 [nouveau]
>  nvkm_object_fini+0x75/0x240 [nouveau]
>  nouveau_do_suspend+0xf5/0x280 [nouveau]
>  nouveau_pmops_runtime_suspend+0x3e/0xb0 [nouveau]
>  pci_pm_runtime_suspend+0x67/0x1e0
>  ? __pfx_pci_pm_runtime_suspend+0x10/0x10
>  __rpm_callback+0x41/0x170
>  ? __pfx_pci_pm_runtime_suspend+0x10/0x10
>  rpm_callback+0x5d/0x70
>  ? __pfx_pci_pm_runtime_suspend+0x10/0x10
>  rpm_suspend+0x120/0x6a0
>  pm_runtime_work+0x98/0xb0
>  process_one_work+0x171/0x340
>  worker_thread+0x27b/0x3a0
>  ? __pfx_worker_thread+0x10/0x10
>  kthread+0xe5/0x120
>  ? __pfx_kthread+0x10/0x10
>  ret_from_fork+0x31/0x50
>  ? __pfx_kthread+0x10/0x10
>  ret_from_fork_asm+0x1b/0x30
>
> nouveau 0000:01:00.0: gsp: suspend failed, -12
> nouveau: DRM-master:00000000:00000080: suspend failed with -12
> nouveau 0000:01:00.0: can't suspend (nouveau_pmops_runtime_suspend
> [nouveau] returned -12)
>
> Keep in mind, I don't dive into memory-management stuff like this very
> often! But I'd very much like to know how to help out anywhere around
> the driver, including outside of my usual domains, so I've been trying
> to write up a patch for this. The original suggestion for a fix that
> Dave Airlie had given me was (unless I misunderstood, which isn't
> unlikely) to see whether we could get nvkm_gsp_mem_ctor() to allocate
> memory with vmalloc() and map it onto the GPU using the SG helpers
> instead. So I gave a shot at writing up a patch for that:
>
> https://gitlab.freedesktop.org/lyudess/linux/-/commit/b5a41ac2bd948979815d262d8d20b4f3333f9c26
>
> As you can probably guess, the patch doesn't really seem to work, and
> I've been trying to figure out why. There are already a couple of
> issues I'm aware of, the most glaring being that, as Timur pointed out,
> a lot of GSP hardware expects contiguous memory allocations - but
> according to them, the allocation that's specifically failing should be
> small enough that it would land in a contiguous page anyway:

nvkm_gsp_mem_ctor() is used for coherent allocations in a bunch of
places in the GSP code, and we can't use vmalloc() for a lot of them:
many of those allocations are small multi-page buffers that stick
around, and the hardware expects them to be non-scattered.

Now in this one case we have a large amount of data pointed to by a
radix3 page table.
The data is allocated with nvkm_gsp_sg(), and then we fail to allocate
the first level of the page tables with the coherent allocator. However,
I don't think the first level of the page tables actually needs to come
from the coherent allocator; we should allocate it with nvkm_gsp_sg()
instead.

Dave.