Re: Issues with trying to boot falcons from sgt memory + Possible firmware SG_DEBUG fix?

On 19/4/24 08:14, David Airlie wrote:

On Fri, Apr 19, 2024 at 6:27 AM Lyude Paul <lyude@xxxxxxxxxx> wrote:
So - first some context here for Ben and anyone else who hasn't been
following. A little while ago I got a Slimbook Executive 16 with an
Nvidia RTX 4060 in it, and I've unfortunately been running into a kind
of annoying issue. Currently this laptop only has 16 gigs of ram, and
as it turns out - this can easily lead the system to having pretty
heavy memory fragmentation once it starts swapping pages out.

Normally this wouldn't matter, but I unfortunately discovered that when
we're runtime suspending the GPU in Nouveau - we actually appear to
allocate some of the memory we use for migrating using
dma_alloc_coherent. This starts to fail on my system once memory
fragmentation goes up like so:

   kworker/18:0: page allocation failure: order:7, mode:0xcc0(GFP_KERNEL),
   nodemask=(null),cpuset=/,mems_allowed=0
   CPU: 18 PID: 287012 Comm: kworker/18:0 Not tainted
   6.8.4-200.ChopperV1.fc39.x86_64 #1
   Hardware name: SLIMBOOK Executive/Executive, BIOS N.1.10GRU06 02/02/2024
   Workqueue: pm pm_runtime_work
   Call Trace:
    <TASK>
    dump_stack_lvl+0x47/0x60
    warn_alloc+0x165/0x1e0
    ? __alloc_pages_direct_compact+0x1ad/0x2b0
    __alloc_pages_slowpath.constprop.0+0xd7d/0xde0
    __alloc_pages+0x32d/0x350
    __dma_direct_alloc_pages.isra.0+0x16a/0x2b0
    dma_direct_alloc+0x70/0x280
    nvkm_gsp_radix3_sg+0x5e/0x130 [nouveau]
    r535_gsp_fini+0x1d4/0x350 [nouveau]
    nvkm_subdev_fini+0x67/0x150 [nouveau]
    nvkm_device_fini+0x95/0x1e0 [nouveau]
    nvkm_udevice_fini+0x53/0x70 [nouveau]
    nvkm_object_fini+0xb9/0x240 [nouveau]
    nvkm_object_fini+0x75/0x240 [nouveau]
    nouveau_do_suspend+0xf5/0x280 [nouveau]
    nouveau_pmops_runtime_suspend+0x3e/0xb0 [nouveau]
    pci_pm_runtime_suspend+0x67/0x1e0
    ? __pfx_pci_pm_runtime_suspend+0x10/0x10
    __rpm_callback+0x41/0x170
    ? __pfx_pci_pm_runtime_suspend+0x10/0x10
    rpm_callback+0x5d/0x70
    ? __pfx_pci_pm_runtime_suspend+0x10/0x10
    rpm_suspend+0x120/0x6a0
    pm_runtime_work+0x98/0xb0
    process_one_work+0x171/0x340
    worker_thread+0x27b/0x3a0
    ? __pfx_worker_thread+0x10/0x10
    kthread+0xe5/0x120
    ? __pfx_kthread+0x10/0x10
    ret_from_fork+0x31/0x50
    ? __pfx_kthread+0x10/0x10
    ret_from_fork_asm+0x1b/0x30

   nouveau 0000:01:00.0: gsp: suspend failed, -12
   nouveau: DRM-master:00000000:00000080: suspend failed with -12
   nouveau 0000:01:00.0: can't suspend (nouveau_pmops_runtime_suspend
   [nouveau] returned -12)
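
For anyone decoding the numbers in the trace above, here is a minimal sketch of the arithmetic. It assumes 4 KiB base pages (the x86_64 default, not stated in the log) and uses a made-up helper name, order_to_bytes(). An order-N request asks the page allocator for 2^N physically contiguous pages, and -12 is the negated ENOMEM errno.

```python
import errno

PAGE_SIZE = 4096  # assumption: 4 KiB base pages (x86_64 default)


def order_to_bytes(order: int, page_size: int = PAGE_SIZE) -> int:
    """Size in bytes of one physically contiguous order-`order` allocation."""
    return page_size << order


if __name__ == "__main__":
    size = order_to_bytes(7)
    # order:7 means 128 contiguous pages, i.e. 512 KiB with 4 KiB pages
    print(f"order 7 = {size // PAGE_SIZE} pages = {size // 1024} KiB")
    # the driver reported -12; look up the symbolic name of errno 12
    print(f"errno 12 = {errno.errorcode[12]}")
```

So the suspend path is asking for a single 512 KiB physically contiguous block, which is the kind of request that becomes hard to satisfy once memory is fragmented.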

Keep in mind, I don't dive into memory management related stuff like
this very often! But I'd very much like to know how to help out
anywhere around the driver, including outside of my usual domains, so
I've been trying to write up a patch for this. The original suggestion
for a fix that Dave Airlie had given me was (unless I misunderstood,
which isn't unlikely) to try to see if we could get nvkm_gsp_mem_ctor()
to start allocating memory with vmalloc() and map that onto the GPU
using the SG helpers instead. So - I gave a shot at writing up a patch
for doing that:

https://gitlab.freedesktop.org/lyudess/linux/-/commit/b5a41ac2bd948979815d262d8d20b4f3333f9c26

As you can probably guess - the patch does not really seem to work, and
I've been trying to figure out why. There's already a couple of issues
I'm aware of: the most glaring one being that as Timur pointed out, a
lot of GSP hardware expects contiguous memory allocations - but
according to them the allocation that's specifically failing should be
small enough that it'd be allocated in a contiguous page anyway:

nvkm_gsp_mem_ctor is used to do coherent allocations in a bunch of
places in the gsp code, we can't use vmalloc for a lot of them. A lot
of the allocations are small multi-page and hang around and the
hardware expects allocations to be non-scattered.

Now in this single case we have a large amount of data pointed to by a
radix3 page table.

The data is allocated with nvkm_gsp_sg, then we fail to allocate the
first level of page tables with the coherent allocation. However I
don't think the first level of the page table needs to be allocated
with the coherent allocator, we should allocate it with nvkm_gsp_sg
instead.
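
A rough model of why that table level can need a large contiguous block, offered as a sketch and not as the driver's actual code. It assumes 4 KiB pages, 8-byte entries, and that each of the three levels holds one entry per page of the level below it, rounded up to a whole page. The helper names and the 256 MiB payload are hypothetical; the thread does not state the real data size, and 256 MiB is chosen only because it happens to yield an order-7 table like the one in the trace.

```python
PAGE_SIZE = 4096   # assumption: 4 KiB pages
ENTRY_SIZE = 8     # assumption: one 64-bit address per entry


def align_up(value: int, alignment: int) -> int:
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment


def get_order(size: int) -> int:
    """Smallest order whose 2**order pages cover `size` bytes."""
    pages = (size + PAGE_SIZE - 1) // PAGE_SIZE
    return max(pages - 1, 0).bit_length()


def radix3_level_sizes(data_size: int) -> list[int]:
    """Table sizes in bytes, starting with the level that points at the data."""
    sizes = []
    size = data_size
    for _ in range(3):
        pages = (size + PAGE_SIZE - 1) // PAGE_SIZE
        size = align_up(pages * ENTRY_SIZE, PAGE_SIZE)
        sizes.append(size)
    return sizes


if __name__ == "__main__":
    payload = 256 * 1024 * 1024  # hypothetical payload size
    for size in radix3_level_sizes(payload):
        print(f"{size // 1024} KiB table, order {get_order(size)}")
```

Under these assumptions only the level that points directly at the data is large; the two levels above it fit in a single page each, which is why moving just that one level off the coherent allocator would remove the high-order request.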

Yes, that seems sensible here.  Lyude, did you want me to take a look at making this change, or are you working on it already?

Ben.


Dave.



