I didn't file an Issue as I understand the cause and have a successful patch that I've been running here for several days. gitlab.freedesktop.org is RO for me (ie, I can't fork/stage a PR there), so I committed the absurdity of creating a PR on the public gitlab mirror in order to have a URL to point to: https://gitlab.com/freedesktop-mirror/drm/-/merge_requests/1

Copypasting the commit message:

nv50 IB-mode DMA crash fixes

Address the following nv50 IB-mode DMA code errors:

1) nouveau_drm.h labels NV50_DMA_PUSH_MAX_LENGTH the "Maximum push buffer size". Perhaps that was originally true(?), but nothing suggests, guards, or enforces the assumption, and it's not true in practice. 0x7fffff is not the maximum possible push buffer size submitted for IB-mode DMA transfers, nor is it the maximum legal DMA transfer size (it is three bytes too large according to our reverse-engineering docs).

2) nouveau_gem_ioctl_pushbuf() still implicitly treats bit 23 of the pushbuf buffer[].length field as NOUVEAU_GEM_PUSHBUF_NO_PREFETCH, despite previous commits moving the flag out of the length field at the lower levels, as this is an nv50-hardware-specific arrangement. It also conflicts with point 1 above, since pushbuffers can be 1<<23 bytes or longer in practice. (Does the ioctl code in question have any reason to set or expose no_prefetch?)

3) Bits 0 and 1 of the length field are passed through and placed into the IB entry, but they don't signify length in the DMA request: bit 0 is labeled an unknown flag, and bit 1 selects the non_main transfer counter. DMA transfers can only move whole words, despite pushbuffers not being restricted to whole-word lengths. So there are two ways this can bite: setting flags we don't mean to set, and not transferring the trailing bytes of an incomplete final word.
Changes:

This PR makes the following changes toward the single logical change of "don't lock up Quadros when using CAD":

1) NV50_DMA_PUSH_MAX_LENGTH is reduced from 0x7fffff to 0x7ffffc, the maximum value that can actually be encoded in a DMA request.

2) nouveau_gem_ioctl_pushbuf() no longer looks for NOUVEAU_GEM_PUSHBUF_NO_PREFETCH in bit 23 of the length field, and always passes no_prefetch = 0 to nv50_dma_push() when using IB mode.

3) nouveau_gem_ioctl_pushbuf() computes the DMA transfer length as ((push[].length + 0x3) >> 2) << 2, rounding up to enforce complete-word IB-mode transfers that also send trailing bytes.

4) nouveau_gem_ioctl_pushbuf() splits pushbuffers longer than NV50_DMA_PUSH_MAX_LENGTH (as calculated in 3 above) into multiple calls to nv50_dma_push() when using nv50 IB mode.

Although I understand the cause, the patch raises a few other unknowns-to-me I'd like some commentary on:

1) The nv50 DMA code clearly believed push buffers could not (or should not) exceed NV50_DMA_PUSH_MAX_LENGTH, when my analysis logging shows that they regularly exceed this maximum by an order of magnitude or more. Was the assumption originally true? If so, when/why did it change? Is the more correct fix to enforce individual pushbuf size limits?

2) Are any of the documented bit flags that 'hug' the length field (NON_MAIN, NO_PREFETCH, etc) actually used/useful at the level of the GEM API? Right now, two are exposed by accident (because the code thinks the length is byte-granularity when the lower two bits are actually flags) and the third is broken out explicitly--- but it's actually only ever set by a clobber from push[i].length exceeding 0x7fffff.

Monty