CentOS8 nouveau errors

Bill Gee <bgee@xxxxxxxxxxxxxxx> · Wed, 21 Jul 2021 07:34:16 -0500

I am running CentOS Stream 8 on a system that hosts a number of VirtualBox guests. About a month ago, a problem started showing up. When working at the console, the system is almost completely unresponsive. In a terminal session, a keypress might take a minute to show up, and so might any mouse movement.

About every ten seconds the screen switches to a completely different display for maybe a few tens of milliseconds and then switches back.  This happens so fast it is almost subliminal.

The log fills with tens of thousands of messages per day that look like this:

Jul 21 07:21:06 vmhost2 kernel: Hardware name: Supermicro C7SIM-Q/C7SIM-Q, BIOS 1.2a       06/02/2017
Jul 21 07:21:06 vmhost2 kernel: Workqueue: events_unbound nv50_disp_atomic_commit_work [nouveau]
Jul 21 07:21:06 vmhost2 kernel: RIP: 0010:nv50_dmac_wait+0x1e1/0x230 [nouveau]
Jul 21 07:21:06 vmhost2 kernel: Code: 8d 48 04 48 89 4a 68 c7 00 00 00 00 20 49 8b 46 38 41 c7 86 20 01 00 00 00 00 00 00 49 89 46 68 e8 d4 fc ff ff e9 76 fe ff ff <0f> 0b b8 92 ff ff ff e9 ed fe ff ff 49 8b be 80 00 00 00 e8 b7 fc
Jul 21 07:21:06 vmhost2 kernel: RSP: 0018:ffff9ab3c2077d60 EFLAGS: 00010282
Jul 21 07:21:06 vmhost2 kernel: RAX: ffffffffffffff92 RBX: ffff9ab3c2077d60 RCX: 0000000000000000
Jul 21 07:21:06 vmhost2 kernel: RDX: ffffffffffffff92 RSI: ffff9ab3c2077ca0 RDI: ffff9ab3c2077d40
Jul 21 07:21:06 vmhost2 kernel: RBP: 0000000000000002 R08: 0000000000000000 R09: ffffffffc0365fd0
Jul 21 07:21:06 vmhost2 kernel: R10: 0000000000000006 R11: ffffffffc03733c0 R12: 00000000fffffffb
Jul 21 07:21:06 vmhost2 kernel: R13: ffff8c141b820b68 R14: ffff8c141b820ba8 R15: ffff8c1144d21e00
Jul 21 07:21:06 vmhost2 kernel: FS:  0000000000000000(0000) GS:ffff8c142fd80000(0000) knlGS:0000000000000000
Jul 21 07:21:06 vmhost2 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 21 07:21:06 vmhost2 kernel: CR2: 000055e49fe63ef0 CR3: 00000001a5410000 CR4: 00000000000006e0
Jul 21 07:21:06 vmhost2 kernel: Call Trace:
Jul 21 07:21:06 vmhost2 kernel: base507c_update+0x2f/0x70 [nouveau]
Jul 21 07:21:06 vmhost2 kernel: nv50_disp_atomic_commit_wndw.isra.16+0x5f/0x80 [nouveau]
Jul 21 07:21:06 vmhost2 kernel: nv50_disp_atomic_commit_tail+0x669/0x9b0 [nouveau]
Jul 21 07:21:06 vmhost2 kernel: process_one_work+0x1a7/0x360
Jul 21 07:21:06 vmhost2 kernel: worker_thread+0x30/0x390
Jul 21 07:21:06 vmhost2 kernel: ? create_worker+0x1a0/0x1a0
Jul 21 07:21:06 vmhost2 kernel: kthread+0x116/0x130
Jul 21 07:21:06 vmhost2 kernel: ? kthread_flush_work_fn+0x10/0x10
Jul 21 07:21:06 vmhost2 kernel: ret_from_fork+0x35/0x40
Jul 21 07:21:06 vmhost2 kernel: ---[ end trace 43122b13e20cf558 ]---
Jul 21 07:21:08 vmhost2 kernel: nouveau 0000:01:00.0: DRM: base-0: timeout

The daily logwatch report has thousands of lines like this:

--------------------- Kernel Begin ------------------------ 

 WARNING:  Kernel Errors Present
    WARNING: CPU: 0 PID: 131209 at drivers/gpu/drm/n ...:  201 Time(s)
    WARNING: CPU: 0 PID: 131356 at drivers/gpu/drm/n ...:  450 Time(s)
    WARNING: CPU: 0 PID: 132285 at drivers/gpu/drm/n ...:  3 Time(s)
    WARNING: CPU: 0 PID: 15274 at drivers/gpu/drm/no ...:  1245 Time(s)
    WARNING: CPU: 0 PID: 15808 at drivers/gpu/drm/no ...:  144 Time(s)
    WARNING: CPU: 0 PID: 62 at drivers/gpu/drm/nouve ...:  1080 Time(s)
    WARNING: CPU: 1 PID: 131209 at drivers/gpu/drm/n ...:  336 Time(s)
    WARNING: CPU: 1 PID: 131356 at drivers/gpu/drm/n ...:  828 Time(s)
    WARNING: CPU: 1 PID: 132030 at drivers/gpu/drm/n ...:  18 Time(s)
    WARNING: CPU: 1 PID: 132285 at drivers/gpu/drm/n ...:  12 Time(s)
    WARNING: CPU: 1 PID: 15274 at drivers/gpu/drm/no ...:  903 Time(s)
    WARNING: CPU: 1 PID: 15808 at drivers/gpu/drm/no ...:  108 Time(s)

(This repeats for about 2 megabytes of text.)
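
In case it helps size the problem, here is a minimal Python sketch that totals the "Time(s)" counts from logwatch lines like the ones above, grouped by CPU and by PID. The names `LINE_RE` and `tally_warnings` are made up for illustration, and the pattern assumes every line follows the exact layout shown in this excerpt.

```python
import re
import sys
from collections import Counter

# Matches logwatch summary lines of the form shown above, e.g.
#   WARNING: CPU: 0 PID: 131209 at drivers/gpu/drm/n ...:  201 Time(s)
# (hypothetical pattern, written against this excerpt only)
LINE_RE = re.compile(r"WARNING: CPU: (\d+) PID: (\d+) at .*?:\s+(\d+) Time\(s\)")


def tally_warnings(lines):
    """Sum the 'Time(s)' counts per CPU and per PID; skip non-matching lines."""
    per_cpu = Counter()
    per_pid = Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if m is None:
            continue
        cpu, pid, count = (int(g) for g in m.groups())
        per_cpu[cpu] += count
        per_pid[pid] += count
    return per_cpu, per_pid


if __name__ == "__main__":
    # Read the saved logwatch report from standard input.
    cpu_totals, pid_totals = tally_warnings(sys.stdin)
    print("total warnings:", sum(cpu_totals.values()))
    for cpu, n in sorted(cpu_totals.items()):
        print(f"CPU {cpu}: {n}")
    for pid, n in pid_totals.most_common(5):
        print(f"PID {pid}: {n}")
```

Saved as, say, tally.py, it can be fed a saved copy of the report on standard input. It only counts lines that logwatch has already summarized, so it says nothing about the cause of the warnings.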

With all VirtualBox guests stopped, htop shows CPU usage of 1% or less. Sessions opened by ssh run normally, as do sessions opened to a guest with xrdp. The guests do not report any errors in their log files.

Where do I begin to troubleshoot something like this?

Thanks -
========
Bill Gee
