Hi,

As initially reported by Timo, the QXL driver will crash given enough
workload:
https://lore.kernel.org/regressions/fb0fda6a-3750-4e1b-893f-97a3e402b9af@xxxxxxxxxxxxx/

I initially came across this problem when migrating Debian VMs from
Bullseye to Bookworm. The bug happens somewhat randomly but
consistently, even when just using neovim with plugins or playing a
video. The exception then cascades and makes Xorg crash too.

The error log from dmesg contains `[TTM] Buffer eviction failed`
followed by either `failed to allocate VRAM BO` or
`failed to allocate GEM object`. The error log from Xorg contains
`qxl(0): error doing QXL_ALLOC` followed by a backtrace and a
segmentation fault.

I can confirm the problem still exists in the latest kernel versions:
https://gitlab.freedesktop.org/drm/kernel @ c6d6a82d8a9f
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git @ 1870cdc0e8de

While investigating this issue I ended up writing a script that
triggers the bug within a couple of minutes when executed under
uxterm. YMMV depending on your system; for example, under urxvt the
crashes were not as consistent, likely because it is more efficient
and makes fewer video memory allocations. For me this is the fastest
way to trigger the bug. The script follows:

```
#!/bin/bash

print_gradient_with_awk() {
	local arg="$1"
	if [[ -n $arg ]]; then
		arg=" ($arg)"
	fi
	awk -v arg="$arg" 'BEGIN{
		s="/\\/\\/\\/\\/\\"; s=s s s s s s s s;
		for (colnum = 0; colnum<77; colnum++) {
			r = 255-(colnum*255/76);
			g = (colnum*510/76);
			b = (colnum*255/76);
			if (g>255) g = 510-g;
			printf "\033[48;2;%d;%d;%dm", r,g,b;
			printf "\033[38;2;%d;%d;%dm", 255-r,255-g,255-b;
			printf "%s\033[0m", substr(s,colnum+1,1);
		}
		printf "%s\n", arg;
	}'
}

for i in {1..10000}; do
	print_gradient_with_awk $i
done
```

Timo initially reported:
- commit 5f6c871fe919 ("drm/qxl: properly free qxl releases") as working fine
- commit 5a838e5d5825 ("drm/qxl: simplify qxl_fence_wait") as introducing the bug

The bug occurs whenever a timeout is reached in wait_event_timeout.

To fix this issue I updated the code to include busy wait logic, which
is how the last working version operated. This fixes the bug while
still keeping the code simple (which I suspect was the motivation for
commit 5a838e5d5825 in the first place), as opposed to just reverting
to the last working version at 5f6c871fe919. A rough sketch of the
approach is included further below for reference.

HZ was chosen as the scaling factor for the loop because it is also
used by ttm_bo_wait_ctx, which is one of the indirect callers of
qxl_fence_wait, the other being ttm_bo_delayed_delete.

To confirm the problem no longer manifests I have:
- executed my own test case pasted above
- executed Timo's test case pasted below
- played a video stream in mplayer for 3h (no audio stream because
  apparently pulseaudio and/or alsa have memory leaks that make the
  system run out of memory)

For quick reference here is Timo's script:

```
#!/bin/bash

chvt 3

for j in $(seq 80); do
	echo "$(date) starting round $j"
	if [ "$(journalctl --boot | grep "failed to allocate VRAM BO")" != "" ]; then
		echo "bug was reproduced after $j tries"
		exit 1
	fi
	for i in $(seq 100); do
		dmesg > /dev/tty3
	done
done

echo "bug could not be reproduced"
exit 0
```

From what I could find online, users affected by this problem tend to
just move from QXL to VirtIO, which is why this bug has been hiding
for over 3 years now. This issue was initially reported by Timo 4
months ago but the discussion seems to have stalled.
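For readers who want the gist of the approach without opening the
patch, here is a minimal, untested sketch of the busy-wait shape
described above. It is not the patch itself (see the diff in the
follow-up mail): the function name qxl_fence_wait_sketch, the HZ / 10
wait slice and the exact loop bound are illustrative assumptions only;
qxl_io_notify_oom(), qdev->release_event and the container_of() lookup
are pre-existing driver plumbing.

```
/*
 * Sketch only -- assumes the context of
 * drivers/gpu/drm/qxl/qxl_release.c; not the submitted patch.
 * The idea: instead of failing after a single expired
 * wait_event_timeout(), retry in short slices (scaled by HZ),
 * nudging the device each round so it can flush its releases,
 * until the fence signals or the caller's deadline passes.
 */
static long qxl_fence_wait_sketch(struct dma_fence *fence, bool intr,
				  signed long timeout)
{
	struct qxl_device *qdev;
	unsigned long cur, end = jiffies + timeout;
	long retries;

	qdev = container_of(fence->lock, struct qxl_device, release_lock);

	/* Busy-wait in HZ-scaled slices instead of one long wait. */
	for (retries = 0; retries < timeout / (HZ / 10) + 1; retries++) {
		if (dma_fence_is_signaled(fence))
			goto signaled;

		/* Ask the device to free the releases it still holds. */
		qxl_io_notify_oom(qdev);

		/* Non-zero return means the condition became true. */
		if (wait_event_timeout(qdev->release_event,
				       dma_fence_is_signaled(fence),
				       HZ / 10))
			goto signaled;

		if (time_after(jiffies, end))
			return 0;
	}
	return 0;

signaled:
	cur = jiffies;
	if (cur < end)
		return end - cur;
	return 0;
}
```

The key difference from the code introduced in 5a838e5d5825 is that a
single expired wait_event_timeout() no longer ends the wait; it is
retried until the fence signals or the overall timeout is consumed.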
It would be great if this could be addressed so it does not fall
through the cracks.

Thank you for your time.

---

Alex Constantino (1):
  drm/qxl: fixes qxl_fence_wait

 drivers/gpu/drm/qxl/qxl_release.c | 20 ++++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)


base-commit: 1870cdc0e8dee32e3c221704a2977898ba4c10e8
--
2.39.2