scsi bus disconnect on high load in qemu kvm anno 2024

Marius Schwarz <fedoradev@xxxxxxxxxxxx> · Tue, 19 Nov 2024 13:52:52 +0100

Hi,

A bug in the SCSI subsystem has been discovered in conjunction with
Fedora 40 that can and will destroy data:

https://bugzilla.redhat.com/show_bug.cgi?id=2326393

All available information has been added to that ticket; here is a short
summary:

Host OS: Proxmox / Debian
Guest-OS: Fedora 40
When: right after upgrade to Fedora 40 from 39.
Type: QEMU
Storage I/O cap: 7 GB/s read (gigabytes, not Gb/s ;) )

Issue: high-performance SCSI bus crash in the SCSI subsystem.

BUT: it is not only the kernel. The trigger MUST be a difference in the
distribution between F39 and F40,

BECAUSE: after the upgrade to F40, the F39 kernel, which had worked for 6
months, showed the same issues.

The FIX was to switch the VM to SATA connections for the drives (a dd
copy of the entire disks was required).

...[REPEATING over and over again]...
[Nov14 23:49] sd 2:0:0:0: [sda] tag#76 ABORT operation started
[  +4,898867] sd 2:0:0:0: ABORT operation timed-out.
[  +0,000008] sd 2:0:1:0: [sdb] tag#383 ABORT operation started
[  +4,914194] sd 2:0:1:0: ABORT operation timed-out.
[  +0,000014] sd 2:0:1:0: [sdb] tag#322 ABORT operation started
[  +4,916147] sd 2:0:1:0: ABORT operation timed-out.
[  +0,000007] sd 2:0:1:0: [sdb] tag#321 ABORT operation started
[  +4,914256] sd 2:0:1:0: ABORT operation timed-out.
[  +0,000009] sd 2:0:1:0: [sdb] tag#320 ABORT operation started
[  +4,915142] sd 2:0:1:0: ABORT operation timed-out.
[  +0,000006] sd 2:0:1:0: [sdb] tag#323 ABORT operation started
[  +4,915177] sd 2:0:1:0: ABORT operation timed-out.
[  +0,000007] sd 2:0:1:0: [sdb] tag#324 ABORT operation started
[  +4,915221] sd 2:0:1:0: ABORT operation timed-out.
[  +0,000007] sd 2:0:1:0: [sdb] tag#325 ABORT operation started
[Nov14 23:50] sd 2:0:1:0: ABORT operation timed-out.
[  +0,009594] scsi target2:0:0: TARGET RESET operation started
[  +4,905550] scsi target2:0:0: TARGET RESET operation timed-out.
[  +0,000005] scsi target2:0:1: TARGET RESET operation started
[ +34,407328] scsi target2:0:1: TARGET RESET operation timed-out.
[  +9,829432] sd 2:0:0:0: [sda] tag#76 ABORT operation started
[  +0,002336] sym0: SCSI BUS reset detected.
[  +0,002062] sd 2:0:0:0: ABORT operation complete.
[  +0,005062] sym0: SCSI BUS has been reset.
[  +3,193215] sd 2:0:1:0: Power-on or device reset occurred
[  +0,000057] sd 2:0:0:0: [sda] tag#76 BUS RESET operation started
[  +0,002319] sd 2:0:0:0: BUS RESET operation complete.
[  +0,000008] sym0: SCSI BUS reset detected.
[  +0,006945] sym0: SCSI BUS has been reset.
...[REPEATING over and over again]...
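
To get a quick overview of how often each device runs into these timeouts, the log can be tallied with a few lines of Python. This is only a minimal sketch: the regular expression is derived from the message format in the excerpt above, and the function name `count_timeouts` is made up for illustration. Other drivers or kernel versions may word these messages differently.

```python
import re
from collections import Counter

# Matches lines such as
#   "sd 2:0:1:0: ABORT operation timed-out."
#   "scsi target2:0:1: TARGET RESET operation timed-out."
# The pattern is an assumption based on the excerpt above.
TIMEOUT_RE = re.compile(
    r"(?:sd |scsi target)(?P<addr>\d+(?::\d+)+): "
    r"(?P<op>ABORT|TARGET RESET|BUS RESET) operation timed-out"
)

def count_timeouts(log_text):
    """Return a Counter keyed by (SCSI address, operation)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = TIMEOUT_RE.search(line)
        if match:
            counts[(match.group("addr"), match.group("op"))] += 1
    return counts

# A few lines copied from the log above.
sample = """\
[  +4,898867] sd 2:0:0:0: ABORT operation timed-out.
[  +0,000008] sd 2:0:1:0: [sdb] tag#383 ABORT operation started
[  +4,914194] sd 2:0:1:0: ABORT operation timed-out.
[  +4,916147] sd 2:0:1:0: ABORT operation timed-out.
[ +34,407328] scsi target2:0:1: TARGET RESET operation timed-out.
"""

timeouts = count_timeouts(sample)
for (addr, op), n in sorted(timeouts.items()):
    print(f"{addr} {op}: {n}")
```

Feeding it the full `dmesg` output instead of `sample` shows which targets are affected and whether the timeouts cluster on one drive.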

A longer log is in the ticket and on the Fedora dev mailing list, including
write errors:

[  +0,000000] I/O error, dev sdb, sector 41792 op 0x1:(WRITE) flags 0xc800 phys_seg 96 prio class 2
[  +0,000001] I/O error, dev sdb, sector 41792 op 0x1:(WRITE) flags 0xc800 phys_seg 96 prio class 2
[  +0,000002] sd 2:0:1:0: [sdb] tag#395 timing out command, waited 180s

The disconnects happened as soon as the I/O system came under high pressure:

- tar backups
- database upgrades on the low-level structure
- database access that generated a lot of disk I/O

The drives came back after 5-10 minutes of repeated resets.

O== Solution for VM users:

Copy the data to hard drives attached to a virtual SATA controller and delete the old SCSI ones. This solved the issue on the spot.
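
The copy itself was done with dd. For anyone who wants a checksum on top of the raw copy, the same block-by-block copy can be sketched in Python. This is an illustration under assumptions, not the procedure that was actually used: the function names are hypothetical, the 4 MiB chunk size is arbitrary, and the read-back may be served from the page cache rather than from the disk.

```python
import hashlib
import os

def copy_and_hash(src_path, dst_path, chunk_size=4 * 1024 * 1024):
    """Copy src to dst chunk by chunk; return (bytes copied, SHA-256 of source)."""
    digest = hashlib.sha256()
    copied = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            dst.write(chunk)
            digest.update(chunk)
            copied += len(chunk)
        dst.flush()
        os.fsync(dst.fileno())  # push the data out to the device
    return copied, digest.hexdigest()

def hash_prefix(path, length, chunk_size=4 * 1024 * 1024):
    """SHA-256 of the first `length` bytes (the target may be larger than the source)."""
    digest = hashlib.sha256()
    remaining = length
    with open(path, "rb") as f:
        while remaining > 0:
            chunk = f.read(min(chunk_size, remaining))
            if not chunk:
                break
            digest.update(chunk)
            remaining -= len(chunk)
    return digest.hexdigest()

def clone_disk(src_path, dst_path):
    """Copy src to dst and verify the copy by comparing checksums."""
    copied, src_hash = copy_and_hash(src_path, dst_path)
    if hash_prefix(dst_path, copied) != src_hash:
        raise RuntimeError("verification failed: destination differs from source")
    return copied
```

A call would look like `clone_disk("/dev/sdX", "/dev/sdY")`, where both device names are placeholders. The source must not be mounted read-write during the copy, and the destination is overwritten without asking.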

O== Informed

Fedora Dev ML
Fedora Kernel Team
Proxmox Forum
You