scsi bus disconnect on high load in qemu kvm anno 2024

Hi,

a bug in the SCSI subsystem has been discovered in conjunction with Fedora 40 that can and will destroy data:

https://bugzilla.redhat.com/show_bug.cgi?id=2326393

All available information has been added to that ticket; here is a short summary:

Host OS: Proxmox / Debian
Guest-OS: Fedora 40
When: right after upgrade to Fedora 40 from 39.
Type: QEMU
Storage IO Cap: 7 GB/s read (gigabytes, not Gbit/s ;) )

Issue:  High Performance SCSI Bus crash in SCSI Subsystem,

BUT: it's not only the kernel; the trigger MUST be a difference in the distribution between F39 and F40,

BECAUSE: after the upgrade to F40, the F39 kernel, which had worked for 6 months, showed the same issues.

The FIX was to switch the VM to SATA connections for the drives (a dd copy of the entire disks was required).

...[REPEATING over and over again]...
[Nov14 23:49] sd 2:0:0:0: [sda] tag#76 ABORT operation started
[  +4,898867] sd 2:0:0:0: ABORT operation timed-out.
[  +0,000008] sd 2:0:1:0: [sdb] tag#383 ABORT operation started
[  +4,914194] sd 2:0:1:0: ABORT operation timed-out.
[  +0,000014] sd 2:0:1:0: [sdb] tag#322 ABORT operation started
[  +4,916147] sd 2:0:1:0: ABORT operation timed-out.
[  +0,000007] sd 2:0:1:0: [sdb] tag#321 ABORT operation started
[  +4,914256] sd 2:0:1:0: ABORT operation timed-out.
[  +0,000009] sd 2:0:1:0: [sdb] tag#320 ABORT operation started
[  +4,915142] sd 2:0:1:0: ABORT operation timed-out.
[  +0,000006] sd 2:0:1:0: [sdb] tag#323 ABORT operation started
[  +4,915177] sd 2:0:1:0: ABORT operation timed-out.
[  +0,000007] sd 2:0:1:0: [sdb] tag#324 ABORT operation started
[  +4,915221] sd 2:0:1:0: ABORT operation timed-out.
[  +0,000007] sd 2:0:1:0: [sdb] tag#325 ABORT operation started
[Nov14 23:50] sd 2:0:1:0: ABORT operation timed-out.
[  +0,009594] scsi target2:0:0: TARGET RESET operation started
[  +4,905550] scsi target2:0:0: TARGET RESET operation timed-out.
[  +0,000005] scsi target2:0:1: TARGET RESET operation started
[ +34,407328] scsi target2:0:1: TARGET RESET operation timed-out.
[  +9,829432] sd 2:0:0:0: [sda] tag#76 ABORT operation started
[  +0,002336] sym0: SCSI BUS reset detected.
[  +0,002062] sd 2:0:0:0: ABORT operation complete.
[  +0,005062] sym0: SCSI BUS has been reset.
[  +3,193215] sd 2:0:1:0: Power-on or device reset occurred
[  +0,000057] sd 2:0:0:0: [sda] tag#76 BUS RESET operation started
[  +0,002319] sd 2:0:0:0: BUS RESET operation complete.
[  +0,000008] sym0: SCSI BUS reset detected.
[  +0,006945] sym0: SCSI BUS has been reset.
...[REPEATING over and over again]...
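
For anyone who wants to check their own guest logs for the same pattern, here is a minimal sketch that counts the error-handler events shown above (ABORT, TARGET RESET, BUS RESET) and adds up the time spent waiting on timed-out operations. It assumes the log uses the same relative-delta format as the excerpt, including the comma as decimal separator. The function name `summarize` is made up for this example, and the summed delta is only a rough estimate, because each delta measures the time since the previous log line, not the exact duration of the operation.

```python
import re
from collections import Counter

# Relative-delta prefix as in the excerpt, e.g. "[  +4,898867]".
# The comma is the locale's decimal separator; a dot is accepted as well.
DELTA_RE = re.compile(r"^\[\s*\+(\d+)[,.](\d+)\]\s*(.*)$")
# Absolute prefix such as "[Nov14 23:50]"; these lines carry no delta.
ABS_RE = re.compile(r"^\[[A-Z][a-z]{2}\d{1,2} \d{2}:\d{2}\]\s*(.*)$")
# Error-handler messages as they appear in the excerpt.
EVENT_RE = re.compile(
    r"(ABORT|TARGET RESET|BUS RESET) operation (started|timed-out|complete)"
)


def summarize(lines):
    """Return (event counts, seconds of delta on timed-out lines)."""
    counts = Counter()
    stalled = 0.0
    for raw in lines:
        line = raw.strip()
        delta = 0.0
        m = DELTA_RE.match(line)
        if m:
            delta = float(f"{m.group(1)}.{m.group(2)}")
            msg = m.group(3)
        else:
            m = ABS_RE.match(line)
            if not m:
                continue  # not a log line in the expected format
            msg = m.group(1)
        ev = EVENT_RE.search(msg)
        if not ev:
            continue
        counts[(ev.group(1), ev.group(2))] += 1
        if ev.group(2) == "timed-out":
            stalled += delta
    return counts, stalled
```

Feeding it the lines of a saved log file (for example `summarize(open("dmesg.txt"))`, where the file name is just a placeholder) gives a quick overview of how often the error handler escalated and roughly how long the devices were stalled.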


There is a longer list in the ticket and on the Fedora dev mailing list, including
write errors:

[  +0,000000] I/O error, dev sdb, sector 41792 op 0x1:(WRITE) flags 0xc800 phys_seg 96 prio class 2
[  +0,000001] I/O error, dev sdb, sector 41792 op 0x1:(WRITE) flags 0xc800 phys_seg 96 prio class 2
[  +0,000002] sd 2:0:1:0: [sdb] tag#395 timing out command, waited 180s

The disconnects happened as soon as the I/O system came under high pressure:

- tar backups
- database upgrades of the low-level structure
- database access that generated a lot of disk I/O.

The drives came back after 5-10 minutes of resets.

O== Solution for VM users:

Copy the data to hard drives attached to a virtual SATA controller and delete the old SCSI ones. This solved the issue on the spot.
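
Since the original disks may already have taken write errors, it can be worth checking that the copy on the SATA disk matches the source before deleting anything. The following is only a sketch of one way to do that, not part of the procedure described above: it hashes the first `size` bytes of both devices and compares the digests. The function names and the device paths in the usage note are hypothetical. The size limit is there because the new disk may be larger than the old one.

```python
import hashlib


def sha256_of(path, limit=None, chunk_size=1024 * 1024):
    """Hash up to `limit` bytes of `path` (the whole file if limit is None)."""
    h = hashlib.sha256()
    remaining = limit
    with open(path, "rb") as f:
        while remaining is None or remaining > 0:
            n = chunk_size if remaining is None else min(chunk_size, remaining)
            block = f.read(n)
            if not block:
                break  # end of file or device reached
            h.update(block)
            if remaining is not None:
                remaining -= len(block)
    return h.hexdigest()


def copies_match(src, dst, size):
    """True if the first `size` bytes of src and dst are identical."""
    return sha256_of(src, size) == sha256_of(dst, size)
```

A call could look like `copies_match("/dev/sdb", "/dev/sdc", size_in_bytes)`, with both paths and the size replaced by the real values. It needs read access to both devices and only gives a meaningful result while nothing is writing to either of them.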

O== Informed

Fedora Dev ML
Fedora Kernel Team
Proxmox Forum
You






