Hi,
a bug in the SCSI subsystem has been discovered in conjunction with
Fedora 40 that can and will destroy data:
https://bugzilla.redhat.com/show_bug.cgi?id=2326393
All available information has been added to that ticket; here is a short
summary:
Host OS: Proxmox / Debian
Guest OS: Fedora 40
When: right after the upgrade from Fedora 39 to Fedora 40
Type: QEMU
Storage I/O cap: 7 GB/s read (gigabytes, not Gb/s ;) )
Issue: under high I/O load, the SCSI bus crashes in the SCSI subsystem.
BUT: it is not only the kernel. The trigger MUST be a difference in the
distribution between F39 and F40,
BECAUSE: after the upgrade to F40, the F39 kernel, which had worked for 6
months, showed the same issues.
The FIX was to switch the VM to SATA connections for the drives (a dd
copy of the entire disks was required).
...[REPEATING over and over again]...
[Nov14 23:49] sd 2:0:0:0: [sda] tag#76 ABORT operation started
[ +4,898867] sd 2:0:0:0: ABORT operation timed-out.
[ +0,000008] sd 2:0:1:0: [sdb] tag#383 ABORT operation started
[ +4,914194] sd 2:0:1:0: ABORT operation timed-out.
[ +0,000014] sd 2:0:1:0: [sdb] tag#322 ABORT operation started
[ +4,916147] sd 2:0:1:0: ABORT operation timed-out.
[ +0,000007] sd 2:0:1:0: [sdb] tag#321 ABORT operation started
[ +4,914256] sd 2:0:1:0: ABORT operation timed-out.
[ +0,000009] sd 2:0:1:0: [sdb] tag#320 ABORT operation started
[ +4,915142] sd 2:0:1:0: ABORT operation timed-out.
[ +0,000006] sd 2:0:1:0: [sdb] tag#323 ABORT operation started
[ +4,915177] sd 2:0:1:0: ABORT operation timed-out.
[ +0,000007] sd 2:0:1:0: [sdb] tag#324 ABORT operation started
[ +4,915221] sd 2:0:1:0: ABORT operation timed-out.
[ +0,000007] sd 2:0:1:0: [sdb] tag#325 ABORT operation started
[Nov14 23:50] sd 2:0:1:0: ABORT operation timed-out.
[ +0,009594] scsi target2:0:0: TARGET RESET operation started
[ +4,905550] scsi target2:0:0: TARGET RESET operation timed-out.
[ +0,000005] scsi target2:0:1: TARGET RESET operation started
[ +34,407328] scsi target2:0:1: TARGET RESET operation timed-out.
[ +9,829432] sd 2:0:0:0: [sda] tag#76 ABORT operation started
[ +0,002336] sym0: SCSI BUS reset detected.
[ +0,002062] sd 2:0:0:0: ABORT operation complete.
[ +0,005062] sym0: SCSI BUS has been reset.
[ +3,193215] sd 2:0:1:0: Power-on or device reset occurred
[ +0,000057] sd 2:0:0:0: [sda] tag#76 BUS RESET operation started
[ +0,002319] sd 2:0:0:0: BUS RESET operation complete.
[ +0,000008] sym0: SCSI BUS reset detected.
[ +0,006945] sym0: SCSI BUS has been reset.
...[REPEATING over and over again]...
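To get a quick overview of how often each device runs into these error-handler steps, the log above can be tallied with a short script. This is only a rough sketch: the regular expression is written against the exact message format shown in this mail, and the names `EVENT_RE`, `summarize` and `SAMPLE` are made up for illustration. Other drivers or kernel versions may word these messages differently.

```python
import re
from collections import Counter

# Matches SCSI error-handler lines in the format seen in this report, e.g.
#   [ +4,898867] sd 2:0:0:0: ABORT operation timed-out.
#   [ +0,009594] scsi target2:0:0: TARGET RESET operation started
# Assumption: the message wording is exactly as in the excerpt above.
EVENT_RE = re.compile(
    r"\]\s+(?:sd|scsi)\s+(?P<dev>\S+?):\s+"
    r"(?:\[\w+\]\s+tag#\d+\s+)?"
    r"(?P<op>ABORT|TARGET RESET|BUS RESET)\s+operation\s+"
    r"(?P<state>started|timed-out|complete)"
)

# A few lines copied from the log above, used as sample input.
SAMPLE = """
[Nov14 23:49] sd 2:0:0:0: [sda] tag#76 ABORT operation started
[ +4,898867] sd 2:0:0:0: ABORT operation timed-out.
[ +0,000008] sd 2:0:1:0: [sdb] tag#383 ABORT operation started
[ +4,914194] sd 2:0:1:0: ABORT operation timed-out.
[ +0,009594] scsi target2:0:0: TARGET RESET operation started
[ +4,905550] scsi target2:0:0: TARGET RESET operation timed-out.
[ +0,002336] sym0: SCSI BUS reset detected.
[ +0,000057] sd 2:0:0:0: [sda] tag#76 BUS RESET operation started
[ +0,002319] sd 2:0:0:0: BUS RESET operation complete.
"""


def summarize(lines):
    """Count (device, operation, state) triples found in the given log lines."""
    counts = Counter()
    for line in lines:
        match = EVENT_RE.search(line)
        if match:
            key = (match.group("dev"), match.group("op"), match.group("state"))
            counts[key] += 1
    return counts


if __name__ == "__main__":
    for (dev, op, state), n in sorted(summarize(SAMPLE.splitlines()).items()):
        print(f"{dev:14} {op:13} {state:10} {n}")
```

Feeding it the full dmesg output instead of `SAMPLE` should show at a glance whether the timeouts hit one target or all of them.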
There is a longer list in the ticket and on the Fedora dev mailing list, including
write errors:
[ +0,000000] I/O error, dev sdb, sector 41792 op 0x1:(WRITE) flags 0xc800 phys_seg 96 prio class 2
[ +0,000001] I/O error, dev sdb, sector 41792 op 0x1:(WRITE) flags 0xc800 phys_seg 96 prio class 2
[ +0,000002] sd 2:0:1:0: [sdb] tag#395 timing out command, waited 180s
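To find out which parts of a disk were hit by failed writes, the "I/O error" lines can be turned into byte offsets. The following is a minimal sketch, assuming the message format shown above and the kernel convention that these sector numbers count 512-byte units. `IO_ERROR_RE`, `parse_io_errors` and `SAMPLE` are illustrative names, not part of any tool.

```python
import re

# Sector numbers in these kernel messages are counted in 512-byte units.
SECTOR_SIZE = 512

# Matches lines in the format seen in this report, e.g.
#   I/O error, dev sdb, sector 41792 op 0x1:(WRITE) flags 0xc800 ...
IO_ERROR_RE = re.compile(
    r"I/O error, dev (?P<dev>\w+), sector (?P<sector>\d+) "
    r"op 0x[0-9a-fA-F]+:\((?P<op>\w+)\)"
)

# The two error lines quoted above, used as sample input.
SAMPLE = """
[ +0,000000] I/O error, dev sdb, sector 41792 op 0x1:(WRITE) flags 0xc800 phys_seg 96 prio class 2
[ +0,000001] I/O error, dev sdb, sector 41792 op 0x1:(WRITE) flags 0xc800 phys_seg 96 prio class 2
"""


def parse_io_errors(lines):
    """Return a list of (device, operation, sector, byte_offset) tuples."""
    errors = []
    for line in lines:
        match = IO_ERROR_RE.search(line)
        if match:
            sector = int(match.group("sector"))
            errors.append(
                (match.group("dev"), match.group("op"), sector, sector * SECTOR_SIZE)
            )
    return errors


if __name__ == "__main__":
    for dev, op, sector, offset in parse_io_errors(SAMPLE.splitlines()):
        print(f"{dev}: failed {op} at sector {sector} (byte offset {offset})")
```

The log line does not state the length of the failed request, so this only gives the starting offset of each failed operation.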
The disconnects happened as soon as the I/O system came under high pressure:
- tar backups
- database upgrades of the low-level structure
- database access that generated a lot of disk I/O
The drives returned after 5-10 minutes of resetting.
O== Solution for VM users:
Copy the data to hard drives attached to a virtual SATA controller and delete the old SCSI ones. This solved the issue on the spot.
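For anyone who wants to check the copy afterwards, here is a rough sketch of a dd-style copy with verification. It is not the procedure I used (that was plain dd); it only illustrates the idea of copying in chunks and comparing checksums. The function names, the chunk size and the paths in the usage note are assumptions, and the sketch has only been thought through for regular image files.

```python
import hashlib
import os

CHUNK = 4 * 1024 * 1024  # 4 MiB per read; an arbitrary choice


def copy_image(src_path, dst_path, chunk=CHUNK):
    """Copy src to dst chunk by chunk; return (bytes_copied, sha256 of data read)."""
    digest = hashlib.sha256()
    copied = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            block = src.read(chunk)
            if not block:
                break
            dst.write(block)
            digest.update(block)
            copied += len(block)
        dst.flush()
        os.fsync(dst.fileno())  # make sure the data has reached the target
    return copied, digest.hexdigest()


def sha256_prefix(path, length, chunk=CHUNK):
    """Hash only the first `length` bytes, since the target may be larger."""
    digest = hashlib.sha256()
    remaining = length
    with open(path, "rb") as f:
        while remaining > 0:
            block = f.read(min(chunk, remaining))
            if not block:
                break
            digest.update(block)
            remaining -= len(block)
    return digest.hexdigest()


def copy_and_verify(src_path, dst_path, chunk=CHUNK):
    """Copy, then re-read the target and compare checksums."""
    copied, src_hash = copy_image(src_path, dst_path, chunk)
    return copied, src_hash == sha256_prefix(dst_path, copied, chunk)
```

A call would look like `copy_and_verify("old-scsi-disk.raw", "new-sata-disk.raw")`, where both file names are placeholders. Run it only while the guest is shut down, so the source does not change during the copy.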
O== Informed
Fedora Dev ML
Fedora Kernel Team
Proxmox Forum
You