On 2023-09-13 7:45 p.m., Damien Le Moal wrote:
On 9/14/23 06:22, John David Anglin wrote:
On 2023-09-13 1:58 p.m., John David Anglin wrote:
On 2023-09-12 5:53 p.m., John David Anglin wrote:
On 2023-09-10 5:30 p.m., John David Anglin wrote:
Hi Masahiro,
The attached change fixed boot at ddb5cdbafaaa 😁
However, v6.5.x boot is still broken:
Run /init as init process
process '/usr/bin/sh' started with executable stack
Loading, please wait...
Starting systemd-udevd version 254.1-3
e1000 alternatives: applied 0 out of 569 patches
e1000: Intel(R) PRO/1000 Network Driver
e1000: Copyright (c) 1999-2006 Intel Corporation.
scsi_mod alternatives: applied 0 out of 7 patches
SCSI subsystem initialized
usbcore alternatives: applied 0 out of 18 patches
usbcore: registered new interface driver usbfs
libata alternatives: applied 0 out of 3 patches
usbcore: registered new interface driver hub
usbcore: registered new device driver usb
mptbase alternatives: applied 0 out of 73 patches
ehci_hcd alternatives: applied 0 out of 114 patches
sata_sil24 alternatives: applied 0 out of 56 patches
Fusion MPT base driver 3.04.20
Copyright (c) 1999-2008 LSI Corporation
sata_sil24 0000:00:01.0: Applying completion IRQ loss on PCI-X errata fix
scsi host0: sata_sil24
scsi host1: sata_sil24
pata_sil680 0000:60:02.0: sil680: 133MHz clock.
scsi host2: sata_sil24
ehci_pci alternatives: applied 0 out of 2 patches
ohci_hcd alternatives: applied 0 out of 144 patches
ehci-pci 0000:60:01.2: EHCI Host Controller
scsi host3: pata_sil680
ehci-pci 0000:60:01.2: new USB bus registered, assigned bus number 1
scsi host4: sata_sil24
ata1: SATA max UDMA/100 host m128@0xffffffff80088000 port 0xffffffff80080000 ir6
ata2: SATA max UDMA/100 host m128@0xffffffff80088000 port 0xffffffff80082000 ir6
ata3: SATA max UDMA/100 host m128@0xffffffff80088000 port 0xffffffff80084000 ir6
ata4: SATA max UDMA/100 host m128@0xffffffff80088000 port 0xffffffff80086000 ir6
e1000 0000:60:03.0 eth0: (PCI:33MHz:32-bit) 00:11:0a:31:8a:77
ehci-pci 0000:60:01.2: irq 71, io mem 0xffffffffb00a1000
scsi host5: pata_sil680
ata5: PATA max UDMA/133 cmd 0x26058 ctl 0x26064 bmdma 0x26040 irq 72
ata6: PATA max UDMA/133 cmd 0x26050 ctl 0x26060 bmdma 0x26048 irq 72
e1000 0000:60:03.0 eth0: Intel(R) PRO/1000 Network Connection
ehci-pci 0000:60:01.2: USB 2.0 started, EHCI 0.95
usb usb1: New USB device found, idVendor=1d6b, idProduct=0002, bcdDevice= 6.05
usb usb1: New USB device strings: Mfr=3, Product=2, SerialNumber=1
usb usb1: Product: EHCI Host Controller
usb usb1: Manufacturer: Linux 6.5.2-dirty ehci_hcd
usb usb1: SerialNumber: 0000:60:01.2
hub 1-0:1.0: USB hub found
hub 1-0:1.0: 5 ports detected
ata1: SATA link down (SStatus 0 SControl 0)
ata2: SATA link down (SStatus 0 SControl 0)
ata3: SATA link down (SStatus 0 SControl 0)
ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 0)
ata4.00: ATA-10: ST4000VN008-2DR166, SC60, max UDMA/133
ata4.00: 7814037168 sectors, multi 0: LBA48 NCQ (depth 31/32)
ata4.00: configured for UDMA/100
scsi 4:0:0:0: Direct-Access ATA ST4000VN008-2DR1 SC60 PQ: 0 ANSI: 5
ata6.00: ATAPI: HL-DT-STDVD+-RW GSA-H21L, 1.04, max UDMA/44
scsi 5:0:0:0: CD-ROM HL-DT-ST DVD+-RW GSA-H21L 1.04 PQ: 0 ANSI: 5
random: crng init done
Timed out for waiting the udev queue being empty.
Begin: Loading essential drivers ... done.
Begin: Running /scripts/init-premount ... done.
Begin: Mounting root file system ... Begin: Running /scripts/local-top ... done.
Begin: Running /scripts/local-premount ... done.
Timed out for waiting the udev queue being empty.
Begin: Waiting for root file system ... Begin: Running /scripts/local-block ....
Begin: Running /scripts/local-block ... done.
Begin: Running /scripts/local-block ... done.
Begin: Running /scripts/local-block ... done.
Begin: Running /scripts/local-block ... done.
Begin: Running /scripts/local-block ... done.
Begin: Running /scripts/local-block ... done.
Begin: Running /scripts/local-block ... done.
Begin: Running /scripts/local-block ... done.
Begin: Running /scripts/local-block ... done.
Begin: Running /scripts/local-block ... done.
Begin: Running /scripts/local-block ... done.
Begin: Running /scripts/local-block ... done.
Begin: Running /scripts/local-block ... done.
Begin: Running /scripts/local-block ... done.
Begin: Running /scripts/local-block ... done.
Begin: Running /scripts/local-block ... done.
Begin: Running /scripts/local-block ... done.
Begin: Running /scripts/local-block ... done.
done.
Gave up waiting for root file system device. Common problems:
- Boot args (cat /proc/cmdline)
- Check rootdelay= (did the system wait long enough?)
- Missing modules (cat /proc/modules; ls /dev)
ALERT! LABEL=ROOT does not exist. Dropping to a shell!
Rebooting automatically due to panic= boot argument
I'll see if I can find the commit that breaks 6.5.
I've traced this to the following merge commit:
dave@atlas:~/linux/linux$ git bisect good
ca7ce08d6a063e0ccb91dc57f9bc213120d0d1a7 is the first bad commit
commit ca7ce08d6a063e0ccb91dc57f9bc213120d0d1a7
Merge: 1546cd4bfda4 af92c02fb209
Author: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
Date: Fri Jun 30 11:57:07 2023 -0700
Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI updates from James Bottomley:
"Updates to the usual drivers (ufs, pm80xx, libata-scsi, smartpqi,
lpfc, qla2xxx).
We have a couple of major core changes impacting other systems:
- Command Duration Limits, which spills into block and ATA
- block level Persistent Reservation Operations, which touches block,
nvme, target and dm
Both of these are added with merge commits containing a cover letter
explaining what's going on"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (187 commits)
scsi: core: Improve warning message in scsi_device_block()
scsi: core: Replace scsi_target_block() with scsi_block_targets()
scsi: core: Don't wait for quiesce in scsi_device_block()
scsi: core: Don't wait for quiesce in scsi_stop_queue()
scsi: core: Merge scsi_internal_device_block() and device_block()
scsi: sg: Increase number of devices
scsi: bsg: Increase number of devices
scsi: qla2xxx: Remove unused nvme_ls_waitq wait queue
scsi: ufs: ufs-pci: Add support for Intel Arrow Lake
scsi: sd: sd_zbc: Use PAGE_SECTORS_SHIFT
scsi: ufs: wb: Add explicit flush_threshold sysfs attribute
scsi: ufs: ufs-qcom: Switch to the new ICE API
scsi: ufs: dt-bindings: qcom: Add ICE phandle
scsi: ufs: ufs-mediatek: Set UFSHCD_QUIRK_MCQ_BROKEN_RTC quirk
scsi: ufs: ufs-mediatek: Set UFSHCD_QUIRK_MCQ_BROKEN_INTR quirk
scsi: ufs: core: Add host quirk UFSHCD_QUIRK_MCQ_BROKEN_RTC
scsi: ufs: core: Add host quirk UFSHCD_QUIRK_MCQ_BROKEN_INTR
scsi: ufs: core: Remove dedicated hwq for dev command
scsi: ufs: core: mcq: Fix the incorrect OCS value for the device command
scsi: ufs: dt-bindings: samsung,exynos: Drop unneeded quotes
...
dave@atlas:~/linux/linux$ lspci
00:01.0 RAID bus controller: Silicon Image, Inc. SiI 3124 PCI-X Serial ATA Controller (rev 02)
40:01.0 SCSI storage controller: Broadcom / LSI 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev 07)
40:01.1 SCSI storage controller: Broadcom / LSI 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev 07)
60:01.0 USB controller: NEC Corporation OHCI USB Controller (rev 41)
60:01.1 USB controller: NEC Corporation OHCI USB Controller (rev 41)
60:01.2 USB controller: NEC Corporation uPD72010x USB 2.0 Controller (rev 02)
60:02.0 IDE interface: Silicon Image, Inc. PCI0680 Ultra ATA-133 Host Controller (rev 02)
60:03.0 Ethernet controller: Intel Corporation 82540EM Gigabit Ethernet Controller (rev 02)
This was introduced by the following commit:
dave@atlas:~/linux/linux$ git bisect good
624885209f31eb9985bf51abe204ecbffe2fdeea is the first bad commit
commit 624885209f31eb9985bf51abe204ecbffe2fdeea
Author: Damien Le Moal <dlemoal@xxxxxxxxxx>
Date: Thu May 11 03:13:41 2023 +0200
scsi: core: Detect support for command duration limits
Introduce the function scsi_cdl_check() to detect if a device supports
command duration limits (CDL). Support for the READ 16, WRITE 16, READ 32
and WRITE 32 commands are checked using the function scsi_report_opcode()
to probe the rwcdlp and cdlp bits as they indicate the mode page defining
the command duration limits descriptors that apply to the command being
tested.
If any of these commands support CDL, the field cdl_supported of struct
scsi_device is set to 1 to indicate that the device supports CDL.
Support for CDL for a device is advertizes through sysfs using the new
cdl_supported device attribute. This attribute value is 1 for a device
supporting CDL and 0 otherwise.
Signed-off-by: Damien Le Moal <dlemoal@xxxxxxxxxx>
Reviewed-by: Hannes Reinecke <hare@xxxxxxx>
Co-developed-by: Niklas Cassel <niklas.cassel@xxxxxxx>
Signed-off-by: Niklas Cassel <niklas.cassel@xxxxxxx>
Link: https://lore.kernel.org/r/20230511011356.227789-9-nks@xxxxxxxxxxx
Signed-off-by: Martin K. Petersen <martin.petersen@xxxxxxxxxx>
Documentation/ABI/testing/sysfs-block-device | 9 ++++
drivers/scsi/scsi.c | 81 ++++++++++++++++++++++++++++
drivers/scsi/scsi_scan.c | 3 ++
drivers/scsi/scsi_sysfs.c | 2 +
include/scsi/scsi_device.h | 3 ++
5 files changed, 98 insertions(+)
Sometimes I see when booting a bad commit:
[...]
Begin: Running /scripts/local-block ... done.
Begin: Running /scripts/local-block ... done.
Begin: Running /scripts/local-block ... done.
done.
Gave up waiting for root file system device. Common problems:
- Boot args (cat /proc/cmdline)
- Check rootdelay= (did the system wait long enough?)
- Missing modules (cat /proc/modules; ls /dev)
ALERT! LABEL=ROOT does not exist. Dropping to a shell!
Rebooting automatically due to panic= boot argument
ata4: SATA link down (SStatus 0 SControl 0)
ata5: SATA link down (SStatus 0 SControl 0)
ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 0)
ata6.00: ATA-10: ST4000VN008-2DR166, SC60, max UDMA/133
ata6.00: 7814037168 sectors, multi 0: LBA48 NCQ (depth 31/32)
ata6.00: configured for UDMA/100
scsi 5:0:0:0: Direct-Access ATA ST4000VN008-2DR1 SC60 PQ: 0 ANSI: 5
System boots master at e56b2b605799 if I disable CDL:
dave@atlas:~/linux/linux$ git diff drivers/scsi/scsi.c
diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
index d0911bc28663..dc3a283ebd75 100644
--- a/drivers/scsi/scsi.c
+++ b/drivers/scsi/scsi.c
@@ -578,6 +578,8 @@ static bool scsi_cdl_check_cmd(struct scsi_device *sdev, u8 opcode, u16 sa,
int ret;
u8 cdlp;
+ return false;
+
/* Check operation code */
ret = scsi_report_opcode(sdev, buf, SCSI_CDL_CHECK_BUF_LEN, opcode, sa);
if (ret <= 0)
It is weird that this solves anything... the MAINTENANCE_IN command issued by
scsi_report_opcode() ends up being emulated in libata with
ata_scsiop_maint_in(). There are no actual commands issued to the drive, so
nothing that could actually fail/cause issues. By the time this is issued, the
ATA drive is also fully probed...
Or is the drive connected to the Broadcom HBA you have ? In that case, libata is
not used and the HBA FW SAT (scsi-ata-translation) is likely to blame.
/boot, / and swap partitions reside on a ST373207LW drive connected to a Broadcom HBA. A
ST4000VN008-2DR1 drive is connected to the Silicon Image, Inc. SiI 3124 PCI-X Serial
ATA Controller. It mounts on /home. There's also a cdrom connected to the Silicon
Image, Inc. PCI0680 Ultra ATA-133 Host Controller and another ST4000VN008-2DR1 drive
connected to a Broadcom HBA. There are two Broadcom HBAs.
I think the issue is with the root ST373207LW drive. The console output indicates that the
ROOT drive doesn't exist when the boot fails.
Your change only appeared to affect actual SCSI drives. That's why I tried disabling CDL.
Could you send a full dmesg output for a clean boot and for a failed one so that
I can compare ?
I'll try to get this together tomorrow.
Dave
--
John David Anglin dave.anglin@xxxxxxxx