On 9/14/23 09:29, John David Anglin wrote:
> On 2023-09-13 7:45 p.m., Damien Le Moal wrote:
>> On 9/14/23 06:22, John David Anglin wrote:
>>> On 2023-09-13 1:58 p.m., John David Anglin wrote:
>>>> On 2023-09-12 5:53 p.m., John David Anglin wrote:
>>>>> On 2023-09-10 5:30 p.m., John David Anglin wrote:
>>>>>> Hi Masahiro,
>>>>>>
>>>>>> The attached change fixed boot at ddb5cdbafaaa 😁
>>>>>>
>>>>>> However, v6.5.x boot is still broken:
>>>>>>
>>>>>> Run /init as init process
>>>>>> process '/usr/bin/sh' started with executable stack
>>>>>> Loading, please wait...
>>>>>> Starting systemd-udevd version 254.1-3
>>>>>> e1000 alternatives: applied 0 out of 569 patches
>>>>>> e1000: Intel(R) PRO/1000 Network Driver
>>>>>> e1000: Copyright (c) 1999-2006 Intel Corporation.
>>>>>> scsi_mod alternatives: applied 0 out of 7 patches
>>>>>> SCSI subsystem initialized
>>>>>> usbcore alternatives: applied 0 out of 18 patches
>>>>>> usbcore: registered new interface driver usbfs
>>>>>> libata alternatives: applied 0 out of 3 patches
>>>>>> usbcore: registered new interface driver hub
>>>>>> usbcore: registered new device driver usb
>>>>>> mptbase alternatives: applied 0 out of 73 patches
>>>>>> ehci_hcd alternatives: applied 0 out of 114 patches
>>>>>> sata_sil24 alternatives: applied 0 out of 56 patches
>>>>>> Fusion MPT base driver 3.04.20
>>>>>> Copyright (c) 1999-2008 LSI Corporation
>>>>>> sata_sil24 0000:00:01.0: Applying completion IRQ loss on PCI-X errata fix
>>>>>> scsi host0: sata_sil24
>>>>>> scsi host1: sata_sil24
>>>>>> pata_sil680 0000:60:02.0: sil680: 133MHz clock.
>>>>>> scsi host2: sata_sil24
>>>>>> ehci_pci alternatives: applied 0 out of 2 patches
>>>>>> ohci_hcd alternatives: applied 0 out of 144 patches
>>>>>> ehci-pci 0000:60:01.2: EHCI Host Controller
>>>>>> scsi host3: pata_sil680
>>>>>> ehci-pci 0000:60:01.2: new USB bus registered, assigned bus number 1
>>>>>> scsi host4: sata_sil24
>>>>>> ata1: SATA max UDMA/100 host m128@0xffffffff80088000 port 0xffffffff80080000 ir6
>>>>>> ata2: SATA max UDMA/100 host m128@0xffffffff80088000 port 0xffffffff80082000 ir6
>>>>>> ata3: SATA max UDMA/100 host m128@0xffffffff80088000 port 0xffffffff80084000 ir6
>>>>>> ata4: SATA max UDMA/100 host m128@0xffffffff80088000 port 0xffffffff80086000 ir6
>>>>>> e1000 0000:60:03.0 eth0: (PCI:33MHz:32-bit) 00:11:0a:31:8a:77
>>>>>> ehci-pci 0000:60:01.2: irq 71, io mem 0xffffffffb00a1000
>>>>>> scsi host5: pata_sil680
>>>>>> ata5: PATA max UDMA/133 cmd 0x26058 ctl 0x26064 bmdma 0x26040 irq 72
>>>>>> ata6: PATA max UDMA/133 cmd 0x26050 ctl 0x26060 bmdma 0x26048 irq 72
>>>>>> e1000 0000:60:03.0 eth0: Intel(R) PRO/1000 Network Connection
>>>>>> ehci-pci 0000:60:01.2: USB 2.0 started, EHCI 0.95
>>>>>> usb usb1: New USB device found, idVendor=1d6b, idProduct=0002, bcdDevice= 6.05
>>>>>> usb usb1: New USB device strings: Mfr=3, Product=2, SerialNumber=1
>>>>>> usb usb1: Product: EHCI Host Controller
>>>>>> usb usb1: Manufacturer: Linux 6.5.2-dirty ehci_hcd
>>>>>> usb usb1: SerialNumber: 0000:60:01.2
>>>>>> hub 1-0:1.0: USB hub found
>>>>>> hub 1-0:1.0: 5 ports detected
>>>>>> ata1: SATA link down (SStatus 0 SControl 0)
>>>>>> ata2: SATA link down (SStatus 0 SControl 0)
>>>>>> ata3: SATA link down (SStatus 0 SControl 0)
>>>>>> ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 0)
>>>>>> ata4.00: ATA-10: ST4000VN008-2DR166, SC60, max UDMA/133
>>>>>> ata4.00: 7814037168 sectors, multi 0: LBA48 NCQ (depth 31/32)
>>>>>> ata4.00: configured for UDMA/100
>>>>>> scsi 4:0:0:0: Direct-Access ATA ST4000VN008-2DR1 SC60 PQ: 0 ANSI: 5
>>>>>> ata6.00: ATAPI: HL-DT-STDVD+-RW GSA-H21L, 1.04, max UDMA/44
>>>>>> scsi 5:0:0:0: CD-ROM HL-DT-ST DVD+-RW GSA-H21L 1.04 PQ: 0 ANSI: 5
>>>>>> random: crng init done
>>>>>> Timed out for waiting the udev queue being empty.
>>>>>> Begin: Loading essential drivers ... done.
>>>>>> Begin: Running /scripts/init-premount ... done.
>>>>>> Begin: Mounting root file system ... Begin: Running /scripts/local-top ... done.
>>>>>> Begin: Running /scripts/local-premount ... done.
>>>>>> Timed out for waiting the udev queue being empty.
>>>>>> Begin: Waiting for root file system ... Begin: Running /scripts/local-block ....
>>>>>> Begin: Running /scripts/local-block ... done.
>>>>>> Begin: Running /scripts/local-block ... done.
>>>>>> Begin: Running /scripts/local-block ... done.
>>>>>> Begin: Running /scripts/local-block ... done.
>>>>>> Begin: Running /scripts/local-block ... done.
>>>>>> Begin: Running /scripts/local-block ... done.
>>>>>> Begin: Running /scripts/local-block ... done.
>>>>>> Begin: Running /scripts/local-block ... done.
>>>>>> Begin: Running /scripts/local-block ... done.
>>>>>> Begin: Running /scripts/local-block ... done.
>>>>>> Begin: Running /scripts/local-block ... done.
>>>>>> Begin: Running /scripts/local-block ... done.
>>>>>> Begin: Running /scripts/local-block ... done.
>>>>>> Begin: Running /scripts/local-block ... done.
>>>>>> Begin: Running /scripts/local-block ... done.
>>>>>> Begin: Running /scripts/local-block ... done.
>>>>>> Begin: Running /scripts/local-block ... done.
>>>>>> Begin: Running /scripts/local-block ... done.
>>>>>> done.
>>>>>> Gave up waiting for root file system device. Common problems:
>>>>>> - Boot args (cat /proc/cmdline)
>>>>>> - Check rootdelay= (did the system wait long enough?)
>>>>>> - Missing modules (cat /proc/modules; ls /dev)
>>>>>> ALERT! LABEL=ROOT does not exist. Dropping to a shell!
>>>>>> Rebooting automatically due to panic= boot argument
>>>>>>
>>>>>> I'll see if I can find the commit that breaks 6.5.
>>>>> I've traced this to the following merge commit:
>>>>>
>>>>> dave@atlas:~/linux/linux$ git bisect good
>>>>> ca7ce08d6a063e0ccb91dc57f9bc213120d0d1a7 is the first bad commit
>>>>> commit ca7ce08d6a063e0ccb91dc57f9bc213120d0d1a7
>>>>> Merge: 1546cd4bfda4 af92c02fb209
>>>>> Author: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
>>>>> Date: Fri Jun 30 11:57:07 2023 -0700
>>>>>
>>>>> Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
>>>>>
>>>>> Pull SCSI updates from James Bottomley:
>>>>> "Updates to the usual drivers (ufs, pm80xx, libata-scsi, smartpqi,
>>>>> lpfc, qla2xxx).
>>>>>
>>>>> We have a couple of major core changes impacting other systems:
>>>>>
>>>>> - Command Duration Limits, which spills into block and ATA
>>>>>
>>>>> - block level Persistent Reservation Operations, which touches block,
>>>>> nvme, target and dm
>>>>>
>>>>> Both of these are added with merge commits containing a cover letter
>>>>> explaining what's going on"
>>>>>
>>>>> * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (187 commits)
>>>>> scsi: core: Improve warning message in scsi_device_block()
>>>>> scsi: core: Replace scsi_target_block() with scsi_block_targets()
>>>>> scsi: core: Don't wait for quiesce in scsi_device_block()
>>>>> scsi: core: Don't wait for quiesce in scsi_stop_queue()
>>>>> scsi: core: Merge scsi_internal_device_block() and device_block()
>>>>> scsi: sg: Increase number of devices
>>>>> scsi: bsg: Increase number of devices
>>>>> scsi: qla2xxx: Remove unused nvme_ls_waitq wait queue
>>>>> scsi: ufs: ufs-pci: Add support for Intel Arrow Lake
>>>>> scsi: sd: sd_zbc: Use PAGE_SECTORS_SHIFT
>>>>> scsi: ufs: wb: Add explicit flush_threshold sysfs attribute
>>>>> scsi: ufs: ufs-qcom: Switch to the new ICE API
>>>>> scsi: ufs: dt-bindings: qcom: Add ICE phandle
>>>>> scsi: ufs: ufs-mediatek: Set UFSHCD_QUIRK_MCQ_BROKEN_RTC quirk
>>>>> scsi: ufs: ufs-mediatek: Set UFSHCD_QUIRK_MCQ_BROKEN_INTR quirk
>>>>> scsi: ufs: core: Add host quirk UFSHCD_QUIRK_MCQ_BROKEN_RTC
>>>>> scsi: ufs: core: Add host quirk UFSHCD_QUIRK_MCQ_BROKEN_INTR
>>>>> scsi: ufs: core: Remove dedicated hwq for dev command
>>>>> scsi: ufs: core: mcq: Fix the incorrect OCS value for the device command
>>>>> scsi: ufs: dt-bindings: samsung,exynos: Drop unneeded quotes
>>>>> ...
>>>>>
>>>>> dave@atlas:~/linux/linux$ lspci
>>>>> 00:01.0 RAID bus controller: Silicon Image, Inc. SiI 3124 PCI-X Serial ATA Controller (rev 02)
>>>>> 40:01.0 SCSI storage controller: Broadcom / LSI 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev 07)
>>>>> 40:01.1 SCSI storage controller: Broadcom / LSI 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev 07)
>>>>> 60:01.0 USB controller: NEC Corporation OHCI USB Controller (rev 41)
>>>>> 60:01.1 USB controller: NEC Corporation OHCI USB Controller (rev 41)
>>>>> 60:01.2 USB controller: NEC Corporation uPD72010x USB 2.0 Controller (rev 02)
>>>>> 60:02.0 IDE interface: Silicon Image, Inc. PCI0680 Ultra ATA-133 Host Controller (rev 02)
>>>>> 60:03.0 Ethernet controller: Intel Corporation 82540EM Gigabit Ethernet Controller (rev 02)
>>>> This was introduced by the following commit:
>>>>
>>>> dave@atlas:~/linux/linux$ git bisect good
>>>> 624885209f31eb9985bf51abe204ecbffe2fdeea is the first bad commit
>>>> commit 624885209f31eb9985bf51abe204ecbffe2fdeea
>>>> Author: Damien Le Moal <dlemoal@xxxxxxxxxx>
>>>> Date: Thu May 11 03:13:41 2023 +0200
>>>>
>>>> scsi: core: Detect support for command duration limits
>>>>
>>>> Introduce the function scsi_cdl_check() to detect if a device supports
>>>> command duration limits (CDL). Support for the READ 16, WRITE 16, READ 32
>>>> and WRITE 32 commands are checked using the function scsi_report_opcode()
>>>> to probe the rwcdlp and cdlp bits as they indicate the mode page defining
>>>> the command duration limits descriptors that apply to the command being
>>>> tested.
>>>>
>>>> If any of these commands support CDL, the field cdl_supported of struct
>>>> scsi_device is set to 1 to indicate that the device supports CDL.
>>>>
>>>> Support for CDL for a device is advertizes through sysfs using the new
>>>> cdl_supported device attribute. This attribute value is 1 for a device
>>>> supporting CDL and 0 otherwise.
>>>>
>>>> Signed-off-by: Damien Le Moal <dlemoal@xxxxxxxxxx>
>>>> Reviewed-by: Hannes Reinecke <hare@xxxxxxx>
>>>> Co-developed-by: Niklas Cassel <niklas.cassel@xxxxxxx>
>>>> Signed-off-by: Niklas Cassel <niklas.cassel@xxxxxxx>
>>>> Link: https://lore.kernel.org/r/20230511011356.227789-9-nks@xxxxxxxxxxx
>>>> Signed-off-by: Martin K. Petersen <martin.petersen@xxxxxxxxxx>
>>>>
>>>> Documentation/ABI/testing/sysfs-block-device | 9 ++++
>>>> drivers/scsi/scsi.c | 81 ++++++++++++++++++++++++++++
>>>> drivers/scsi/scsi_scan.c | 3 ++
>>>> drivers/scsi/scsi_sysfs.c | 2 +
>>>> include/scsi/scsi_device.h | 3 ++
>>>> 5 files changed, 98 insertions(+)
>>>>
>>>> Sometimes I see when booting a bad commit:
>>>> [...]
>>>> Begin: Running /scripts/local-block ... done.
>>>> Begin: Running /scripts/local-block ... done.
>>>> Begin: Running /scripts/local-block ... done.
>>>> done.
>>>> Gave up waiting for root file system device. Common problems:
>>>> - Boot args (cat /proc/cmdline)
>>>> - Check rootdelay= (did the system wait long enough?)
>>>> - Missing modules (cat /proc/modules; ls /dev)
>>>> ALERT! LABEL=ROOT does not exist. Dropping to a shell!
>>>> Rebooting automatically due to panic= boot argument
>>>> ata4: SATA link down (SStatus 0 SControl 0)
>>>> ata5: SATA link down (SStatus 0 SControl 0)
>>>> ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 0)
>>>> ata6.00: ATA-10: ST4000VN008-2DR166, SC60, max UDMA/133
>>>> ata6.00: 7814037168 sectors, multi 0: LBA48 NCQ (depth 31/32)
>>>> ata6.00: configured for UDMA/100
>>>> scsi 5:0:0:0: Direct-Access ATA ST4000VN008-2DR1 SC60 PQ: 0 ANSI: 5
>>> System boots master at e56b2b605799 if I disable CDL:
>>>
>>> dave@atlas:~/linux/linux$ git diff drivers/scsi/scsi.c
>>> diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
>>> index d0911bc28663..dc3a283ebd75 100644
>>> --- a/drivers/scsi/scsi.c
>>> +++ b/drivers/scsi/scsi.c
>>> @@ -578,6 +578,8 @@ static bool scsi_cdl_check_cmd(struct scsi_device *sdev, u8 opcode, u16 sa,
>>>         int ret;
>>>         u8 cdlp;
>>>
>>> +       return false;
>>> +
>>>         /* Check operation code */
>>>         ret = scsi_report_opcode(sdev, buf, SCSI_CDL_CHECK_BUF_LEN, opcode, sa);
>>>         if (ret <= 0)
>> It is weird that this solves anything... the MAINTENANCE_IN command issued by
>> scsi_report_opcode() ends up being emulated in libata with
>> ata_scsiop_maint_in(). There are no actual commands issued to the drive, so
>> nothing that could actually fail/cause issues. By the time this is issued, the
>> ATA drive is also fully probed...
>>
>> Or is the drive connected to the Broadcom HBA you have ? In that case, libata is
>> not used and the HBA FW SAT (scsi-ata-translation) is likely to blame.
> /boot, / and swap partitions reside on a ST373207LW drive connected to a Broadcom HBA. A
> ST4000VN008-2DR1 drive is connected to the Silicon Image, Inc. SiI 3124 PCI-X Serial
> ATA Controller. It mounts on /home. There's also a cdrom connected to the Silicon
> Image, Inc. PCI0680 Ultra ATA-133 Host Controller and another ST4000VN008-2DR1 drive
> connected to a Broadcom HBA. There are two Broadcom HBAs.
>
> I think the issue is with the root ST373207LW drive. The console output indicates that the
> ROOT drive doesn't exist when the boot fails.
>
> Your change only appeared to affect actual SCSI drives. That's why I tried disabling CDL.
>>
>> Could you send a full dmesg output for a clean boot and for a failed one so that
>> I can compare ?
> I'll try to get this together tomorrow.

Please also tell me the scsi_level reported for that drive (cat
/sys/block/sdX/device/scsi_level and output of sg_inq /dev/sdX).

Thanks !

>
> Dave
>

-- 
Damien Le Moal
Western Digital Research
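
With several disks spread over different controllers, it may be easier to collect scsi_level for every device in one pass rather than running cat once per sdX. The following is only a minimal sketch: it assumes the sysfs layout named in the request above (/sys/block/<dev>/device/scsi_level), and the helper name read_scsi_levels and its sysfs_root parameter are hypothetical, not part of any kernel or sg3_utils tool. It does not replace the sg_inq output, which still has to be gathered separately.

```python
import os


def read_scsi_levels(sysfs_root="/sys/block"):
    """Return {device_name: scsi_level string} for block devices exposing it.

    sysfs_root is a parameter only so the helper can be pointed at a test
    directory (an assumption for illustration); on a real system the
    default /sys/block is the path to use.
    """
    levels = {}
    if not os.path.isdir(sysfs_root):
        return levels
    for name in sorted(os.listdir(sysfs_root)):
        path = os.path.join(sysfs_root, name, "device", "scsi_level")
        try:
            with open(path) as f:
                levels[name] = f.read().strip()
        except OSError:
            # No scsi_level attribute (loop, ram, ...) or not readable: skip.
            continue
    return levels


if __name__ == "__main__":
    for dev, level in read_scsi_levels().items():
        print(f"{dev}: scsi_level={level}")
```

Devices without a scsi_level attribute are skipped silently, so the output lists only the disks that matter for this report.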