On 9/14/23 06:22, John David Anglin wrote:
> On 2023-09-13 1:58 p.m., John David Anglin wrote:
>> On 2023-09-12 5:53 p.m., John David Anglin wrote:
>>> On 2023-09-10 5:30 p.m., John David Anglin wrote:
>>>> Hi Masahiro,
>>>>
>>>> The attached change fixed boot at ddb5cdbafaaa 😁
>>>>
>>>> However, v6.5.x boot is still broken:
>>>>
>>>> Run /init as init process
>>>> process '/usr/bin/sh' started with executable stack
>>>> Loading, please wait...
>>>> Starting systemd-udevd version 254.1-3
>>>> e1000 alternatives: applied 0 out of 569 patches
>>>> e1000: Intel(R) PRO/1000 Network Driver
>>>> e1000: Copyright (c) 1999-2006 Intel Corporation.
>>>> scsi_mod alternatives: applied 0 out of 7 patches
>>>> SCSI subsystem initialized
>>>> usbcore alternatives: applied 0 out of 18 patches
>>>> usbcore: registered new interface driver usbfs
>>>> libata alternatives: applied 0 out of 3 patches
>>>> usbcore: registered new interface driver hub
>>>> usbcore: registered new device driver usb
>>>> mptbase alternatives: applied 0 out of 73 patches
>>>> ehci_hcd alternatives: applied 0 out of 114 patches
>>>> sata_sil24 alternatives: applied 0 out of 56 patches
>>>> Fusion MPT base driver 3.04.20
>>>> Copyright (c) 1999-2008 LSI Corporation
>>>> sata_sil24 0000:00:01.0: Applying completion IRQ loss on PCI-X errata fix
>>>> scsi host0: sata_sil24
>>>> scsi host1: sata_sil24
>>>> pata_sil680 0000:60:02.0: sil680: 133MHz clock.
>>>> scsi host2: sata_sil24
>>>> ehci_pci alternatives: applied 0 out of 2 patches
>>>> ohci_hcd alternatives: applied 0 out of 144 patches
>>>> ehci-pci 0000:60:01.2: EHCI Host Controller
>>>> scsi host3: pata_sil680
>>>> ehci-pci 0000:60:01.2: new USB bus registered, assigned bus number 1
>>>> scsi host4: sata_sil24
>>>> ata1: SATA max UDMA/100 host m128@0xffffffff80088000 port 0xffffffff80080000 ir6
>>>> ata2: SATA max UDMA/100 host m128@0xffffffff80088000 port 0xffffffff80082000 ir6
>>>> ata3: SATA max UDMA/100 host m128@0xffffffff80088000 port 0xffffffff80084000 ir6
>>>> ata4: SATA max UDMA/100 host m128@0xffffffff80088000 port 0xffffffff80086000 ir6
>>>> e1000 0000:60:03.0 eth0: (PCI:33MHz:32-bit) 00:11:0a:31:8a:77
>>>> ehci-pci 0000:60:01.2: irq 71, io mem 0xffffffffb00a1000
>>>> scsi host5: pata_sil680
>>>> ata5: PATA max UDMA/133 cmd 0x26058 ctl 0x26064 bmdma 0x26040 irq 72
>>>> ata6: PATA max UDMA/133 cmd 0x26050 ctl 0x26060 bmdma 0x26048 irq 72
>>>> e1000 0000:60:03.0 eth0: Intel(R) PRO/1000 Network Connection
>>>> ehci-pci 0000:60:01.2: USB 2.0 started, EHCI 0.95
>>>> usb usb1: New USB device found, idVendor=1d6b, idProduct=0002, bcdDevice= 6.05
>>>> usb usb1: New USB device strings: Mfr=3, Product=2, SerialNumber=1
>>>> usb usb1: Product: EHCI Host Controller
>>>> usb usb1: Manufacturer: Linux 6.5.2-dirty ehci_hcd
>>>> usb usb1: SerialNumber: 0000:60:01.2
>>>> hub 1-0:1.0: USB hub found
>>>> hub 1-0:1.0: 5 ports detected
>>>> ata1: SATA link down (SStatus 0 SControl 0)
>>>> ata2: SATA link down (SStatus 0 SControl 0)
>>>> ata3: SATA link down (SStatus 0 SControl 0)
>>>> ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 0)
>>>> ata4.00: ATA-10: ST4000VN008-2DR166, SC60, max UDMA/133
>>>> ata4.00: 7814037168 sectors, multi 0: LBA48 NCQ (depth 31/32)
>>>> ata4.00: configured for UDMA/100
>>>> scsi 4:0:0:0: Direct-Access ATA ST4000VN008-2DR1 SC60 PQ: 0 ANSI: 5
>>>> ata6.00: ATAPI: HL-DT-STDVD+-RW GSA-H21L, 1.04, max UDMA/44
>>>> scsi 5:0:0:0: CD-ROM HL-DT-ST DVD+-RW GSA-H21L 1.04 PQ: 0 ANSI: 5
>>>> random: crng init done
>>>> Timed out for waiting the udev queue being empty.
>>>> Begin: Loading essential drivers ... done.
>>>> Begin: Running /scripts/init-premount ... done.
>>>> Begin: Mounting root file system ... Begin: Running /scripts/local-top ... done.
>>>> Begin: Running /scripts/local-premount ... done.
>>>> Timed out for waiting the udev queue being empty.
>>>> Begin: Waiting for root file system ... Begin: Running /scripts/local-block ....
>>>> Begin: Running /scripts/local-block ... done.
>>>> Begin: Running /scripts/local-block ... done.
>>>> Begin: Running /scripts/local-block ... done.
>>>> Begin: Running /scripts/local-block ... done.
>>>> Begin: Running /scripts/local-block ... done.
>>>> Begin: Running /scripts/local-block ... done.
>>>> Begin: Running /scripts/local-block ... done.
>>>> Begin: Running /scripts/local-block ... done.
>>>> Begin: Running /scripts/local-block ... done.
>>>> Begin: Running /scripts/local-block ... done.
>>>> Begin: Running /scripts/local-block ... done.
>>>> Begin: Running /scripts/local-block ... done.
>>>> Begin: Running /scripts/local-block ... done.
>>>> Begin: Running /scripts/local-block ... done.
>>>> Begin: Running /scripts/local-block ... done.
>>>> Begin: Running /scripts/local-block ... done.
>>>> Begin: Running /scripts/local-block ... done.
>>>> Begin: Running /scripts/local-block ... done.
>>>> done.
>>>> Gave up waiting for root file system device. Common problems:
>>>> - Boot args (cat /proc/cmdline)
>>>> - Check rootdelay= (did the system wait long enough?)
>>>> - Missing modules (cat /proc/modules; ls /dev)
>>>> ALERT! LABEL=ROOT does not exist. Dropping to a shell!
>>>> Rebooting automatically due to panic= boot argument
>>>>
>>>> I'll see if I can find the commit that breaks 6.5.
>>> I've traced this to the following merge commit:
>>>
>>> dave@atlas:~/linux/linux$ git bisect good
>>> ca7ce08d6a063e0ccb91dc57f9bc213120d0d1a7 is the first bad commit
>>> commit ca7ce08d6a063e0ccb91dc57f9bc213120d0d1a7
>>> Merge: 1546cd4bfda4 af92c02fb209
>>> Author: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
>>> Date: Fri Jun 30 11:57:07 2023 -0700
>>>
>>>     Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
>>>
>>>     Pull SCSI updates from James Bottomley:
>>>      "Updates to the usual drivers (ufs, pm80xx, libata-scsi, smartpqi,
>>>       lpfc, qla2xxx).
>>>
>>>       We have a couple of major core changes impacting other systems:
>>>
>>>        - Command Duration Limits, which spills into block and ATA
>>>
>>>        - block level Persistent Reservation Operations, which touches block,
>>>          nvme, target and dm
>>>
>>>       Both of these are added with merge commits containing a cover letter
>>>       explaining what's going on"
>>>
>>>     * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (187 commits)
>>>       scsi: core: Improve warning message in scsi_device_block()
>>>       scsi: core: Replace scsi_target_block() with scsi_block_targets()
>>>       scsi: core: Don't wait for quiesce in scsi_device_block()
>>>       scsi: core: Don't wait for quiesce in scsi_stop_queue()
>>>       scsi: core: Merge scsi_internal_device_block() and device_block()
>>>       scsi: sg: Increase number of devices
>>>       scsi: bsg: Increase number of devices
>>>       scsi: qla2xxx: Remove unused nvme_ls_waitq wait queue
>>>       scsi: ufs: ufs-pci: Add support for Intel Arrow Lake
>>>       scsi: sd: sd_zbc: Use PAGE_SECTORS_SHIFT
>>>       scsi: ufs: wb: Add explicit flush_threshold sysfs attribute
>>>       scsi: ufs: ufs-qcom: Switch to the new ICE API
>>>       scsi: ufs: dt-bindings: qcom: Add ICE phandle
>>>       scsi: ufs: ufs-mediatek: Set UFSHCD_QUIRK_MCQ_BROKEN_RTC quirk
>>>       scsi: ufs: ufs-mediatek: Set UFSHCD_QUIRK_MCQ_BROKEN_INTR quirk
>>>       scsi: ufs: core: Add host quirk UFSHCD_QUIRK_MCQ_BROKEN_RTC
>>>       scsi: ufs: core: Add host quirk UFSHCD_QUIRK_MCQ_BROKEN_INTR
>>>       scsi: ufs: core: Remove dedicated hwq for dev command
>>>       scsi: ufs: core: mcq: Fix the incorrect OCS value for the device command
>>>       scsi: ufs: dt-bindings: samsung,exynos: Drop unneeded quotes
>>>       ...
>>>
>>> dave@atlas:~/linux/linux$ lspci
>>> 00:01.0 RAID bus controller: Silicon Image, Inc. SiI 3124 PCI-X Serial ATA Controller (rev 02)
>>> 40:01.0 SCSI storage controller: Broadcom / LSI 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev 07)
>>> 40:01.1 SCSI storage controller: Broadcom / LSI 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev 07)
>>> 60:01.0 USB controller: NEC Corporation OHCI USB Controller (rev 41)
>>> 60:01.1 USB controller: NEC Corporation OHCI USB Controller (rev 41)
>>> 60:01.2 USB controller: NEC Corporation uPD72010x USB 2.0 Controller (rev 02)
>>> 60:02.0 IDE interface: Silicon Image, Inc. PCI0680 Ultra ATA-133 Host Controller (rev 02)
>>> 60:03.0 Ethernet controller: Intel Corporation 82540EM Gigabit Ethernet Controller (rev 02)
>> This was introduced by the following commit:
>>
>> dave@atlas:~/linux/linux$ git bisect good
>> 624885209f31eb9985bf51abe204ecbffe2fdeea is the first bad commit
>> commit 624885209f31eb9985bf51abe204ecbffe2fdeea
>> Author: Damien Le Moal <dlemoal@xxxxxxxxxx>
>> Date: Thu May 11 03:13:41 2023 +0200
>>
>>     scsi: core: Detect support for command duration limits
>>
>>     Introduce the function scsi_cdl_check() to detect if a device supports
>>     command duration limits (CDL). Support for the READ 16, WRITE 16, READ 32
>>     and WRITE 32 commands are checked using the function scsi_report_opcode()
>>     to probe the rwcdlp and cdlp bits as they indicate the mode page defining
>>     the command duration limits descriptors that apply to the command being
>>     tested.
>>
>>     If any of these commands support CDL, the field cdl_supported of struct
>>     scsi_device is set to 1 to indicate that the device supports CDL.
>>
>>     Support for CDL for a device is advertizes through sysfs using the new
>>     cdl_supported device attribute. This attribute value is 1 for a device
>>     supporting CDL and 0 otherwise.
>>
>>     Signed-off-by: Damien Le Moal <dlemoal@xxxxxxxxxx>
>>     Reviewed-by: Hannes Reinecke <hare@xxxxxxx>
>>     Co-developed-by: Niklas Cassel <niklas.cassel@xxxxxxx>
>>     Signed-off-by: Niklas Cassel <niklas.cassel@xxxxxxx>
>>     Link: https://lore.kernel.org/r/20230511011356.227789-9-nks@xxxxxxxxxxx
>>     Signed-off-by: Martin K. Petersen <martin.petersen@xxxxxxxxxx>
>>
>>  Documentation/ABI/testing/sysfs-block-device | 9 ++++
>>  drivers/scsi/scsi.c | 81 ++++++++++++++++++++++++++++
>>  drivers/scsi/scsi_scan.c | 3 ++
>>  drivers/scsi/scsi_sysfs.c | 2 +
>>  include/scsi/scsi_device.h | 3 ++
>>  5 files changed, 98 insertions(+)
>>
>> Sometimes I see when booting a bad commit:
>> [...]
>> Begin: Running /scripts/local-block ... done.
>> Begin: Running /scripts/local-block ... done.
>> Begin: Running /scripts/local-block ... done.
>> done.
>> Gave up waiting for root file system device. Common problems:
>> - Boot args (cat /proc/cmdline)
>> - Check rootdelay= (did the system wait long enough?)
>> - Missing modules (cat /proc/modules; ls /dev)
>> ALERT! LABEL=ROOT does not exist. Dropping to a shell!
>> Rebooting automatically due to panic= boot argument
>> ata4: SATA link down (SStatus 0 SControl 0)
>> ata5: SATA link down (SStatus 0 SControl 0)
>> ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 0)
>> ata6.00: ATA-10: ST4000VN008-2DR166, SC60, max UDMA/133
>> ata6.00: 7814037168 sectors, multi 0: LBA48 NCQ (depth 31/32)
>> ata6.00: configured for UDMA/100
>> scsi 5:0:0:0: Direct-Access ATA ST4000VN008-2DR1 SC60 PQ: 0 ANSI: 5
>
> System boots master at e56b2b605799 if I disable CDL:
>
> dave@atlas:~/linux/linux$ git diff drivers/scsi/scsi.c
> diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
> index d0911bc28663..dc3a283ebd75 100644
> --- a/drivers/scsi/scsi.c
> +++ b/drivers/scsi/scsi.c
> @@ -578,6 +578,8 @@ static bool scsi_cdl_check_cmd(struct scsi_device *sdev, u8 opcode, u16 sa,
>          int ret;
>          u8 cdlp;
>
> +        return false;
> +
>          /* Check operation code */
>          ret = scsi_report_opcode(sdev, buf, SCSI_CDL_CHECK_BUF_LEN, opcode, sa);
>          if (ret <= 0)

It is weird that this solves anything... the MAINTENANCE_IN command
issued by scsi_report_opcode() ends up being emulated in libata with
ata_scsiop_maint_in(). There are no actual commands issued to the drive,
so nothing that could actually fail/cause issues. By the time this is
issued, the ATA drive is also fully probed...

Or is the drive connected to the Broadcom HBA you have? In that case,
libata is not used and the HBA FW SAT (scsi-ata-translation) is likely
to blame.

Could you send a full dmesg output for a clean boot and for a failed one
so that I can compare?

-- 
Damien Le Moal
Western Digital Research
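
For reference, the decision that the debug patch above short-circuits can be sketched in plain userspace C. This is only a rough illustration, not the kernel code: the helper name rsoc_reports_cdl() is hypothetical, and the byte layout (a support field in the low bits of byte 1 and a cdlp field in bits 4:3 of byte 1 of the REPORT SUPPORTED OPERATION CODES reply) is an assumption that is not stated anywhere in this thread and should be checked against drivers/scsi/scsi.c and SPC-6.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not kernel code: given the first two bytes of a
 * REPORT SUPPORTED OPERATION CODES reply for one command, decide whether
 * the command reports a command duration limits (CDL) mode page.
 *
 * Assumed layout (to be verified against SPC-6):
 *   buf[1] bits 1:0  support field, 3 = supported
 *   buf[1] bits 4:3  cdlp, 1 or 2 = a CDL mode page applies
 */
bool rsoc_reports_cdl(const uint8_t *buf)
{
	uint8_t cdlp;

	/* The command itself must be reported as supported. */
	if ((buf[1] & 0x03) != 0x03)
		return false;

	/* Extract the two-bit cdlp field. */
	cdlp = (buf[1] & 0x18) >> 3;

	return cdlp == 0x01 || cdlp == 0x02;
}
```

With this layout, a reply whose second byte is 0x0b (supported, cdlp = 1) would count as CDL capable, while 0x03 (supported, cdlp = 0) would not. Dumping the two reply bytes on a good and on a bad boot would show whether the emulated reply differs at all between them.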