Re: Error 1 & scsi_add_device()

This is very similar to the race conditions we've seen previously on
the sdev struct when doing deletes followed by adds. When it hit, it
usually croaked in the kref or class/object code.

The starget had a similar race condition, and handles it with the
xxx_DEL state. We need to add the same type of thing in the sdev case,
and it needs to wait for the xxx_DEL state to clear (e.g. lookup of the
sdev eventually fails so you can alloc a new sdev), just like the target
code does. So I think it's a little more than what you have proposed
below.
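
As a rough illustration of the "wait for the xxx_DEL state to clear" idea,
here is a minimal userspace sketch. It is not kernel code: every name in it
(slot, host_model, try_add, mark_deleted, release) is hypothetical, and it
assumes the caller simply retries a failed add rather than sleeping. The
point it shows is only that a new entry for a nexus is refused while a
deleted entry for that nexus is still on the list.

```c
#include <stddef.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
enum slot_state { SLOT_FREE = 0, SLOT_RUNNING, SLOT_DEL };

struct slot {
    unsigned channel, id, lun;   /* the nexus */
    enum slot_state state;
};

#define MODEL_SLOTS 4

struct host_model {
    struct slot slots[MODEL_SLOTS];
};

/* Find any entry at the nexus, including one still in the DEL state. */
static struct slot *find_any(struct host_model *h,
                             unsigned channel, unsigned id, unsigned lun)
{
    for (size_t i = 0; i < MODEL_SLOTS; i++) {
        struct slot *s = &h->slots[i];
        if (s->state != SLOT_FREE && s->channel == channel &&
            s->id == id && s->lun == lun)
            return s;
    }
    return NULL;
}

/* Returns 0 on success; -1 if the nexus is still occupied or no slot is free. */
static int try_add(struct host_model *h,
                   unsigned channel, unsigned id, unsigned lun)
{
    if (find_any(h, channel, id, lun))
        return -1;               /* old entry not gone yet: caller retries */
    for (size_t i = 0; i < MODEL_SLOTS; i++) {
        struct slot *s = &h->slots[i];
        if (s->state == SLOT_FREE) {
            s->channel = channel;
            s->id = id;
            s->lun = lun;
            s->state = SLOT_RUNNING;
            return 0;
        }
    }
    return -1;
}

/* Models marking a device deleted while references may still be held. */
static void mark_deleted(struct slot *s) { s->state = SLOT_DEL; }

/* Models the final reference drop that unlinks the entry from the list. */
static void release(struct slot *s) { s->state = SLOT_FREE; }
```

In this model an add that arrives between mark_deleted() and release()
fails, and the same add succeeds once release() has run, so two entries
for one nexus never coexist.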

-- james s

Salyzyn, Mark wrote:
With additional research I have found the following sequence:

- scsi_remove_device for the nexus finds /dev/sda and marks it deleted
(SDEV_DEL).
- scsi_add_device for the nexus adds /dev/sdb (a new device).
- A subsequent scsi_device_lookup for the nexus finds /dev/sda first,
sees via scsi_device_get that it is marked deleted, and returns NULL
rather than progressing to the /dev/sdb node that shares the same nexus.
- Subsequent scsi_remove_device attempts for the nexus fail because the
scsi_device_lookup used to acquire the device reference keeps on
effectively finding /dev/sda.
- Subsequent scsi_add_device calls for the nexus fail because /dev/sdb
already exists.
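
The sequence above can be reproduced in a small userspace model. This is
only a sketch under stated assumptions: the model_* names are hypothetical
stand-ins for the kernel structures and functions, the device list is a
plain array, and an add is assumed to fail whenever a live device already
occupies the nexus.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel types; only what the race needs. */
enum model_state { MODEL_RUNNING, MODEL_DEL };

struct model_sdev {
    const char *name;            /* e.g. "sda" */
    unsigned channel, id, lun;   /* the nexus */
    enum model_state state;
};

#define MODEL_MAX_DEVS 8

struct model_host {
    struct model_sdev devs[MODEL_MAX_DEVS];
    size_t count;
};

/* Stand-in for __scsi_device_lookup before the patch: first match wins. */
static struct model_sdev *model_raw_lookup(struct model_host *h,
        unsigned channel, unsigned id, unsigned lun)
{
    for (size_t i = 0; i < h->count; i++) {
        struct model_sdev *d = &h->devs[i];
        if (d->channel == channel && d->id == id && d->lun == lun)
            return d;
    }
    return NULL;
}

/* Stand-in for scsi_device_get: refuses a device marked deleted. */
static int model_get(const struct model_sdev *d)
{
    return d->state == MODEL_DEL ? -1 : 0;
}

/* Stand-in for scsi_device_lookup: raw lookup, then take a reference. */
static struct model_sdev *model_lookup(struct model_host *h,
        unsigned channel, unsigned id, unsigned lun)
{
    struct model_sdev *d = model_raw_lookup(h, channel, id, lun);
    if (d && model_get(d) != 0)
        return NULL;             /* deleted entry hides anything behind it */
    return d;
}

/* Stand-in for scsi_add_device: fails if a live device is at the nexus. */
static struct model_sdev *model_add(struct model_host *h, const char *name,
        unsigned channel, unsigned id, unsigned lun)
{
    for (size_t i = 0; i < h->count; i++) {
        struct model_sdev *d = &h->devs[i];
        if (d->state != MODEL_DEL && d->channel == channel &&
            d->id == id && d->lun == lun)
            return NULL;         /* "already exists" */
    }
    assert(h->count < MODEL_MAX_DEVS);
    struct model_sdev *d = &h->devs[h->count++];
    d->name = name;
    d->channel = channel;
    d->id = id;
    d->lun = lun;
    d->state = MODEL_RUNNING;
    return d;
}
```

With "sda" added and then marked MODEL_DEL, adding "sdb" at the same nexus
succeeds, after which model_lookup() returns NULL and a further model_add()
fails, which matches the stuck state described in the list.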

None of this leads me to believe there is any kref node corruption.
However, code that knows a device exists at the nexus and acquires
another reference to the node by nexus (thus using scsi_device_lookup)
rather than through the scsi_device would get an unexpected NULL pointer
and choke. I have not inspected the code for such a path (yet), but I
feel we have risks in any case that need to be addressed.

The aacraid driver should stop calling scsi_remove_device when an array
is deleted ... or ...

I believe what needs to be added is a check for sdev->sdev_state ==
SDEV_DEL in __scsi_device_lookup_by_target and __scsi_device_lookup in
scsi.c:

  struct scsi_device *__scsi_device_lookup_by_target(struct scsi_target *starget,
                                                     uint lun)
  {
        struct scsi_device *sdev;

        list_for_each_entry(sdev, &starget->devices, same_target_siblings) {
-               if (sdev->lun ==lun)
+               if (sdev->sdev_state != SDEV_DEL && sdev->lun == lun)
                        return sdev;
        }
. . .
  struct scsi_device *__scsi_device_lookup(struct Scsi_Host *shost,
                uint channel, uint id, uint lun)
  {
        struct scsi_device *sdev;

        list_for_each_entry(sdev, &shost->__devices, siblings) {
-               if (sdev->channel == channel && sdev->id == id &&
+               if (sdev->sdev_state != SDEV_DEL &&
+                               sdev->channel == channel && sdev->id == id &&
                                sdev->lun ==lun)
                        return sdev;
        }
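
To make the effect of the proposed check concrete, the following userspace
sketch compares an unpatched and a patched lookup over the same device
array. It is an illustration only: the sketch_* names are hypothetical, an
array stands in for the kernel's linked list, and reference counting is
left out entirely.

```c
#include <stddef.h>

/* Hypothetical model types; not the kernel's struct scsi_device. */
enum sketch_state { SKETCH_RUNNING, SKETCH_DEL };

struct sketch_sdev {
    unsigned channel, id, lun;   /* the nexus */
    enum sketch_state state;
};

/* Unpatched behaviour: the first nexus match is returned, deleted or not. */
static struct sketch_sdev *sketch_lookup_unpatched(struct sketch_sdev *devs,
        size_t count, unsigned channel, unsigned id, unsigned lun)
{
    for (size_t i = 0; i < count; i++) {
        struct sketch_sdev *d = &devs[i];
        if (d->channel == channel && d->id == id && d->lun == lun)
            return d;
    }
    return NULL;
}

/* Patched behaviour: deleted entries are skipped, so a live device that
 * shares the nexus with a deleted one can still be found. */
static struct sketch_sdev *sketch_lookup_patched(struct sketch_sdev *devs,
        size_t count, unsigned channel, unsigned id, unsigned lun)
{
    for (size_t i = 0; i < count; i++) {
        struct sketch_sdev *d = &devs[i];
        if (d->state != SKETCH_DEL && d->channel == channel &&
            d->id == id && d->lun == lun)
            return d;
    }
    return NULL;
}
```

Given an array holding a deleted entry followed by a live entry at the same
nexus, the unpatched lookup returns the deleted entry and the patched
lookup returns the live one. The sketch says nothing about whether two
entries should be allowed to share a nexus in the first place.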

Sincerely -- Mark Salyzyn

-----Original Message-----
From: linux-scsi-owner@xxxxxxxxxxxxxxx [mailto:linux-scsi-owner@xxxxxxxxxxxxxxx] On Behalf Of Salyzyn, Mark
Sent: Monday, August 14, 2006 8:17 AM
To: linux-scsi@xxxxxxxxxxxxxxx
Cc: Mark Haverkamp
Subject: Error 1 & scsi_add_device()


The aacraid driver runs a kernel thread that monitors, amongst several
things, the array status events and will issue requests to add or remove
the scsi devices associated with the arrays.

We are creating and deleting arrays on an aggressive scale with the
aacraid driver, against a 2.6.17.8 SMP kernel (2.6.13.2 and 2.6.17.7
have been tried as well) based on an FC4 Gold configuration, with either
the inbox or an updated driver. We get a kernel panic that I believe
could be tied to an 'Error 1' in the sysfs handler popping up after
multiple scsi_add_device() calls in a row. The repeated scsi_add_device
calls result from scsi_device_lookup failing to report the device during
the subsequent 'delete' portion of the cycle, so the scsi_remove_device
call is never issued. This pattern repeats 10 times before the panic
happens. In some cases the panic occurs in add_device(); in the enclosed
case it occurs in scsi_is_host_device().

Failures sometimes take overnight to happen, sometimes they are as quick
as this one.

How bad are multiple calls to scsi_add_device()? In some of the cycles
we get read errors during the partition table reads that are part of the
scans, because the array is being torn down while the scan is in
progress. Could there be evil droppings in the partition table that add
misery in subsequent cycles?

Aug 11 13:51:36 Okapi kernel: Adaptec aacraid driver (1.1-5[2429]custom)
Aug 11 13:51:36 Okapi kernel: ACPI: PCI Interrupt 0000:05:0e.0[A] -> GSI
18 (level, low) -> IRQ 17
Aug 11 13:51:36 Okapi kernel: aacraid0: kernel 5.1-0[8860]
Aug 11 13:51:36 Okapi kernel: aacraid0: monitor 5.1-0[8860]
Aug 11 13:51:36 Okapi kernel: aacraid0: bios 5.1-0[8860]
Aug 11 13:51:36 Okapi kernel: aacraid0: serial c997fe
Aug 11 13:51:36 Okapi kernel: aacraid0: Non-DASD support enabled.
Aug 11 13:51:36 Okapi kernel: scsi4 : aacraid
Aug 11 13:51:36 Okapi kernel:   Vendor: Adaptec   Model: Device 1
Rev: V1.0
Aug 11 13:51:36 Okapi kernel:   Type:   Direct-Access
ANSI SCSI revision: 02
Aug 11 13:51:36 Okapi kernel: sda : very big device. try to use READ
CAPACITY(16).
Aug 11 13:51:36 Okapi kernel: SCSI device sda: 10741329920 512-byte hdwr
sectors (5499561 MB)
Aug 11 13:51:36 Okapi kernel: sda: assuming Write Enabled
Aug 11 13:51:36 Okapi kernel: sda: assuming drive cache: write through
Aug 11 13:51:36 Okapi kernel: sda : very big device. try to use READ
CAPACITY(16).
Aug 11 13:51:36 Okapi kernel: SCSI device sda: 10741329920 512-byte hdwr
sectors (5499561 MB)
Aug 11 13:51:36 Okapi kernel: sda: assuming Write Enabled
Aug 11 13:51:36 Okapi kernel: sda: assuming drive cache: write through
Aug 11 13:51:36 Okapi kernel:  sda: unknown partition table
Aug 11 13:51:36 Okapi kernel: sd 4:0:0:0: Attached scsi removable disk
sda
Aug 11 13:51:36 Okapi kernel: sd 4:0:0:0: Attached scsi generic sg1 type
0
Aug 11 13:51:36 Okapi kernel:   Vendor: ST350064  Model: 1AS
Rev: 3.AA
Aug 11 13:51:36 Okapi kernel:   Type:   Direct-Access
ANSI SCSI revision: 05
Aug 11 13:51:36 Okapi kernel: 4:1:8:0: Attached scsi generic sg2 type 0
Aug 11 13:51:36 Okapi kernel:   Vendor: ST350064  Model: 1AS
Rev: 3.AA
Aug 11 13:51:36 Okapi kernel:   Type:   Direct-Access
ANSI SCSI revision: 05
Aug 11 13:51:36 Okapi kernel: 4:1:9:0: Attached scsi generic sg3 type 0
Aug 11 13:51:36 Okapi kernel:   Vendor: ST350064  Model: 1AS
Rev: 3.AA
Aug 11 13:51:36 Okapi kernel:   Type:   Direct-Access
ANSI SCSI revision: 05
Aug 11 13:51:36 Okapi kernel: 4:1:10:0: Attached scsi generic sg4 type
0
Aug 11 13:51:36 Okapi kernel:   Vendor: ST350064  Model: 1AS
Rev: 3.AA
Aug 11 13:51:36 Okapi kernel:   Type:   Direct-Access
ANSI SCSI revision: 05
Aug 11 13:51:36 Okapi kernel: 4:1:11:0: Attached scsi generic sg5 type
0
Aug 11 13:51:36 Okapi kernel:   Vendor: ST350064  Model: 1AS
Rev: 3.AA
Aug 11 13:51:36 Okapi kernel:   Type:   Direct-Access
ANSI SCSI revision: 05
Aug 11 13:51:36 Okapi kernel: 4:1:12:0: Attached scsi generic sg6 type
0
Aug 11 13:51:36 Okapi kernel:   Vendor: ST350064  Model: 1AS
Rev: 3.AA
Aug 11 13:51:36 Okapi kernel:   Type:   Direct-Access
ANSI SCSI revision: 05
Aug 11 13:51:36 Okapi kernel: 4:1:13:0: Attached scsi generic sg7 type
0
Aug 11 13:51:36 Okapi kernel:   Vendor: ST350064  Model: 1AS
Rev: 3.AA
Aug 11 13:51:36 Okapi kernel:   Type:   Direct-Access
ANSI SCSI revision: 05
Aug 11 13:51:36 Okapi kernel: 4:1:14:0: Attached scsi generic sg8 type
0
Aug 11 13:51:36 Okapi kernel:   Vendor: ST350064  Model: 1AS
Rev: 3.AA
Aug 11 13:51:36 Okapi kernel:   Type:   Direct-Access
ANSI SCSI revision: 05
Aug 11 13:51:36 Okapi kernel: 4:1:15:0: Attached scsi generic sg9 type
0
Aug 11 13:51:36 Okapi kernel:   Vendor: ST350064  Model: 1AS
Rev: 3.AA
Aug 11 13:51:36 Okapi kernel:   Type:   Direct-Access
ANSI SCSI revision: 05
Aug 11 13:51:36 Okapi kernel: 4:1:16:0: Attached scsi generic sg10 type
0
Aug 11 13:51:36 Okapi kernel:   Vendor: ST350064  Model: 1AS
Rev: 3.AA
Aug 11 13:51:36 Okapi kernel:   Type:   Direct-Access
ANSI SCSI revision: 05
Aug 11 13:51:36 Okapi kernel: 4:1:17:0: Attached scsi generic sg11 type
0
Aug 11 13:51:36 Okapi kernel:   Vendor: ST350064  Model: 1AS
Rev: 3.AA
Aug 11 13:51:36 Okapi kernel:   Type:   Direct-Access
ANSI SCSI revision: 05
Aug 11 13:51:36 Okapi kernel: 4:1:18:0: Attached scsi generic sg12 type
0
Aug 11 13:51:36 Okapi kernel:   Vendor: ST350064  Model: 1AS
Rev: 3.AA
Aug 11 13:51:36 Okapi kernel:   Type:   Direct-Access
ANSI SCSI revision: 05
Aug 11 13:51:36 Okapi kernel: 4:1:19:0: Attached scsi generic sg13 type
0
Aug 11 13:51:36 Okapi kernel:   Vendor: Newisys   Model: SANbloc S50
Rev: T024
Aug 11 13:51:36 Okapi kernel:   Type:   Enclosure
ANSI SCSI revision: 05
Aug 11 13:51:36 Okapi kernel: 4:3:0:0: Attached scsi generic sg14 type
13
. . .
Aug 11 15:46:08 Okapi kernel:
device=scsi_device_lookup(host4,0,0,0)
scsi_remove_device(device)
scsi_device_put(device)
		Note: This is the last time scsi_device_lookup() returns
a value.
. . .
		Cycle Mark
. . .
Aug 11 15:46:19 Okapi kernel: scsi_add_device(ffff810035b7c000{4}, 0, 0,
0)
Aug 11 15:46:19 Okapi kernel:   Vendor: Adaptec   Model: Device  1
Rev: V1.0
Aug 11 15:46:19 Okapi kernel:   Type:   Direct-Access
ANSI SCSI revision: 02
Aug 11 15:46:20 Okapi kernel: sdb : very big device. try to use READ
CAPACITY(16).
Aug 11 15:46:20 Okapi kernel: SCSI device sdb: 10741329920 512-byte hdwr
sectors (5499561 MB)
Aug 11 15:46:20 Okapi kernel: sdb: assuming Write Enabled
Aug 11 15:46:20 Okapi kernel: sdb: assuming drive cache: write through
Aug 11 15:46:20 Okapi kernel: sdb : very big device. try to use READ
CAPACITY(16).
Aug 11 15:46:20 Okapi kernel: SCSI device sdb: 10741329920 512-byte hdwr
sectors (5499561 MB)
Aug 11 15:46:20 Okapi kernel: sdb: assuming Write Enabled
Aug 11 15:46:20 Okapi kernel: sdb: assuming drive cache: write through
Aug 11 15:46:20 Okapi kernel:  sdb: unknown partition table
Aug 11 15:46:20 Okapi kernel: sd 4:0:0:0: Attached scsi removable disk
sdb
Aug 11 15:46:20 Okapi kernel: sd 4:0:0:0: Attached scsi generic sg1 type
0
. . .
Aug 11 15:46:34 Okapi kernel:
device=scsi_device_lookup(host4,0,0,0)=NULL
. . .
Aug 11 15:46:43 Okapi kernel: scsi_add_device(ffff810035b7c000{4}, 0, 0,
0)
Aug 11 15:46:44 Okapi kernel:   Vendor: Adaptec   Model: Device  1
Rev: V1.0
Aug 11 15:46:44 Okapi kernel:   Type:   Direct-Access
ANSI SCSI revision: 02
Aug 11 15:46:44 Okapi kernel: error 1
. . .
			Above cycle repeated 10 times sometimes with:
Aug 11 15:47:01 Okapi kernel: sd 4:0:0:0: SCSI error: return code =
0x8000002
Aug 11 15:47:01 Okapi kernel: sdb: Current: sense key: Hardware Error
Aug 11 15:47:01 Okapi kernel:     Additional sense: Internal target
failure
Aug 11 15:47:01 Okapi kernel: Info fld=0x0
Aug 11 15:47:01 Okapi kernel: end_request: I/O error, dev sdb, sector 0
Aug 11 15:47:01 Okapi kernel: Buffer I/O error on device sdb, logical
block 0
Aug 11 15:47:01 Okapi kernel: sd 4:0:0:0: SCSI error: return code =
0x8000002
Aug 11 15:47:01 Okapi kernel: sdb: Current: sense key: Hardware Error
Aug 11 15:47:01 Okapi kernel: sd 4:0:0:0: SCSI error: return code =
0x8000002
			During the scsi_add_device portion of the cycle.
. . .
Aug 11 15:51:11 Okapi kernel: scsi_add_device(ffff810035b7c000{4}, 0, 0,
0)
Aug 11 15:51:12 Okapi kernel: Unable to handle kernel NULL pointer
dereference at 0000000000000238 RIP:
Aug 11 15:51:12 Okapi kernel: <ffffffff80338426>{scsi_is_host_device+2}
Aug 11 15:51:12 Okapi kernel: PGD 316bf067 PUD 324d0067 PMD 0
Aug 11 15:51:12 Okapi kernel: Oops: 0000 [1] SMP
Aug 11 15:51:12 Okapi kernel: CPU 1
Aug 11 15:51:12 Okapi kernel: Modules linked in: nfs lockd sunrpc lm85
hwmon_vid hwmon ext3 jbd video thermal processor fan button aacraid
i2c_i801 i2c_core mptspi sata_sil libata mptfc mptscsih mptctl mptstmod
mptbase aic79xx scsi_transport_spi 3w_9xxx 3w_xxxx sg tg3 e1000 eepro100
mii dm_mod usb_storage usbhid uhci_hcd ohci_hcd ehci_hcd vfat fat linear
usbcore
Aug 11 15:51:12 Okapi kernel: Pid: 2369, comm: aacraid Not tainted
2.6.17.8 #1
Aug 11 15:51:12 Okapi kernel: RIP: 0010:[scsi_is_host_device+2/17]
<ffffffff80338426>{scsi_is_host_device+2}
Aug 11 15:51:12 Okapi kernel: RIP: 0010:[<ffffffff80338426>]
<ffffffff80338426>{scsi_is_host_device+2}
Aug 11 15:51:12 Okapi kernel: RSP: 0018:ffff810035723d30  EFLAGS:
00010246
Aug 11 15:51:12 Okapi kernel: RAX: 0000000000000000 RBX:
0000000000000000 RCX: ffff810035723dc8
Aug 11 15:51:12 Okapi kernel: RDX: 0000000000000000 RSI:
0000000000000000 RDI: 0000000000000000
Aug 11 15:51:12 Okapi kernel: RBP: ffff810035b7c000 R08:
0000000000000001 R09: 0000000000000000
Aug 11 15:51:12 Okapi kernel: R10: 00000000ffffffff R11:
0000000000000000 R12: 0000000000000000
Aug 11 15:51:12 Okapi kernel: R13: 0000000000000000 R14:
0000000000000001 R15: 0000000000000000
Aug 11 15:51:12 Okapi kernel: FS:  0000000000000000(0000)
GS:ffff810001fa34c0(0000) knlGS:0000000000000000
Aug 11 15:51:12 Okapi kernel: CS:  0010 DS: 0018 ES: 0018 CR0:
000000008005003b
Aug 11 15:51:12 Okapi kernel: CR2: 0000000000000238 CR3:
0000000031244000 CR4: 00000000000006e0
Aug 11 15:51:12 Okapi kernel: Process aacraid (pid: 2369, threadinfo
ffff810035722000, task ffff81003f9baf20)
Aug 11 15:51:12 Okapi kernel: Stack: ffffffff8033e2fb ffff810035723dc8
0000000000000000 ffff810035bc6000
Aug 11 15:51:12 Okapi kernel: ffffffff8033dfa1 ffff810035670118
0000000000000000 ffff810035b7c160
Aug 11 15:51:12 Okapi kernel: ffff810033588980 0000000000000296
Aug 11 15:51:12 Okapi kernel: Call Trace:
<ffffffff8033e2fb>{scsi_probe_and_add_lun+66}
Aug 11 15:51:12 Okapi kernel:
<ffffffff8033dfa1>{scsi_alloc_target+142}
<ffffffff8033f4ab>{__scsi_add_device+119}
Aug 11 15:51:12 Okapi kernel:        <5>sdb : very big device. try to
use READ CAPACITY(16).
Aug 11 15:51:12 Okapi kernel: SCSI device sdb: 9764843520 512-byte hdwr
sectors (4999600 MB)
Aug 11 15:51:12 Okapi kernel: sdb: assuming Write Enabled
Aug 11 15:51:12 Okapi kernel: sdb: assuming drive cache: write through
Aug 11 15:51:12 Okapi kernel:
sdb:<ffffffff8033f4e1>{scsi_add_device+10}
<ffffffff88172126>{:aacraid:aac_handle_aif+1353}
Aug 11 15:51:12 Okapi kernel:
<ffffffff88172962>{:aacraid:aac_command_thread+372}
Aug 11 15:51:12 Okapi kernel:
<ffffffff802228fb>{default_wake_function+0}
<ffffffff881727ee>{:aacraid:aac_command_thread+0}
Aug 11 15:51:12 Okapi kernel:
<ffffffff802384b4>{keventd_create_kthread+0}
<ffffffff802386fc>{kthread+203}
Aug 11 15:51:12 Okapi kernel:        <ffffffff8020a582>{child_rip+8}
<ffffffff802384b4>{keventd_create_kthread+0}
Aug 11 15:51:12 Okapi kernel:        <ffffffff80238631>{kthread+0}
<ffffffff8020a57a>{child_rip+0}
Aug 11 15:51:12 Okapi kernel:
Aug 11 15:51:12 Okapi kernel: Code: 48 81 bf 38 02 00 00 12 8c 33 80 0f
94 c0 c3 48 81 ef 40 02
Aug 11 15:51:12 Okapi kernel: RIP
<ffffffff80338426>{scsi_is_host_device+2} RSP <ffff810035723d30>
Aug 11 15:51:12 Okapi kernel: CR2: 0000000000000238
Aug 11 15:51:12 Okapi kernel:  unknown partition table

Sincerely -- Mark Salyzyn
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html
