Hot Swap Problems with LSI HBA and LSI Backplane -- reproducible and very frustrating

Hi

I'm having problems with two systems where hot-swapping SATA drives results in their bay being permanently disabled until I cold boot the system. My hardware configuration is fairly straightforward:

Host Bus Adapter: LSI SAS9207-8i (contains the LSISAS2308)
Case: Supermicro SuperChassis 826E2-R800LPB (contains the BPN-SAS-826EL2 backplane)
Backplane: Supermicro BPN-SAS-826EL2 (contains two LSISASx28 SAS Expanders)
Hard Drives: Western Digital WD3000BLFS-01YBU4, Western Digital WD20EARS, Seagate ST3000DM001, Seagate ST4000DM000 (I have many other types and sizes to test with)

Some links to technical information that might be relevant:
LSI SAS9207-8i Host Bus Adapter: http://www.lsi.com/products/storagecomponents/Pages/LSISAS9207-8i.aspx#two
LSISAS2308: http://www.lsi.com/products/storagecomponents/Pages/LSISAS2308.aspx
Supermicro SuperChassis 826E2-R800LPB: http://www.supermicro.com/products/chassis/2u/826/sc826e2-r800lp.cfm
LSISASx28 SAS Expander: http://www.lsi.com/products/storagecomponents/Pages/LSISASx28.aspx

Problem in detail
Ultimately I will be booting from a software RAID1 built from the 12 drives in this system. I discovered this problem during testing, so for now I am booting from a Gentoo USB drive, which lets me test all 12 SAS bays (labeled SAS0 through SAS11 on the backplane).

If I boot the system from the USB drive and then insert a Western Digital WD3000BLFS-01YBU4 into SAS0, the drive spins up and is detected. Everything works as expected. I can pull the drive, mpt2sas removes the handle, and I can repeat the process with bays SAS1 through SAS11. Repeating the process with a Western Digital WD20EARS gives the same results: all 12 bays work.

When I repeat the test with a Seagate ST4000DM000, some bays do not spin up the drive. When this happens that bay is dead, and I cannot even use the previously working Western Digital WD3000BLFS-01YBU4 in it. The only thing that gets the bays working again is a cold boot after powering off the system and actually unplugging it for an extended period (>5 minutes).

While doing this testing I did see some strange errors in the kernel logs, but only after switching my HBA out for a Supermicro AOC-USAS2-L8i (which contains the LSISAS2008 and uses the same mpt2sas driver):
Testing SAS8 with ST4000DM000 worked (but there were strange kernel errors):
Sep 17 22:23:18 gentoo-live-usb kernel: [ 1532.322489] scsi 6:0:35:0: Direct-Access ATA ST4000DM000-1F21 CC51 PQ: 0 ANSI: 5
Sep 17 22:23:18 gentoo-live-usb kernel: [ 1532.322499] scsi 6:0:35:0: SATA: handle(0x000b), sas_addr(0x500304800105a94c), phy(12), device_name(0xc500500017534f84)
Sep 17 22:23:18 gentoo-live-usb kernel: [ 1532.322503] scsi 6:0:35:0: SATA: enclosure_logical_id(0x50030442523a2033), slot(8)
Sep 17 22:23:18 gentoo-live-usb kernel: [ 1532.322572] scsi 6:0:35:0: atapi(n), ncq(y), asyn_notify(n), smart(y), fua(y), sw_preserve(y)
Sep 17 22:23:18 gentoo-live-usb kernel: [ 1532.322575] scsi 6:0:35:0: qdepth(32), tagged(1), simple(0), ordered(0), scsi_level(6), cmd_que(1)
Sep 17 22:23:18 gentoo-live-usb kernel: [ 1532.322762] sd 6:0:35:0: Attached scsi generic sg2 type 0
Sep 17 22:23:18 gentoo-live-usb kernel: [ 1532.323340] sd 6:0:35:0: [sdb] physical block alignment offset: 4096
Sep 17 22:23:18 gentoo-live-usb kernel: [ 1532.323345] sd 6:0:35:0: [sdb] 7814037168 512-byte logical blocks: (4.00 TB/3.63 TiB)
Sep 17 22:23:18 gentoo-live-usb kernel: [ 1532.323347] sd 6:0:35:0: [sdb] 4096-byte physical blocks
Sep 17 22:23:18 gentoo-live-usb kernel: [ 1532.400933] sd 6:0:35:0: [sdb] Write Protect is off
Sep 17 22:23:18 gentoo-live-usb kernel: [ 1532.400938] sd 6:0:35:0: [sdb] Mode Sense: 73 00 00 08
Sep 17 22:23:18 gentoo-live-usb kernel: [ 1532.401764] sd 6:0:35:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Sep 17 22:23:18 gentoo-live-usb kernel: [ 1532.524835] sdb: sdb1 sdb2 sdb3
Sep 17 22:23:18 gentoo-live-usb kernel: [ 1532.527592] AMD-Vi: Event logged [IO_PAGE_FAULT device=41:00.0 domain=0x0014 address=0x0000000010000000 flags=0x0020]
Sep 17 22:23:18 gentoo-live-usb kernel: [ 1532.527598] AMD-Vi: Event logged [IO_PAGE_FAULT device=41:00.0 domain=0x0014 address=0x0000000010000040 flags=0x0020]
Sep 17 22:23:18 gentoo-live-usb kernel: [ 1532.527601] AMD-Vi: Event logged [IO_PAGE_FAULT device=41:00.0 domain=0x0014 address=0x0000000010000010 flags=0x0020]
Sep 17 22:23:18 gentoo-live-usb kernel: [ 1532.527609] AMD-Vi: Event logged [IO_PAGE_FAULT device=41:00.0 domain=0x0014 address=0x0000000010000020 flags=0x0020]
Sep 17 22:23:18 gentoo-live-usb kernel: [ 1532.613861] sd 6:0:35:0: [sdb] Attached SCSI disk
Sep 17 22:23:18 gentoo-live-usb kernel: [ 1532.739109] md: bind<sdb2>
Sep 17 22:23:18 gentoo-live-usb kernel: [ 1532.742970] md: bind<sdb3>
Sep 17 22:23:18 gentoo-live-usb kernel: [ 1532.746619] md: bind<sdb1>
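For anyone trying to match handles to physical bays across many insertions, the attach messages above carry the handle, SAS address, phy and enclosure slot. The following is only a rough sketch, assuming the log format is exactly as shown in this message; the regular expressions and the function name parse_attach_events are made up for illustration and are not part of mpt2sas or any existing tool.

```python
import re

# Patterns written against the two "SATA:" attach lines quoted in this post.
# They are assumptions about the format, not an official mpt2sas log grammar.
HANDLE_RE = re.compile(
    r"scsi (?P<hctl>\d+:\d+:\d+:\d+): SATA: handle\((?P<handle>0x[0-9a-f]+)\), "
    r"sas_addr\((?P<sas_addr>0x[0-9a-f]+)\), phy\((?P<phy>\d+)\)"
)
SLOT_RE = re.compile(
    r"scsi (?P<hctl>\d+:\d+:\d+:\d+): SATA: "
    r"enclosure_logical_id\((?P<encl>0x[0-9a-f]+)\), slot\((?P<slot>\d+)\)"
)


def parse_attach_events(lines):
    """Merge handle and slot lines into one record per H:C:T:L address."""
    devices = {}
    for line in lines:
        m = HANDLE_RE.search(line)
        if m:
            rec = devices.setdefault(m.group("hctl"), {})
            rec["handle"] = m.group("handle")
            rec["sas_addr"] = m.group("sas_addr")
            rec["phy"] = int(m.group("phy"))
            continue
        m = SLOT_RE.search(line)
        if m:
            rec = devices.setdefault(m.group("hctl"), {})
            rec["enclosure"] = m.group("encl")
            rec["slot"] = int(m.group("slot"))
    return devices


# Two lines copied from the log above.
SAMPLE = [
    "Sep 17 22:23:18 gentoo-live-usb kernel: [ 1532.322499] scsi 6:0:35:0: SATA: "
    "handle(0x000b), sas_addr(0x500304800105a94c), phy(12), "
    "device_name(0xc500500017534f84)",
    "Sep 17 22:23:18 gentoo-live-usb kernel: [ 1532.322503] scsi 6:0:35:0: SATA: "
    "enclosure_logical_id(0x50030442523a2033), slot(8)",
]

if __name__ == "__main__":
    for hctl, rec in parse_attach_events(SAMPLE).items():
        print(hctl, rec)
```

Fed the whole of /var/log/messages, this would give one record per attached drive, which makes it easier to see which slots ever produced an attach event and which stayed silent.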
Removed ST4000DM000 from SAS8 and inserted it into SAS6:
Sep 17 22:23:49 gentoo-live-usb kernel: [ 1563.287575] mpt2sas0: removing handle(0x000b), sas_addr(0x500304800105a94c)
Sep 17 22:24:16 gentoo-live-usb kernel: [ 1590.287517] mpt2sas0: device is not present handle(0x000b), no sas_device!!!
Sep 17 22:24:26 gentoo-live-usb kernel: [ 1601.035876] mpt2sas0: removing handle(0x000a), sas_addr(0x500304800105a97d)
Sep 17 22:24:26 gentoo-live-usb kernel: [ 1601.037113] mpt2sas0: expander_remove: handle(0x0009), sas_addr(0x500304800105a97f)

Removing the ST4000DM000 from SAS6 and inserting it into SAS8 failed. No activity in /var/log/messages. Drive does not spin up.
Removing the ST4000DM000 from SAS8 and inserting it into SAS6 failed. No activity in /var/log/messages. Drive does not spin up.
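To spot this pattern in a longer log, one option is to flag every handle that is reported "not present" after the driver has already logged "removing handle" for it. This is a minimal sketch under the assumption that the two message formats are exactly as quoted above; stale_handles is a hypothetical helper name, and the script only reports the ordering of messages, it does not explain what the driver means by them.

```python
import re

# Patterns written against the two mpt2sas messages quoted in this post.
REMOVING_RE = re.compile(r"(mpt2sas\d+): removing handle\((0x[0-9a-f]+)\)")
NOT_PRESENT_RE = re.compile(
    r"(mpt2sas\d+): device is not present handle\((0x[0-9a-f]+)\)"
)


def stale_handles(lines):
    """Return (controller, handle) pairs that were logged as 'not present'
    after an earlier 'removing handle' line for the same pair."""
    removed = set()
    stale = []
    for line in lines:
        m = REMOVING_RE.search(line)
        if m:
            removed.add(m.groups())
            continue
        m = NOT_PRESENT_RE.search(line)
        if m and m.groups() in removed:
            stale.append(m.groups())
    return stale


# Three lines copied from the log above.
SAMPLE = [
    "Sep 17 22:23:49 gentoo-live-usb kernel: [ 1563.287575] mpt2sas0: "
    "removing handle(0x000b), sas_addr(0x500304800105a94c)",
    "Sep 17 22:24:16 gentoo-live-usb kernel: [ 1590.287517] mpt2sas0: "
    "device is not present handle(0x000b), no sas_device!!!",
    "Sep 17 22:24:26 gentoo-live-usb kernel: [ 1601.035876] mpt2sas0: "
    "removing handle(0x000a), sas_addr(0x500304800105a97d)",
]

if __name__ == "__main__":
    for controller, handle in stale_handles(SAMPLE):
        print(f"{controller}: handle {handle} reported not present after removal")
```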

The "device is not present ... no sas_device!!!" message is interesting. What does it mean, given that there certainly is a drive in that SAS bay? I googled AMD-Vi and it seems to be related to the IOMMU, so I disabled that in the BIOS. I'm not doing PCI passthrough on this system, but I did plan to use it as a Xen/KVM host later on. Disabling the IOMMU feature in the BIOS did suppress the AMD-Vi page faults, but I wonder whether something is still broken and is triggering other problems later on, which cause my SAS bays to be disabled until I drain the power from the system.
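If the IOMMU is re-enabled for further testing, counting the page faults per PCI device would show whether they all come from the same function. The sketch below assumes the AMD-Vi message format is exactly as it appears in the log earlier in this message; faults_by_device is a hypothetical helper name. The log itself does not say that 41:00.0 is the HBA, so that would still need to be confirmed against lspci output.

```python
import re
from collections import Counter

# Pattern written against the AMD-Vi lines quoted in this post.
FAULT_RE = re.compile(
    r"AMD-Vi: Event logged \[IO_PAGE_FAULT "
    r"device=(?P<dev>[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]) "
    r"domain=(?P<domain>0x[0-9a-f]+) "
    r"address=(?P<addr>0x[0-9a-f]+) "
    r"flags=(?P<flags>0x[0-9a-f]+)\]"
)


def faults_by_device(lines):
    """Count IO_PAGE_FAULT events per PCI bus:device.function."""
    counts = Counter()
    for line in lines:
        m = FAULT_RE.search(line)
        if m:
            counts[m.group("dev")] += 1
    return counts


# The four fault lines from the log above, with the syslog prefix trimmed.
SAMPLE = [
    "[ 1532.527592] AMD-Vi: Event logged [IO_PAGE_FAULT device=41:00.0 "
    "domain=0x0014 address=0x0000000010000000 flags=0x0020]",
    "[ 1532.527598] AMD-Vi: Event logged [IO_PAGE_FAULT device=41:00.0 "
    "domain=0x0014 address=0x0000000010000040 flags=0x0020]",
    "[ 1532.527601] AMD-Vi: Event logged [IO_PAGE_FAULT device=41:00.0 "
    "domain=0x0014 address=0x0000000010000010 flags=0x0020]",
    "[ 1532.527609] AMD-Vi: Event logged [IO_PAGE_FAULT device=41:00.0 "
    "domain=0x0014 address=0x0000000010000020 flags=0x0020]",
]

if __name__ == "__main__":
    for dev, count in faults_by_device(SAMPLE).items():
        print(dev, count)
```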

Any help would be greatly appreciated.