Update for anyone who saw this and wonders what happened:

As more drives were added, the systems became more and more unstable at boot. With fewer than 8 drives, it booted pretty much every time. With 11 drives (described in my original post), it failed at init ~3 out of 4 times. Once I added the 12th drive I couldn't get it to come up at all, even after a dozen or so reboots; there would be timeouts in the pm80xx module, and no drives attached to that controller would show up.

One suggestion was to use shorter cables -- but I couldn't use any cables shorter than 0.8m, as anything shorter didn't fit in the chassis (Supermicro 36-slot chassis). I also tried the latest kernels to no avail, and tried adjusting the module init timeouts in the code, which made no difference. The Adaptec card was at the latest firmware (and still is; there haven't been any updates), with the stock Linux drivers for the pm80xx card.

There was a comment that it's best to match the expander chip vendor (LSI SAS2X28 & SAS2X36) with the HBA vendor, so I ended up replacing the Adaptec 71605H with an LSI 9207-8i HBA (using a 1m cable to each expander). After the HBA swap, both (all) systems are working perfectly.

On Fri, Sep 5, 2014 at 3:13 PM, Andrew Robertson <andyrobertson101@xxxxxxxxx> wrote:
> More info, as requested:
>
> There are 2 sas expander chips in the system (LSI SAS2X28 & SAS2X36),
> and there's a connection to each of them from the 71605H via a
> separate 0.8m Adaptec cable. (Adaptec 2280200-R). This is a
> Supermicro chassis.
>
> Firmware version:
> # cat /sys/devices/pci0000:00/0000:00:01.1/0000:02:00.0/host0/scsi_host/host0/fw_version
> 02.08.60.01
>
> I don't have immediate physical access to the box, so I'm not able to
> do the hotplug logging test. However, I did "reset" the PCI device
> via /sys, as shown below, and captured the logs from that (attached,
> "dmesg.out.txt").
>
> With the latest kernel, v3.17-rc3, I got a kernel "null pointer
> dereference" in the pm80xx module (dmesg output pasted in below).
>
> I will also try replacing the cables with 0.5m adaptec cables as
> suggested to see if that helps.
>
> ---
>
> Reset test:
>
> # find /sys -iname logging_level
> /sys/devices/pci0000:00/0000:00:01.1/0000:02:00.0/host0/scsi_host/host0/logging_level
> # echo 0xfff > /sys/devices/pci0000:00/0000:00:01.1/0000:02:00.0/host0/scsi_host/host0/logging_level
> # echo 1 > /sys/devices/pci0000:00/0000:00:01.1/0000:02:00.0/rescan
> # echo 1 > /sys/devices/pci0000:00/0000:00:01.1/0000:02:00.0/reset
> # sync
> (at which point the process hung; in the dmesg you can see a "sync
> blocked for more than 120 seconds")
>
>
> The disk/expander layout looks like:
> [0:0:0:0]    disk    ATA  WDC WD60EFRX-68M  82.0  /dev/sdb  /dev/sg1
> [0:0:1:0]    disk    ATA  WDC WD60EFRX-68M  82.0  /dev/sdc  /dev/sg2
> [0:0:2:0]    disk    ATA  WDC WD60EFRX-68M  82.0  /dev/sdd  /dev/sg3
> [0:0:3:0]    disk    ATA  WDC WD60EFRX-68M  82.0  /dev/sde  /dev/sg4
> [0:0:4:0]    disk    ATA  WDC WD60EFRX-68M  82.0  /dev/sdf  /dev/sg5
> [0:0:5:0]    disk    ATA  WDC WD60EFRX-68M  82.0  /dev/sdg  /dev/sg6
> [0:0:6:0]    enclosu LSI  SAS2X36           0e12  -         /dev/sg7
> [0:0:7:0]    disk    ATA  WDC WD60EFRX-68M  82.0  /dev/sdh  /dev/sg8
> [0:0:8:0]    disk    ATA  WDC WD60EFRX-68M  82.0  /dev/sdi  /dev/sg9
> [0:0:9:0]    disk    ATA  WDC WD60EFRX-68M  82.0  /dev/sdj  /dev/sg10
> [0:0:10:0]   disk    ATA  WDC WD60EFRX-68M  82.0  /dev/sdk  /dev/sg11
> [0:0:11:0]   disk    ATA  WDC WD60EFRX-68M  82.0  /dev/sdl  /dev/sg12
> [0:0:12:0]   enclosu LSI  SAS2X28           0e12  -         /dev/sg13
>
>
> dmesg from latest kernel v3.17-rc3 showing (what appears to possibly
> be) a kernel bug:
>
> This happened right after I ran "lsscsi" (though can't say if it was
> actually caused by that).
>
> [ 309.327805] BUG: unable to handle kernel NULL pointer dereference
> at 0000000000000290
> [ 309.335829] IP: [<ffffffffc0080d4f>]
> pm8001_dev_gone_notify+0x2f/0x220 [pm80xx]
> [ 309.343258] PGD 0
> [ 309.345381] Oops: 0000 [#1] SMP
> [ 309.348797] Modules linked in: ipmi_devintf autofs4 arc4 nfsd
> auth_rpcgss intel_rapl nfs_acl x86_pkg_temp_thermal nfs
> intel_powerclamp coretemp lockd kvm_intel sunrpc kvm fscache
> crct10dif_pclmul ttm crc32_pclmul drm_kms_helper rt2800usb
> ghash_clmulni_intel rt2x00usb rt2800lib rt2x00lib mac80211 aesni_intel
> drm aes_x86_64 lrw gf128mul glue_helper ablk_helper cfg80211 cryptd
> crc_ccitt syscopyarea joydev sysfillrect sysimgblt shpchp lpc_ich
> ipmi_si ipmi_msghandler mac_hid video ie31200_edac edac_core lp
> parport ses enclosure hid_generic usbhid hid raid10 raid456
> async_raid6_recov async_memcpy async_pq async_xor async_tx xor igb
> raid6_pq i2c_algo_bit raid1 e1000e pm80xx dca raid0 libsas ahci ptp
> multipath scsi_transport_sas libahci pps_core linear
> [ 309.420132] CPU: 5 PID: 1998 Comm: kworker/5:2 Not tainted
> 3.17.0-031700rc3-generic #201409031132
> [ 309.429051] Hardware name: Supermicro X10SLL-F/X10SLL-F, BIOS 2.0 04/24/2014
> [ 309.436148] Workqueue: pm80xx pm8001_work_fn [pm80xx]
> [ 309.441317] task: ffff880403690000 ti: ffff880404fa4000 task.ti:
> ffff880404fa4000
> [ 309.448867] RIP: 0010:[<ffffffffc0080d4f>] [<ffffffffc0080d4f>]
> pm8001_dev_gone_notify+0x2f/0x220 [pm80xx]
> [ 309.458765] RSP: 0018:ffff880404fa7cd8 EFLAGS: 00010286
> [ 309.464145] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000006e02
> [ 309.471342] RDX: 0000000000000000 RSI: 0000000000000286 RDI: ffff880403e98000
> [ 309.478546] RBP: ffff880404fa7d18 R08: ffff880404fa4000 R09: 0000000000000000
> [ 309.485749] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8804022b8000
> [ 309.492954] R13: ffff880403e98000 R14: ffff880401b80180 R15: 0000000000000000
> [ 309.500163] FS: 0000000000000000(0000) GS:ffff88041fd40000(0000)
> knlGS:0000000000000000
> [ 309.508337] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 309.514147] CR2: 0000000000000290 CR3: 0000000001c16000 CR4: 00000000001407e0
> [ 309.521348] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 309.528551] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 309.535752] Stack:
> [ 309.537837] ffff8804022b8000 ffff880401bbe000 ffff880401b80180
> ffff880403e98000
> [ 309.545599] ffff8804022b8000 ffff880401bbe000 ffff880401b80180
> 0000000000000000
> [ 309.553349] ffff880404fa7d78 ffffffffc00826e8 ffff880404fa7d78
> ffffffff810ababc
> [ 309.561091] Call Trace:
> [ 309.563611] [<ffffffffc00826e8>]
> pm8001_I_T_nexus_event_handler+0xb8/0x1f0 [pm80xx]
> [ 309.571449] [<ffffffff810ababc>] ? put_prev_entity+0x3c/0x320
> [ 309.577355] [<ffffffffc0084559>] pm8001_work_fn+0x299/0x480 [pm80xx]
> [ 309.583869] [<ffffffff8108ce6f>] process_one_work+0x17f/0x490
> [ 309.589773] [<ffffffff8108d7eb>] worker_thread+0x11b/0x3f0
> [ 309.595413] [<ffffffff8108d6d0>] ? create_worker+0x1e0/0x1e0
> [ 309.601225] [<ffffffff81093349>] kthread+0xc9/0xe0
> [ 309.606176] [<ffffffff81093280>] ? flush_kthread_worker+0x90/0x90
> [ 309.612421] [<ffffffff817a3f3c>] ret_from_fork+0x7c/0xb0
> [ 309.617890] [<ffffffff81093280>] ? flush_kthread_worker+0x90/0x90
> [ 309.624135] Code: 00 55 48 89 e5 48 83 ec 40 4c 89 6d e8 4c 89 75
> f0 49 89 fd 4c 89 7d f8 48 89 5d d8 4c 89 65 e0 48 8b 47 30 48
> 8b 9f 78 01 00 00 <48> 8b 80 90 02 00 00 4c 8b a0 90 01 00 00 4d 8d 74
> 24 38 4c 89
> [ 309.647551] RIP [<ffffffffc0080d4f>]
> pm8001_dev_gone_notify+0x2f/0x220 [pm80xx]
> [ 309.655100] RSP <ffff880404fa7cd8>
> [ 309.658657] CR2: 0000000000000290
> [ 309.662045] ---[ end trace 084eaa8941942e9a ]---
> [ 309.770413] BUG: unable to handle kernel paging request at ffffffffffffffd8
> [ 309.777578] IP: [<ffffffff810936e0>] kthread_data+0x10/0x20
> [ 309.783287] PGD 1c19067 PUD 1c1b067 PMD 0
> [ 309.787652] Oops: 0000 [#2] SMP
> [ 309.791080] Modules linked in: ipmi_devintf autofs4 arc4 nfsd
> auth_rpcgss intel_rapl nfs_acl x86_pkg_temp_thermal nfs
> intel_powerclamp coretemp lockd kvm_intel sunrpc kvm fscache
> crct10dif_pclmul ttm crc32_pclmul drm_kms_helper rt2800usb
> ghash_clmulni_intel rt2x00usb rt2800lib rt2x00lib mac80211 aesni_intel
> drm aes_x86_64 lrw gf128mul glue_helper ablk_helper cfg80211 cryptd
> crc_ccitt syscopyarea joydev sysfillrect sysimgblt shpchp lpc_ich
> ipmi_si ipmi_msghandler mac_hid video ie31200_edac edac_core lp
> parport ses enclosure hid_generic usbhid hid raid10 raid456
> async_raid6_recov async_memcpy async_pq async_xor async_tx xor igb
> raid6_pq i2c_algo_bit raid1 e1000e pm80xx dca raid0 libsas ahci ptp
> multipath scsi_transport_sas libahci pps_core linear
> [ 309.862627] CPU: 5 PID: 1998 Comm: kworker/5:2 Tainted: G D
> 3.17.0-031700rc3-generic #201409031132
> [ 309.872709] Hardware name: Supermicro X10SLL-F/X10SLL-F, BIOS 2.0 04/24/2014
> [ 309.879834] task: ffff880403690000 ti: ffff880404fa4000 task.ti:
> ffff880404fa4000
> [ 309.887407] RIP: 0010:[<ffffffff810936e0>] [<ffffffff810936e0>]
> kthread_data+0x10/0x20
> [ 309.895562] RSP: 0018:ffff880404fa78e8 EFLAGS: 00010092
> [ 309.900939] RAX: 0000000000000000 RBX: 0000000000000005 RCX: ffffffff81ec2e80
> [ 309.908139] RDX: 0000000000000000 RSI: 0000000000000005 RDI: ffff880403690000
> [ 309.915343] RBP: ffff880404fa78e8 R08: 0000000000000000 R09: 0000000000000246
> [ 309.922549] R10: 000000000000001a R11: 0000000000000013 R12: 0000000000000005
> [ 309.929749] R13: ffff880403690538 R14: 0000000000000001 R15: 0000000000000046
> [ 309.936953] FS: 0000000000000000(0000) GS:ffff88041fd40000(0000)
> knlGS:0000000000000000
> [ 309.945130] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 309.950942] CR2: 0000000000000028 CR3: 0000000001c16000 CR4: 00000000001407e0
> [ 309.958140] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 309.965345] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 309.972548] Stack:
> [ 309.974632] ffff880404fa7908 ffffffff8108e725 ffff880404fa7908
> ffff88041fd545c0
> [ 309.982368] ffff880404fa7988 ffffffff8179f8b3 ffff880404fa7948
> ffff880403690000
> [ 309.990110] ffff880404fa7fd8 00000000000145c0 ffff880404fa7948
> 00000000000145c0
> [ 309.997858] Call Trace:
> [ 310.000377] [<ffffffff8108e725>] wq_worker_sleeping+0x15/0xb0
> [ 310.006281] [<ffffffff8179f8b3>] __schedule+0x5e3/0x770
> [ 310.011666] [<ffffffff8179fb19>] schedule+0x29/0x70
> [ 310.016704] [<ffffffff81077795>] do_exit+0x2a5/0x470
> [ 310.021830] [<ffffffff810c9fbc>] ? kmsg_dump+0x9c/0xc0
> [ 310.027128] [<ffffffff81018d08>] oops_end+0xb8/0x160
> [ 310.032253] [<ffffffff81788489>] no_context+0x1be/0x1cd
> [ 310.037631] [<ffffffff8178866b>] __bad_area_nosemaphore+0x1d3/0x1f2
> [ 310.044059] [<ffffffff810ababc>] ? put_prev_entity+0x3c/0x320
> [ 310.049962] [<ffffffff8178869d>] bad_area_nosemaphore+0x13/0x15
> [ 310.056043] [<ffffffff81062312>] __do_page_fault+0x3b2/0x550
> [ 310.061856] [<ffffffff810ae3aa>] ? idle_balance+0x7a/0x2c0
> [ 310.067493] [<ffffffff810ababc>] ? put_prev_entity+0x3c/0x320
> [ 310.073399] [<ffffffff810135c6>] ? __switch_to+0xf6/0x5b0
> [ 310.078960] [<ffffffff8106263e>] do_page_fault+0x3e/0x80
> [ 310.084431] [<ffffffff817a6088>] page_fault+0x28/0x30
> [ 310.089643] [<ffffffffc0080d4f>] ?
> pm8001_dev_gone_notify+0x2f/0x220 [pm80xx]
> [ 310.096957] [<ffffffffc00826e8>]
> pm8001_I_T_nexus_event_handler+0xb8/0x1f0 [pm80xx]
> [ 310.104791] [<ffffffff810ababc>] ? put_prev_entity+0x3c/0x320
> [ 310.110691] [<ffffffffc0084559>] pm8001_work_fn+0x299/0x480 [pm80xx]
> [ 310.117205] [<ffffffff8108ce6f>] process_one_work+0x17f/0x490
> [ 310.123106] [<ffffffff8108d7eb>] worker_thread+0x11b/0x3f0
> [ 310.128752] [<ffffffff8108d6d0>] ? create_worker+0x1e0/0x1e0
> [ 310.134572] [<ffffffff81093349>] kthread+0xc9/0xe0
> [ 310.139519] [<ffffffff81093280>] ? flush_kthread_worker+0x90/0x90
> [ 310.145763] [<ffffffff817a3f3c>] ret_from_fork+0x7c/0xb0
> [ 310.151227] [<ffffffff81093280>] ? flush_kthread_worker+0x90/0x90
> [ 310.157473] Code: 00 48 89 e5 5d 48 8b 40 c8 48 c1 e8 02 83 e0 01
> c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 8b 87 c8 04 00 00
> 55 48 89 e5 <48> 8b 40 d8 5d c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44
> 00 00
> [ 310.180886] RIP [<ffffffff810936e0>] kthread_data+0x10/0x20
> [ 310.186679] RSP <ffff880404fa78e8>
> [ 310.190239] CR2: ffffffffffffffd8
> [ 310.193623] ---[ end trace 084eaa8941942e9b ]---
> [ 310.306147] Fixing recursive fault but reboot is needed!
>
>
> On Wed, Sep 3, 2014 at 12:06 AM, Emmanuel Florac <eflorac@xxxxxxxxxxxxxx> wrote:
>> On Mon, 1 Sep 2014 09:06:46 -0700
>> Andrew Robertson <andyrobertson101@xxxxxxxxx> wrote:
>>
>>> I'm happy to test patches/etc on this system if necessary -- and/or if
>>> someone can help point me in the right direction, I'd appreciate it.
>>
>> In my experience the 7xxx5 are very sensitive to cable length and
>> backplane type: basically work fine with 50 cm cables, and fails with 80
>> cm cables with some backplanes (works with Supermicro, not with AIC,
>> etc).
>>
>> So what is the backplane and cables you're using?
>>
>> --
>> ------------------------------------------------------------------------
>> Emmanuel Florac     |   Direction technique
>>                     |   Intellique
>>                     |   <eflorac@xxxxxxxxxxxxxx>
>>                     |   +33 1 78 94 84 02
>> ------------------------------------------------------------------------