Linus,

The following changes since commit cd8df932d894f3128c884e3ae1b2b484540513db:

  [SCSI] qla4xxx: Update driver version to 5.02.00-k15 (2012-02-29 17:03:03 -0600)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/djbw/isci.git tags/libsas-fixes

for you to fetch changes up to 3385b6baa9f3bbf69d4c1fc58342936e75d095b1:

  Revert "[SCSI] libsas: fix sas port naming" (2012-04-19 23:48:12 -0700)

----------------------------------------------------------------
libsas-fixes for 3.4-rc4

Regression fixes to stabilize the new workqueue and ata asynchronous
error handling implementation that was merged for v3.4-rc1.

1/ fix regression in sas_drain_work() which was stomping on 'work'
   entries while the workqueue was manipulating them.  User sees random
   crashes when trying to use scsi_transport_sas attributes for resets,
   or during discovery.

2/ (2) fixes for longstanding bugs related to the fact that libata
   (inventor and primary host_eh_scheduled user) had built-in
   assumptions of a 1:1 Scsi_Host-to-ata_port relationship.  The libsas
   1:N arrangement magnified these problems when it gained async eh and
   began scheduling eh in more scenarios (sas-transport resets) in
   3.4-rc1.

3/ lifetime fixes for the rphy since code that has a domain_device
   reference expects to be able to de-reference rphy parameters.

4/ (3) fixes for expander discovery bugs: a recent regression with
   ata-eh clobbering expander-phy data as it polled, leading to system
   crashes; a long-standing bug that caused libsas to be incompatible
   with expanders that advertised "PHY_VACANT" in low order phy
   indexes; and a quirk for expanders that sometimes fail to zero the
   sas address when no device is attached.

5/ fix for a long-standing bug whereby hotunplug events during initial
   host scan can cause a system crash

6/ fix for a mvsas regression caused by the new end-device naming in
   libsas making the incorrect assumption that all phy ids exported by
   an lldd are unique.
----------------------------------------------------------------

These patches, save for the new "scsi: fix eh wakeup (scsi_schedule_eh
vs scsi_restart_operations)" and "Revert "[SCSI] libsas: fix sas port
naming"", were all originally posted before the merge window opened,
and have also appeared in -next for the same timeframe.  The commit
dates are not that aged (9 days old) because they were rebased out of a
larger set of updates that were pending for 3.4.

There is a mix of pure regression fixes and fixes for long-standing
bugs in libsas.  Some of the long-standing bugs are made worse / easier
to trigger by the new async error handling scheme.  The largest patch
in the series is "libata, libsas: introduce sched_eh and end_eh port
ops"; it has been on the list since March 10th.

Jack Wang has independently tested this set with pm8001 and reports
success. [1]

Apologies if scsi-rc-fixes was in the process of picking these up.
With -rc4 looming I lost my nerve and pulled the trigger.

--
Dan

[1]: http://www.spinics.net/lists/linux-scsi/msg58761.html

Dan Williams (11):
      libsas: introduce sas_work to fix sas_drain_work vs sas_queue_work
      libata, libsas: introduce sched_eh and end_eh port ops
      libsas: fix sas_get_port_device regression
      libsas: unify domain_device sas_rphy lifetimes
      libsas: fix ata_eh clobbering ex_phys via smp_ata_check_ready
      libata: make ata_print_id atomic
      libsas, libata: fix start of life for a sas ata_port
      scsi: fix eh wakeup (scsi_schedule_eh vs scsi_restart_operations)
      libsas: fix false positive 'device attached' conditions
      scsi_transport_sas: fix delete vs scan race
      Revert "[SCSI] libsas: fix sas port naming"

Maciej Trela (1):
      libsas: cleanup spurious calls to scsi_schedule_eh

Thomas Jackson (1):
      libsas: fix sas_find_bcast_phy() in the presence of 'vacant' phys

 drivers/ata/libata-core.c           |    8 +++-
 drivers/ata/libata-eh.c             |   57 +++++++++++++++++++++------
 drivers/ata/libata-scsi.c           |   35 +++++++++--------
 drivers/ata/libata.h                |    2 +-
 drivers/scsi/ipr.c                  |    6 ++-
 drivers/scsi/libsas/sas_ata.c       |   72 +++++++++++++++++++++--------------
 drivers/scsi/libsas/sas_discover.c  |   67 ++++++++++++++++++--------------
 drivers/scsi/libsas/sas_event.c     |   36 +++++++++---------
 drivers/scsi/libsas/sas_expander.c  |   56 +++++++++++++++++++++------
 drivers/scsi/libsas/sas_init.c      |   25 ++++++------
 drivers/scsi/libsas/sas_internal.h  |    6 +--
 drivers/scsi/libsas/sas_phy.c       |   21 ++++------
 drivers/scsi/libsas/sas_port.c      |   17 +++------
 drivers/scsi/libsas/sas_scsi_host.c |   28 ++++++++++----
 drivers/scsi/scsi_error.c           |   14 +++++++
 drivers/scsi/scsi_transport_sas.c   |    6 ++-
 include/linux/libata.h              |    7 +++-
 include/scsi/libsas.h               |   44 ++++++++++++++++++---
 include/scsi/sas_ata.h              |    9 ++++-
 19 files changed, 344 insertions(+), 172 deletions(-)

commit 3385b6baa9f3bbf69d4c1fc58342936e75d095b1
Author: Dan Williams <dan.j.williams@xxxxxxxxx>
Date:   Thu Apr 19 23:48:12 2012 -0700

    Revert "[SCSI] libsas: fix sas port naming"

    This reverts commit a692b0eec5efae382dfa800e8b4b083f172921a7.

    Tom reports:

      [    8.741033] ------------[ cut here ]------------
      [    8.741038] WARNING: at fs/sysfs/dir.c:508 sysfs_add_one+0xc1/0xf0()
      [    8.741040] Hardware name: To Be Filled By O.E.M.
      [    8.741041] sysfs: cannot create duplicate filename

    ...and missing 2 out of 4 drives connected to mvsas.

    Commit a692b0ee made the assumption that all the phy ids an lldd
    registers to libsas are unique.  However, in the "multi-chip" case
    mvsas does a rather annoying duplication of phy ids in the array
    passed to libsas.  So, for example, chip0 has phy0-3 at ha phy
    index 0-3 and chip1 has its phy0-3 at ha phy index 4-7.

    The more natural model would be to create a scsi_host (and sas_ha)
    per chip (controller), but for now revert the naming fix which
    unfortunately means dealing with unpredictable end-device names for
    a bit longer.

    Cc: Xiangliang Yu <yuxiangl@xxxxxxxxxxx>
    Cc: Patrick Thomson <patrick.s.thomson@xxxxxxxxx>
    Reported-by: Tom Rini <trini@xxxxxx>
    Tested-by: Tom Rini <trini@xxxxxx>
    Signed-off-by: Dan Williams <dan.j.williams@xxxxxxxxx>

commit e81dcce46fdbb2c968d7314c2f19da3c2bba24d1
Author: Dan Williams <dan.j.williams@xxxxxxxxx>
Date:   Tue Mar 20 10:58:38 2012 -0700

    scsi_transport_sas: fix delete vs scan race

    The following crash results from cases where the end_device has
    been removed before scsi_sysfs_add_sdev has had a chance to run.

     BUG: unable to handle kernel NULL pointer dereference at 0000000000000098
     IP: [<ffffffff8115e100>] sysfs_create_dir+0x32/0xb6
     ...
     Call Trace:
      [<ffffffff8125e4a8>] kobject_add_internal+0x120/0x1e3
      [<ffffffff81075149>] ? trace_hardirqs_on+0xd/0xf
      [<ffffffff8125e641>] kobject_add_varg+0x41/0x50
      [<ffffffff8125e70b>] kobject_add+0x64/0x66
      [<ffffffff8131122b>] device_add+0x12d/0x63a
      [<ffffffff814b65ea>] ? _raw_spin_unlock_irqrestore+0x47/0x56
      [<ffffffff8107de15>] ? module_refcount+0x89/0xa0
      [<ffffffff8132f348>] scsi_sysfs_add_sdev+0x4e/0x28a
      [<ffffffff8132dcbb>] do_scan_async+0x9c/0x145

    ...teach sas_rphy_remove to wait for async scanning to quiesce
    before removing the end_device.  It seems this is a more general
    problem [1], but this patch only addresses sas transport.

    [1]: 23edb6e [SCSI] mpt2sas: Do not set sas_device->starget to NULL
         from the slave_destroy callback when all the LUNS have been
         deleted

    Signed-off-by: Dan Williams <dan.j.williams@xxxxxxxxx>

commit 55c53f6aed389e9e789df8d8e65d728ac125dba1
Author: Dan Williams <dan.j.williams@xxxxxxxxx>
Date:   Tue Mar 20 10:50:27 2012 -0700

    libsas: fix false positive 'device attached' conditions

    Normalize phy->attached_sas_addr to return a zero-address in the
    case when device-type == NO_DEVICE or the linkrate is invalid to
    handle expanders that put non-zero sas addresses in the discovery
    response:

     sas: ex 5001b4da000f903f phy02:U:0 attached: 0100000000000000 (no device)
     sas: ex 5001b4da000f903f phy01:U:0 attached: 0100000000000000 (no device)
     sas: ex 5001b4da000f903f phy03:U:0 attached: 0100000000000000 (no device)
     sas: ex 5001b4da000f903f phy00:U:0 attached: 0100000000000000 (no device)

    Reported-by: Andrzej Jakowski <andrzej.jakowski@xxxxxxxxx>
    Signed-off-by: Dan Williams <dan.j.williams@xxxxxxxxx>

commit fcc1ce20ffbc553b25b6c635f4bb838940f58d2d
Author: Dan Williams <dan.j.williams@xxxxxxxxx>
Date:   Fri Apr 6 16:35:36 2012 -0700

    scsi: fix eh wakeup (scsi_schedule_eh vs scsi_restart_operations)

    Rapid ata hotplug on a libsas controller results in cases where
    libsas is waiting indefinitely on eh to perform an ata probe.

    A race exists between scsi_schedule_eh() and
    scsi_restart_operations() in the case when
    scsi_restart_operations() issues i/o to other devices in the sas
    domain.  When this happens the host state transitions from
    SHOST_RECOVERY (set by scsi_schedule_eh) back to SHOST_RUNNING and
    ->host_busy is non-zero so we put the eh thread to sleep even
    though ->host_eh_scheduled is active.

    Before putting the error handler to sleep we need to check if the
    host_state needs to return to SHOST_RECOVERY for another trip
    through eh.
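
For readers outside the SCSI midlayer, the re-check described in the eh wakeup fix can be sketched as a small userspace toy. This is only an illustration under stated assumptions: `toy_shost`, `toy_schedule_eh()` and `toy_restart_operations()` are hypothetical names, there is no locking, and the real patch operates on `struct Scsi_Host` under the host lock.

```c
#include <assert.h>

/* Hypothetical two-state model of the host; not the real scsi_host_state. */
enum toy_state { TOY_RUNNING, TOY_RECOVERY };

struct toy_shost {
	enum toy_state state;
	int eh_scheduled;	/* outstanding explicit eh requests */
};

/* Explicit eh request: record it and move the host into recovery. */
static void toy_schedule_eh(struct toy_shost *sh)
{
	sh->eh_scheduled++;
	sh->state = TOY_RECOVERY;
}

/*
 * End of an eh pass: resume i/o, then re-check for requests that arrived
 * in the meantime.  Without the re-check the host would stay in
 * TOY_RUNNING and the late request would wait indefinitely.
 */
static void toy_restart_operations(struct toy_shost *sh)
{
	sh->state = TOY_RUNNING;
	if (sh->eh_scheduled)
		sh->state = TOY_RECOVERY;
}
```

In this model a request that lands after the eh pass has consumed the earlier one, but before the restart, leaves the host in recovery for another trip through eh.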

    Cc: Tejun Heo <tj@xxxxxxxxxx>
    Reported-by: Tom Jackson <thomas.p.jackson@xxxxxxxxx>
    Tested-by: Tom Jackson <thomas.p.jackson@xxxxxxxxx>
    Signed-off-by: Dan Williams <dan.j.williams@xxxxxxxxx>

commit fcf62bdd26101fe6ae8760c5e9eb4d5e49e0a5ec
Author: Dan Williams <dan.j.williams@xxxxxxxxx>
Date:   Wed Mar 21 21:09:07 2012 -0700

    libsas, libata: fix start of life for a sas ata_port

    This changes the ordering of initialization and probing events from:
      1/ allocate rphy in PORTE_BYTES_DMAED, DISCE_REVALIDATE_DOMAIN
      2/ allocate ata_port and schedule port probe in DISCE_PROBE

    ...to:
      1/ allocate ata_port in PORTE_BYTES_DMAED, DISCE_REVALIDATE_DOMAIN
      2/ allocate rphy in PORTE_BYTES_DMAED, DISCE_REVALIDATE_DOMAIN
      3/ schedule port probe in DISCE_PROBE

    This ordering prevents PHYE_SIGNAL_LOSS_EVENTS from sneaking in to
    destroy ata devices before they have been fully initialized:

      BUG: unable to handle kernel paging request at 0000000000003b10
      IP: [<ffffffffa0053d7e>] sas_ata_end_eh+0x12/0x5e [libsas]
      ...
      [<ffffffffa004d1af>] sas_unregister_common_dev+0x78/0xc9 [libsas]
      [<ffffffffa004d4d4>] sas_unregister_dev+0x4f/0xad [libsas]
      [<ffffffffa004d5b1>] sas_unregister_domain_devices+0x7f/0xbf [libsas]
      [<ffffffffa004c487>] sas_deform_port+0x61/0x1b8 [libsas]
      [<ffffffffa004bed0>] sas_phye_loss_of_signal+0x29/0x2b [libsas]

    ...and kills the awkward "sata domain_device briefly existing in
    the domain without an ata_port" state.

    Reported-by: Michal Kosciowski <michal.kosciowski@xxxxxxxxx>
    Signed-off-by: Dan Williams <dan.j.williams@xxxxxxxxx>

commit cb7e940b56fc8a67a6a17bc7935268f7b128f90d
Author: Dan Williams <dan.j.williams@xxxxxxxxx>
Date:   Wed Mar 21 21:09:05 2012 -0700

    libata: make ata_print_id atomic

    This variable is incremented from multiple contexts (module_init
    via libata-lldds and the libsas discovery thread).  Make it atomic
    to head off any chance of libsas and libata creating duplicate ids.
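
The idea behind an atomic id counter can be shown with a minimal userspace sketch. It uses C11 `<stdatomic.h>` rather than the kernel's own atomic API, and `toy_print_id` and `toy_next_print_id()` are hypothetical names chosen for illustration, not libata symbols.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for a global id counter shared by several contexts. */
static atomic_int toy_print_id = 0;

/*
 * Hand out the next id.  atomic_fetch_add() returns the value held before
 * the add, so adding 1 gives the post-increment value.  The read-modify-write
 * is a single atomic step, so two concurrent callers cannot get the same id.
 */
static int toy_next_print_id(void)
{
	return atomic_fetch_add(&toy_print_id, 1) + 1;
}
```

With a plain `int` and `++`, two contexts could both read the old value and hand out the same id; the atomic read-modify-write closes that window.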

    Acked-by: Jacek Danecki <jacek.danecki@xxxxxxxxx>
    Signed-off-by: Dan Williams <dan.j.williams@xxxxxxxxx>

commit 6ec4dacc7c11b5999abe78f9a7e0125062b1d660
Author: Dan Williams <dan.j.williams@xxxxxxxxx>
Date:   Tue Mar 20 13:24:29 2012 -0700

    libsas: fix ata_eh clobbering ex_phys via smp_ata_check_ready

    The check_ready implementation in the expander-attached ata device
    case polls on sas_ex_phy_discover().  The effect is that the ex_phy
    fields (critically ->attached_sas_addr) can change.  When ata_eh
    ends and libsas comes along to revalidate the domain
    sas_unregister_devs_sas_addr() can fail to lookup devices to
    remove, or fail to re-add an ata device that ata_eh marked as
    disabled.  So change the code to skip the sas_address and change
    count updates when ata_eh is active.

    Cc: Jack Wang <jack_wang@xxxxxxxxx>
    Tested-by: Maciej Patelczyk <maciej.patelczyk@xxxxxxxxx>
    Tested-by: Bartek Nowakowski <bartek.nowakowski@xxxxxxxxx>
    Tested-by: Jacek Danecki <jacek.danecki@xxxxxxxxx>
    Signed-off-by: Dan Williams <dan.j.williams@xxxxxxxxx>

commit db25a56d901cfc259240d6b6cf999170d7f35fff
Author: Dan Williams <dan.j.williams@xxxxxxxxx>
Date:   Tue Mar 20 10:53:24 2012 -0700

    libsas: unify domain_device sas_rphy lifetimes

    Since the domain_device can outlive the scsi_target we need the
    rphy to follow suit, otherwise we run into issues like:

      BUG: unable to handle kernel NULL pointer dereference at 0000000000000050
      IP: [<ffffffffa011561b>] sas_ata_printk+0x43/0x6f [libsas]
      PGD 0
      Oops: 0000 [#1] SMP
      CPU 1
      Modules linked in: ses enclosure isci libsas scsi_transport_sas fuse sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf microcode pcspkr igb joydev iTCO_wdt ioatdma iTCO_vendor_support i2c_i801 i2c_core dca wmi hed ipv6 pata_acpi ata_generic [last unloaded: scsi_wait_scan]

      Pid: 129, comm: kworker/u:3 Not tainted 3.3.0-rc5-isci+ #1 Intel Corporation SandyBridge Platform/To be filled by O.E.M.
      RIP: 0010:[<ffffffffa011561b>]  [<ffffffffa011561b>] sas_ata_printk+0x43/0x6f [libsas]
      RSP: 0018:ffff88042232dd70  EFLAGS: 00010282
      RAX: 0000000000000000 RBX: ffff8804283165b8 RCX: ffff88042232dda0
      RDX: ffff88042232dd78 RSI: ffff8804283165b8 RDI: ffffffffa01188d7
      RBP: ffff88042232ddd0 R08: ffff880388454000 R09: ffff8803edfde1f8
      R10: ffff8803edfde1f8 R11: ffff8803edfde1f8 R12: ffff880428316750
      R13: ffff880388454000 R14: ffff8803f88b31d0 R15: ffff8803f8b21d50
      FS:  0000000000000000(0000) GS:ffff88042ee20000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      CR2: 0000000000000050 CR3: 0000000001a05000 CR4: 00000000000406e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Process kworker/u:3 (pid: 129, threadinfo ffff88042232c000, task ffff88042230c920)
      Stack:
       0000000000000000 ffff880400000018 ffff88042232dde0 ffff88042232dda0
       ffffffffa01188c4 ffff88042ee93af0 ffff88042232ddb0 ffffffff8100e047
       ffff88042232de10 ffff880420e5a2c8 ffff8803f8b21d50 ffff8803edfde1f8
      Call Trace:
       [<ffffffff8100e047>] ? load_TLS+0xb/0xf
       [<ffffffffa01156ad>] async_sas_ata_eh+0x66/0x95 [libsas]
       [<ffffffff810655e1>] async_run_entry_fn+0x9e/0x131

    Reported-by: Tom Jackson <thomas.p.jackson@xxxxxxxxx>
    Tested-by: Tom Jackson <thomas.p.jackson@xxxxxxxxx>
    Signed-off-by: Dan Williams <dan.j.williams@xxxxxxxxx>

commit 6be254f019fd8dadc63cc63ded75d2422e2057b7
Author: Dan Williams <dan.j.williams@xxxxxxxxx>
Date:   Mon Mar 12 11:38:26 2012 -0700

    libsas: fix sas_get_port_device regression

    Commit 899fcf4 "[SCSI] libsas: set attached device type and target
    protocols for local phys" set up 'phy' to be dereferenced after
    list_for_each_entry(phy, &port->phy_list, port_phy_el) (i.e.
    phy == &port->phy_list) resulting in reports like:

      BUG: unable to handle kernel NULL pointer dereference at 00000000000002b0
      IP: [<ffffffffa00ce948>] sas_discover_domain+0x29e/0x4fb [libsas]

    ...fix by deferring sas_phy_set_target() to the end of
    sas_get_port_device().

    Reported-by: Tom Jackson <thomas.p.jackson@xxxxxxxxx>
    Tested-by: Tom Jackson <thomas.p.jackson@xxxxxxxxx>
    Signed-off-by: Dan Williams <dan.j.williams@xxxxxxxxx>

commit 71cb71d183256fbe77f35558606989c8f47c4ff0
Author: Thomas Jackson <thomas.p.jackson@xxxxxxxxx>
Date:   Fri Feb 17 18:33:10 2012 -0800

    libsas: fix sas_find_bcast_phy() in the presence of 'vacant' phys

    If an expander reports 'PHY VACANT' for a phy index prior to the
    one that generated a BCN libsas fails rediscovery.  Since a vacant
    phy is defined as a valid phy index that will never have an
    attached device just continue the search.

    Cc: <stable@xxxxxxxxxxxxxxx>
    Signed-off-by: Thomas Jackson <thomas.p.jackson@xxxxxxxxx>
    Signed-off-by: Dan Williams <dan.j.williams@xxxxxxxxx>

commit 705885cb7b906ebddafbaedd693c355f8350ac4e
Author: Dan Williams <dan.j.williams@xxxxxxxxx>
Date:   Thu Mar 1 18:44:25 2012 -0800

    libata, libsas: introduce sched_eh and end_eh port ops

    When managing shost->host_eh_scheduled libata assumes that there is
    a 1:1 shost-to-ata_port relationship.  libsas creates a 1:N
    relationship so it needs to manage host_eh_scheduled cumulatively
    at the host level.  The sched_eh and end_eh port ops allow libsas
    to track when domain devices enter/leave the "eh-pending" state
    under ha->lock (previously named ha->state_lock, but it is no
    longer just a lock for ha->state changes).

    Since host_eh_scheduled indicates eh without backing commands
    pinning the device, it can be deallocated at any time.  Move the
    taking of the domain_device reference under the port_lock to
    guarantee that the ata_port stays around for the duration of eh.
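
The cumulative, host-level accounting described for sched_eh and end_eh can be pictured with a toy model. Everything here is an assumption made for illustration: `toy_host`, `toy_dev` and the three helpers are hypothetical, and the sketch is single-threaded, whereas the commit message says the real tracking happens under ha->lock.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins; these are not the real libsas structures. */
struct toy_host {
	int eh_pending_count;	/* devices currently waiting for eh */
};

struct toy_dev {
	struct toy_host *host;
	bool eh_pending;	/* per-device "eh-pending" flag */
};

/* A device asks for eh; repeated requests from one device count once. */
static void toy_sched_eh(struct toy_dev *dev)
{
	if (!dev->eh_pending) {
		dev->eh_pending = true;
		dev->host->eh_pending_count++;
	}
}

/* eh has finished for this device; drop its contribution, if any. */
static void toy_end_eh(struct toy_dev *dev)
{
	if (dev->eh_pending) {
		dev->eh_pending = false;
		dev->host->eh_pending_count--;
	}
}

/* With N devices per host, eh stays scheduled while any one is pending. */
static bool toy_host_eh_scheduled(const struct toy_host *host)
{
	return host->eh_pending_count > 0;
}
```

The per-device flag is what keeps the host count balanced: with a 1:1 mapping a single flag on the host would do, but with 1:N one device finishing eh must not clear the request of another.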

    Cc: Tejun Heo <tj@xxxxxxxxxx>
    Acked-by: Jacek Danecki <jacek.danecki@xxxxxxxxx>
    Signed-off-by: Dan Williams <dan.j.williams@xxxxxxxxx>

commit 3c1dbbd2529c659745c047c449037e4f94d326cb
Author: Maciej Trela <maciej.trela@xxxxxxxxx>
Date:   Sun Mar 4 17:58:55 2012 -0800

    libsas: cleanup spurious calls to scsi_schedule_eh

    eh is woken up automatically by the presence of failed commands;
    scsi_schedule_eh is reserved for cases where there are no failed
    commands.  This guarantees that host_eh_scheduled is only
    incremented when an explicit eh request is made.

    Reviewed-by: Jacek Danecki <jacek.danecki@xxxxxxxxx>
    Signed-off-by: Maciej Trela <maciej.trela@xxxxxxxxx>
    [fixed spurious delete of sas_ata_task_abort]
    Signed-off-by: Artur Wojcik <artur.wojcik@xxxxxxxxx>
    Signed-off-by: Dan Williams <dan.j.williams@xxxxxxxxx>

commit 63494f1cc2022fd9271c0af3399df3bc7dbec55c
Author: Dan Williams <dan.j.williams@xxxxxxxxx>
Date:   Fri Mar 9 11:00:06 2012 -0800

    libsas: introduce sas_work to fix sas_drain_work vs sas_queue_work

    When requeuing work to a draining workqueue the last work instance
    may not be idle, so sas_queue_work() must not touch work->entry.
    Introduce sas_work with a drain_node list_head to have a private
    list for collecting work deferred due to drain collision.

    Fixes reports like:
      BUG: unable to handle kernel NULL pointer dereference at (null)
      IP: [<ffffffff810410d4>] process_one_work+0x2e/0x338

    Signed-off-by: Dan Williams <dan.j.williams@xxxxxxxxx>
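
The "private list for work deferred due to drain collision" idea from the sas_work commit can be sketched in userspace as follows. This is a toy under stated assumptions, not the libsas code: `toy_work`, `toy_ha` and the helpers are hypothetical, a singly linked list stands in for the kernel list_head, and `toy_submit()` stands in for handing work to the real workqueue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy types; the real sas_work wraps a kernel work item. */
struct toy_work {
	struct toy_work *drain_next;	/* private link, used only by the defer list */
	bool deferred;			/* already parked on the defer list? */
	int runs;			/* times handed to the "workqueue" */
};

struct toy_ha {
	bool draining;
	struct toy_work *defer_head;	/* work held back during a drain */
};

/* Stand-in for queueing on the real workqueue. */
static void toy_submit(struct toy_work *w)
{
	w->runs++;
}

/*
 * While a drain is in progress, park the work on the private list instead
 * of touching any state the workqueue itself may still be using.
 */
static void toy_queue_work(struct toy_ha *ha, struct toy_work *w)
{
	if (!ha->draining) {
		toy_submit(w);
		return;
	}
	if (!w->deferred) {
		w->deferred = true;
		w->drain_next = ha->defer_head;
		ha->defer_head = w;
	}
}

/* After the drain completes, resubmit everything that was held back. */
static void toy_drain_done(struct toy_ha *ha)
{
	ha->draining = false;
	while (ha->defer_head) {
		struct toy_work *w = ha->defer_head;

		ha->defer_head = w->drain_next;
		w->drain_next = NULL;
		w->deferred = false;
		toy_submit(w);
	}
}
```

The point of the separate link field is that the deferral path never writes to the field the workqueue uses for its own bookkeeping. In this toy the deferred items are resubmitted in reverse order of arrival, which is a simplification of the sketch rather than a statement about libsas.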