Re: [PATCH 0/6] qla2xxx device removal fixups

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Joe,

Thanks for the patches. We will review and update.

-- Giri

On 6/18/14 7:02 AM, "Joe Lawrence" <joe.lawrence@xxxxxxxxxxx> wrote:

>Hi Chad, Giri, et al.
>
>Stratus has been testing the upstream qla2xxx driver against surprise
>device removal and has found a few minor issues along the way.  With
>this patchset, results have been good.  Although the following changes
>may be most interesting on a Stratus platform, they should be applicable
>to general hotplug scenarios on other hardware.
>
>The first two patches are independent from the others and can be
>reviewed as such.  One fixes up a use-after-free and the other
>consolidates some code.
>
>The final four patches are more invasive and modify the scsi_qla_host
>structure to avoid device removal race conditions.  These changes were
>written to demonstrate the problem at hand and minimally fix them in
>order to continue testing.  (For example, adding completely a new
>pci_flags member to scsi_qla_host is probably overkill.)
>
>We would welcome any feedback or questions about our testing.
>
>Regards,
>
>-- Joe
>
>
>Joe Lawrence (6):
>
>  qla2xxx: Fix shost use-after-free on device removal
>
>    scsi_qla_host *base_vha is allocated as part of the Scsi_Host
>    private hostdata[] by qla2x00_create_host.  When the last reference
>    on the host is put by qla2x00_remove_one, it's released along with
>    any hostdata[], rendering base_vha unsafe to dereference.
>      
>    To safely complete the adapter cleanup in qla2x00_remove_one,
>    use the scsi_qla_host_t *qla_hw_data pointer that is already in
>    hand.
>      
>    To reproduce, set kernel boot option slub_debug=FZPU and remove
>    an adapter instance.
>
>
>  qla2xxx: Use qla2x00_clear_drv_active on probe failure
>
>    Clean up some duplicate code along the way.
>
>
>  qla2xxx: Collect PCI register checks and board_disable scheduling
>
>    PCI register read checks are performed throughout the driver to
>    detect disconnected hardware.  There is currently a function that
>    verifies 32-bit values and schedules board_removal if needed. Other
>    16-bit registers are checked and board_removal scheduling scattered
>    around the driver.
>
>    This change pulls all of these together such that a new 16-bit
>    register check function wrappers the existing 32-bit version and
>    centralizes the scheduling of board_disable work to a single
>    invocation.
>
>    The following patches depend upon this change.
>
>
>  qla2xxx: Schedule board_disable only once
>
>    In automated hot-plug device removal testing, occasionally
>    qla2x00_disable_board_on_pci_error would run twice.  I believe this
>    occurred by a second qla2x00_check_reg_for_disconnect scheduling
>    board_disable while the first instance was still running.
>
>    The easiest solution seemed to be adding a per-adapter flag that
>    could be test_and_set before scheduling board_disable.
>
>
>  qla2xxx: Prevent removal and board_disable race
>
>    This race was discovered through many iterations of automated PCI
>    hotplug.  The automated test inserts faults on the Stratus platform
>    to simulate device removal.  After removal, it re-adds the devices,
>    simulating PCI hotplug insertion.  Rinse, repeat.
>
>    The race occurs when a Stratus proprietary hotplug driver detects
>    that PCI devices have gone missing and calls into the kernel PCI
>    subsystem to remove them.  At nearly the same time, the qla2xxx
>    driver figures out the same issue and schedules a board_disable.  It
>    takes a few hundred device removals to hit, but it seemed that the
>    problem was that qla2x00_disable_board_on_pci_error started, then
>    qla2xxx_remove_one was invoked, so one of the two started touching
>    already freed resources.
>
>    Although the bug manifested on a Stratus platform running modified
>    drivers, the window for qla2xxx_remove_one to race
>    qla2x00_disable_board_on_pci_error is still present if running stock
>    kernel/drivers on commodity hardware.  It may be difficult to hit,
>    but one could theoretically invoke PCI device removal via sysfs (or
>    simply unload the qla2xxx driver) and for board_disable to run at
>    the same time.
>
>    The fix leaves the Scsi_Host intact during board_disable so that
>    qla2x00_remove_one can safely stop the main timer and
>    cancel_work_sync on any outstanding board_disable work.  A
>    PFLG_DRIVER_REMOVING bit is also set to prevent any other
>    board_disable instances from scheduling.  Once the asynchronous
>    players are out of the way, qla2x00_remove_one can move forward
>    cleaning up whatever remaining resources are still attached.
>
>    The PCI device enable count check in qla2x00_remove_one was
>    re-purposed to determine if board_disable had run.  Its original
>    intention to detect qla2x00_probe_one failure was invalid.  As long
>    as the driver .probe routine returns an -ERRNO, then its counterpart
>    .remove routine is *not* called.
>
>
>  qla2xxx: Prevent probe and board_disable race
>
>    An automated Stratus device removal test hotplug removes devices
>    shortly during/after their initialization.  The test inserts
>    devices, then proceeds to remove them at various 1, 2, 3, 4...
>    second intervals.  This test revealed two bugs:
>
>    1 - The struct isp_operations initialize_adapter callback is invoked
>    by qla2x00_probe_one before ha->board_disable is initialized. The
>    callback will eventually read PCI registers and potentially trigger
>    the disconnection check.  If the board_disable work structure hasn't
>    been initialized, an ugly WARN trace will be emitted. The call stack
>    looked something like:
>
>      qla2x00_probe_one
>        qla2x00_initialize_adapter
>          qla2xxx_get_flash_info
>            qla2xxx_get_flt_info
>              qla25xx_read_optrom_data
>                 qla2x00_dump_ram
>                   qla2x00_mailbox_command
>                     qla24xx_intr_handler
>                       qla2x00_check_reg_for_disconnect
>
>    2 - Races between the qla2x00_disable_board_on_pci_error and
>    qla2x00_probe_one. The latter was written before any asynchronous
>    device removals were introduced by commit f3ddac1 "qla2xxx: Disable
>    adapter when we encounter a PCI disconnect," and proceeds as if it
>    were the only game in town with respect to the underlying device.
>    Restore that assumption by preventing any board_disable scheduling
>    while qla2x00_probe_one is in flight.  Holding off this scheduling
>    prevents the race, as well as avoids bug #1.
>
> drivers/scsi/qla2xxx/qla_def.h |    5 ++++
> drivers/scsi/qla2xxx/qla_gbl.h |    3 +-
> drivers/scsi/qla2xxx/qla_isr.c |   44 +++++++++++++++--------------
> drivers/scsi/qla2xxx/qla_mr.c  |    2 +-
> drivers/scsi/qla2xxx/qla_nx.c  |    6 ++--
> drivers/scsi/qla2xxx/qla_os.c  |   60
>++++++++++++++++++----------------------
> 6 files changed, 61 insertions(+), 59 deletions(-)
>
>-- 
>1.7.10.4
>

<<attachment: winmail.dat>>


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Index of Archives]     [SCSI Target Devel]     [Linux SCSI Target Infrastructure]     [Kernel Newbies]     [IDE]     [Security]     [Git]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux ATA RAID]     [Linux IIO]     [Samba]     [Device Mapper]
  Powered by Linux