Re: mvsas mvs_abort_task() disk hosed


 



It seems some of the WD green disks have a really high Load_Cycle_Count:

        Model Number:       WDC WD15EADS-00P8B0
        Serial Number:      WD-WCAVU0220972
        Firmware Revision:  01.00A01
193 Load_Cycle_Count        0x0032   167   167   000    Old_age   Always       -       99576

They park their heads after only 8 secs, which was also reported here:
http://kerneltrap.org/mailarchive/linux-kernel/2008/4/10/1396844

When software raid is doing a recovery and the XFS filesystem on top of
the raid6 is being populated with big HDTV files, the Load_Cycle_Count
goes up. It appears the drive parks its heads sooner than the interval
between flushes.
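
To watch how fast the counter climbs, the raw value can be pulled out of
`smartctl -A` output and compared between two readings. A minimal sketch,
assuming the usual ten-column attribute table that smartctl prints; the
function names are made up for illustration:

```python
def load_cycle_count(smartctl_output: str) -> int:
    """Return the raw value of SMART attribute 193 from `smartctl -A` text."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows have the columns:
        # ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[0] == "193":
            return int(fields[9])
    raise ValueError("attribute 193 (Load_Cycle_Count) not found")


def cycles_per_hour(before: int, after: int, elapsed_seconds: float) -> float:
    """Load cycles per hour between two readings taken elapsed_seconds apart."""
    return (after - before) * 3600.0 / elapsed_seconds


# The attribute row reported above for the WD15EADS-00P8B0.
sample = ("193 Load_Cycle_Count        0x0032   167   167   000    "
          "Old_age   Always       -       99576")
print(load_cycle_count(sample))
```

Taking one reading before and one after a rebuild or a large copy would show
whether the count really rises under that load.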

Version 00P8B0 of the drive also seems to be blacklisted at Synology:
http://forum.synology.com/enu/viewtopic.php?f=124&t=9412

WD15EADS-00P8B0 1.5TB -- NOT SUGGESTED

Three of my drives are WD15EADS-00S2B0 and do not seem affected.

I'm currently testing with
hdparm -S 252 /dev/sdx

on all of my WD drives. It seems to solve my problems for the moment
(but longer stress testing is needed to be 100% sure).
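
For reference, the value given to `hdparm -S` is an encoded standby
(spin-down) timeout, not a number of seconds. The sketch below decodes it
following the encoding described in the hdparm man page; the function name is
made up for illustration. Whether this standby timer also influences the
8-second head parking is exactly what the test above is meant to find out.

```python
from typing import Optional


def standby_timeout_seconds(value: int) -> Optional[int]:
    """Decode an `hdparm -S` value into seconds.

    Returns None where there is no fixed timeout: 0 (timer disabled),
    253 (vendor-defined, between 8 and 12 hours) and 254 (reserved).
    """
    if not 0 <= value <= 255:
        raise ValueError("hdparm -S takes a value from 0 to 255")
    if value == 0:
        return None                      # standby timer disabled
    if value <= 240:
        return value * 5                 # units of 5 seconds
    if value <= 251:
        return (value - 240) * 30 * 60   # units of 30 minutes
    if value == 252:
        return 21 * 60                   # 21 minutes
    if value == 255:
        return 21 * 60 + 15              # 21 minutes plus 15 seconds
    return None                          # 253 or 254


print(standby_timeout_seconds(252))
```

So `-S 252` asks the drive for a standby timeout of 21 minutes.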

Is it normal behavior that the Marvell driver will time out after a
high Load_Cycle_Count increment? Or is the disk going nuts by
continuously doing a load cycle?
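
One thing that can be checked alongside this is the SCSI layer's own
per-device command timeout, which Linux exposes in sysfs as
/sys/block/<dev>/device/timeout (in seconds). A minimal sketch; the function
name and the `sysfs_root` parameter are made up for illustration, and this is
the generic SCSI timeout, which is not necessarily the same as any timeout
used inside mvsas itself:

```python
from pathlib import Path


def scsi_command_timeout(device: str, sysfs_root: str = "/sys") -> int:
    """Return the SCSI command timeout in seconds for e.g. device="sdj".

    sysfs_root is only a parameter so the function can be pointed at a
    copy of the tree; on a real system the default "/sys" is what you want.
    """
    path = Path(sysfs_root) / "block" / device / "device" / "timeout"
    return int(path.read_text().strip())
```

Calling `scsi_command_timeout("sdj")` on the affected machine would show the
value in effect for that disk.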


On Fri, Feb 5, 2010 at 12:46 PM, Audio Haven <audiohaven@xxxxxxxxx> wrote:
> I have 8 1.5T WD Green drives in a software raid6. While the initial
> sync of the raid6 works well using the mvsas driver (when using the
> patches from Andy Yan; without them it always crashed during the initial
> sync), I'm getting lots of mvs_abort_task errors when filling up
> a 9T xfs filesystem on top of the raid device /dev/md2 with large HDTV
> data.
>
> mvs_abort_task() mvi=ffff880093700000 task=ffff88007ea77380
>                 slot=ffff880093724618 slot_idx=x4
> mvs_abort_task() mvi=ffff880093700000 task=ffff88007ea77700
>                 slot=ffff880093724670 slot_idx=x5
> mvs_abort_task() mvi=ffff880093700000 task=ffff88007ea77c40
>                 slot=ffff8800937246c8 slot_idx=x6
> mvs_abort_task() mvi=ffff880093700000 task=ffff88007ea778c0
>                 slot=ffff880093724720 slot_idx=x7
> mvs_abort_task() mvi=ffff880093700000 task=ffff8800853e78c0
>                 slot=ffff880093724988 slot_idx=xe
> mvs_abort_task() mvi=ffff880093700000 task=ffff8800853e78c0
>                 slot=ffff8800937244b8 slot_idx=x0
> mvs_abort_task() mvi=ffff880093700000 task=ffff8800853e7a80
>                 slot=ffff880093724510 slot_idx=x1
> mvs_abort_task() mvi=ffff880093700000 task=ffff8800853e7000
>                 slot=ffff880093724568 slot_idx=x2
> mvs_abort_task() mvi=ffff880093700000 task=ffff8800853e7380
>                 slot=ffff8800937245c0 slot_idx=x3
> mvs_abort_task() mvi=ffff880093700000 task=ffff88007ea77700
>                 slot=ffff880093724930 slot_idx=xd
> mvs_abort_task() mvi=ffff880093700000 task=ffff8800853e7540
>                 slot=ffff880093724a90 slot_idx=x11
> mvs_abort_task() mvi=ffff880093700000 task=ffff88007303b000
>                 slot=ffff880093724ae8 slot_idx=x12
> mvs_abort_task() mvi=ffff880093700000 task=ffff88007303b380
>                 slot=ffff8800937244b8 slot_idx=x0
> mvs_abort_task() mvi=ffff880093700000 task=ffff88007303b000
>                 slot=ffff880093724510 slot_idx=x1
> mvs_abort_task() mvi=ffff880093700000 task=ffff88007303b540
>                 slot=ffff880093724568 slot_idx=x2
> mvs_abort_task() mvi=ffff880093700000 task=ffff880025182000
>                 slot=ffff8800937245c0 slot_idx=x3
> mvs_abort_task() mvi=ffff880093700000 task=ffff880025182380
>                 slot=ffff880093724510 slot_idx=x1
> mvs_abort_task() mvi=ffff880093700000 task=ffff8800251828c0
>                 slot=ffff8800937245c0 slot_idx=x3
> mvs_abort_task() mvi=ffff880093700000 task=ffff880025182700
>                 slot=ffff880093724670 slot_idx=x5
> mvs_abort_task() mvi=ffff880093700000 task=ffff8800251821c0
>                 slot=ffff8800937246c8 slot_idx=x6
> mvs_abort_task() mvi=ffff880093700000 task=ffff8800251821c0
>                 slot=ffff8800937244b8 slot_idx=x0
> mvs_abort_task() mvi=ffff880093700000 task=ffff8800251828c0
>                 slot=ffff880093724568 slot_idx=x2
> mvs_abort_task() mvi=ffff880093700000 task=ffff880025182700
>                 slot=ffff880093724510 slot_idx=x1
> mvs_abort_task() mvi=ffff880093700000 task=ffff8800251828c0
>                 slot=ffff8800937244b8 slot_idx=x0
> mvs_abort_task() mvi=ffff880093700000 task=ffff8800251828c0
>                 slot=ffff8800937244b8 slot_idx=x0
> mvs_abort_task() mvi=ffff880093700000 task=ffff8800251828c0
>                 slot=ffff8800937244b8 slot_idx=x0
> mvs_abort_task() mvi=ffff880093700000 task=ffff880025182380
>                 slot=ffff880093724510 slot_idx=x1
> mvs_abort_task() mvi=ffff880093700000 task=ffff880025182000
>                 slot=ffff880093724568 slot_idx=x2
> mvs_abort_task() mvi=ffff880093700000 task=ffff88004b61de00
>                 slot=ffff880093724a90 slot_idx=x11
> mvs_abort_task() mvi=ffff880093700000 task=ffff88007303b000
>                 slot=ffff880093724ae8 slot_idx=x12
> mvs_abort_task() mvi=ffff880093700000 task=ffff88007303bc40
>                 slot=ffff880093724d50 slot_idx=x19
> mvs_abort_task() mvi=ffff880093700000 task=ffff8800400c5a80
>                 slot=ffff880093724618 slot_idx=x4
> mvs_abort_task() mvi=ffff880093700000 task=ffff8800400c5c40
>                 slot=ffff880093724880 slot_idx=xb
> mvs_abort_task() mvi=ffff880093700000 task=ffff8800400c5a80
>                 slot=ffff8800937244b8 slot_idx=x0
> mvs_abort_task() mvi=ffff880093700000 task=ffff8800400c5c40
>                 slot=ffff880093724510 slot_idx=x1
> mvs_abort_task() mvi=ffff880093700000 task=ffff8800400c5540
>                 slot=ffff8800937245c0 slot_idx=x3
> mvs_abort_task() mvi=ffff880093700000 task=ffff8800400c51c0
>                 slot=ffff880093724568 slot_idx=x2
> mvs_abort_task() mvi=ffff880093700000 task=ffff88007ea771c0
>                 slot=ffff8800937244b8 slot_idx=x0
> mvs_abort_task() mvi=ffff880093700000 task=ffff88007ea77000
>                 slot=ffff880093724510 slot_idx=x1
> mvs_abort_task() mvi=ffff880093700000 task=ffff88007ea77380
>                 slot=ffff880093724568 slot_idx=x2
> mvs_abort_task() mvi=ffff880093700000 task=ffff88007ea77a80
>                 slot=ffff8800937246c8 slot_idx=x6
> mvs_abort_task() mvi=ffff880093700000 task=ffff8800400c5000
>                 slot=ffff8800937247d0 slot_idx=x9
> mvs_abort_task() mvi=ffff880093700000 task=ffff8800400c5540
>                 slot=ffff880093724778 slot_idx=x8
> mvs_abort_task() mvi=ffff880093700000 task=ffff8800400c5700
>                 slot=ffff880093724568 slot_idx=x2
> mvs_abort_task() mvi=ffff880093700000 task=ffff8800400c58c0
>                 slot=ffff880093724618 slot_idx=x4
> mvs_abort_task() mvi=ffff880093700000 task=ffff8800400c5c40
>                 slot=ffff880093724670 slot_idx=x5
> mvs_abort_task() mvi=ffff880093700000 task=ffff8800400c5e00
>                 slot=ffff8800937246c8 slot_idx=x6
> mvs_abort_task() mvi=ffff880093700000 task=ffff8800400c5000
>                 slot=ffff8800937244b8 slot_idx=x0
> mvs_abort_task() mvi=ffff880093700000 task=ffff8800400c5e00
>                 slot=ffff880093724510 slot_idx=x1
> mvs_abort_task() mvi=ffff880093700000 task=ffff8800400c5000
>                 slot=ffff8800937244b8 slot_idx=x0
> mvs_abort_task() mvi=ffff880093700000 task=ffff8800400c58c0
>                 slot=ffff8800937245c0 slot_idx=x3
> mvs_abort_task() mvi=ffff880093700000 task=ffff8800400c5000
>                 slot=ffff8800937244b8 slot_idx=x0
> mvs_abort_task() mvi=ffff880093700000 task=ffff8800400c5e00
>                 slot=ffff880093724568 slot_idx=x2
> mvs_abort_task() mvi=ffff880093700000 task=ffff8800400c5380
>                 slot=ffff880093724720 slot_idx=x7
> mvs_abort_task() mvi=ffff880093700000 task=ffff8800400c51c0
>                 slot=ffff880093724778 slot_idx=x8
> mvs_abort_task() mvi=ffff880093700000 task=ffff88004b61de00
>                 slot=ffff880093724510 slot_idx=x1
> mvs_abort_task() mvi=ffff880093700000 task=ffff88004b61d700
>                 slot=ffff8800937245c0 slot_idx=x3
> mvs_abort_task() mvi=ffff880093700000 task=ffff88004b61d8c0
>                 slot=ffff880093724618 slot_idx=x4
> mvs_abort_task() mvi=ffff880093700000 task=ffff88004b61dc40
>                 slot=ffff880093724670 slot_idx=x5
> mvs_abort_task() mvi=ffff880093700000 task=ffff88004b61dc40
>                 slot=ffff8800937244b8 slot_idx=x0
> mvs_abort_task() mvi=ffff880093700000 task=ffff88004b61d700
>                 slot=ffff880093724568 slot_idx=x2
> mvs_abort_task() mvi=ffff880093700000 task=ffff88004b61d8c0
>                 slot=ffff880093724510 slot_idx=x1
> mvs_abort_task() mvi=ffff880093700000 task=ffff88004b61d1c0
>                 slot=ffff8800937246c8 slot_idx=x6
> mvs_abort_task() mvi=ffff880093700000 task=ffff8800400c5700
>                 slot=ffff8800937244b8 slot_idx=x0
> mvs_abort_task() mvi=ffff880093700000 task=ffff8800400c58c0
>                 slot=ffff880093724510 slot_idx=x1
> mvs_abort_task() mvi=ffff880093700000 task=ffff8800400c51c0
>                 slot=ffff8800937245c0 slot_idx=x3
> mvs_abort_task() mvi=ffff880093700000 task=ffff8800400c5540
>                 slot=ffff880093724568 slot_idx=x2
> mvs_abort_task() mvi=ffff880093700000 task=ffff88004b61d000
>                 slot=ffff8800937244b8 slot_idx=x0
> mvs_abort_task() mvi=ffff880093700000 task=ffff88004b61d540
>                 slot=ffff880093724510 slot_idx=x1
> mvs_abort_task() mvi=ffff880093700000 task=ffff88004b61d1c0
>                 slot=ffff8800937245c0 slot_idx=x3
> mvs_abort_task() mvi=ffff880093700000 task=ffff88004b61dc40
>                 slot=ffff880093724720 slot_idx=x7
> mvs_abort_task() mvi=ffff880093700000 task=ffff8800400c51c0
>                 slot=ffff880093724670 slot_idx=x5
> mvs_abort_task() mvi=ffff880093700000 task=ffff8800400c5000
>                 slot=ffff8800937246c8 slot_idx=x6
> mvs_abort_task() mvi=ffff880093700000 task=ffff8800400c5c40
>                 slot=ffff8800937244b8 slot_idx=x0
> mvs_abort_task() mvi=ffff880093700000 task=ffff8800400c5380
>                 slot=ffff880093724618 slot_idx=x4
> sd 13:0:6:0: [sdj] Unhandled error code
> sd 13:0:6:0: [sdj] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
> sd 13:0:6:0: [sdj] CDB: Read(10): 28 00 4f 14 f4 1f 00 00 08 00
> end_request: I/O error, dev sdj, sector 1326773279
> raid5:md2: read error not correctable (sector 1326773216 on sdj1).
> raid5: Disk failure on sdj1, disabling device.
> raid5: Operation continuing on 6 devices.
> mvs_abort_task() mvi=ffff880093700000 task=ffff88007ea771c0
>                 slot=ffff880093724510 slot_idx=x1
> mvs_abort_task() mvi=ffff880093700000 task=ffff88007ea77e00
>                 slot=ffff880093724778 slot_idx=x8
> mvs_abort_task() mvi=ffff880093700000 task=ffff88001b008000
>                 slot=ffff8800937247d0 slot_idx=x9
> sd 13:0:6:0: [sdj] Unhandled error code
> sd 13:0:6:0: [sdj] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
> sd 13:0:6:0: [sdj] CDB: Read(10): 28 00 4f 14 f4 37 00 00 70 00
> end_request: I/O error, dev sdj, sector 1326773303
> raid5:md2: read error not correctable (sector 1326773240 on sdj1).
> raid5:md2: read error not correctable (sector 1326773248 on sdj1).
> raid5:md2: read error not correctable (sector 1326773256 on sdj1).
>
> My system is Fedora 12  x86_64 using a 2.6.32.3 kernel with the
> following patches from Andy Yan:
>
> [PATCH 6/7]MVSAS: Enhanced hot plug handling
> [PATCH 5/7]MVSAS:Optimization for DMA buffer
> [PATCH 4/7]MVSAS:Make code more flexibe for different chip model.
> [PATCH 3/7]MVSAS: bug fix with big endian
> [PATCH 2/7]MVSAS:add supporting MSI feature
> [PATCH 1/7]MVSAS: Update chip initialization
>
> I did not use [PATCH 7/7]
>
> The only solution is to reboot the system, as /dev/sdj is completely
> hosed: all I/O to this disk is stalled. The issue also happened on
> /dev/sdi. That time I just rebooted and re-added /dev/sdi1 to the raid
> set, and it rebuilt correctly. None of these disks report any SMART
> errors (they all report healthy), and no sector remapping occurs, as the
> raw value of Reallocated_Sector_Ct is zero for all of these disks.
>
> I previously had a faulty 1.5T WD Green drive with a high raw
> Reallocated_Sector_Ct, which was swapped under warranty (so I know what
> to expect from smartctl).
>
> I'm not an expert, but I'm thinking in the following direction: timeouts.
>
> Maybe the timeouts in the driver are too low? These drives don't
> feature TLER (time-limited error recovery), which means I/O could be
> stalled for 2 minutes while the disk is doing background work, but
> on consumer disks that is no reason to mark the disk dead. Is it
> correct that mvsas only allows 20 seconds for a task to complete?
> Shouldn't this be higher? However, smartctl does not show this
> background process being triggered.
>
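
As a cross-check on the quoted log above, the two Read(10) CDBs can be decoded
by hand: bytes 2-5 are the big-endian LBA and bytes 7-8 the transfer length in
blocks. A minimal sketch; `decode_read10` is a made-up helper, and the
63-sector partition offset is only inferred from the difference between the
sdj and sdj1 sector numbers in the log, not read from a partition table:

```python
def decode_read10(cdb_hex: str):
    """Return (lba, block_count) for a SCSI Read(10) CDB given as hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a 10-byte Read(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")      # logical block address
    blocks = int.from_bytes(cdb[7:9], "big")   # transfer length
    return lba, blocks


# Assumption: sdj1 starts at sector 63, inferred from the log itself.
PARTITION_START = 63

for cdb in ("28 00 4f 14 f4 1f 00 00 08 00",
            "28 00 4f 14 f4 37 00 00 70 00"):
    lba, blocks = decode_read10(cdb)
    print(lba, blocks, lba - PARTITION_START)
```

The decoded LBAs are 1326773279 and 1326773303, matching the end_request
lines, and subtracting 63 gives 1326773216 and 1326773240, the first sectors
md reports as uncorrectable on sdj1 after each timeout.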
